DETAILED ACTION
This is the initial Office action based on the application filed on March 21, 2024.
Claims 1-18 are pending.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Drawings
The drawings are objected to because Figures 6-9 are blurry and hard to read. Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. The figure or figure number of an amended drawing should not be labeled as “amended.” If a drawing figure is to be canceled, the appropriate figure must be removed from the replacement sheet, and where necessary, the remaining figures must be renumbered and appropriate changes made to the brief description of the several views of the drawings for consistency. Additional replacement sheets may be necessary to show the renumbering of the remaining figures. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.
Specification
The disclosure is objected to because of the following informalities:
Paragraph [0044], line 6, recites “and/or processor 110.” It should read -- and/or processor 104 --.
Appropriate correction is required.
Claim Objections
Claim 13 is objected to because of the following informalities:
Claim 13, line 3, recites “included in the first instruction with respect to the first task.” It should read -- included in the first set of instructions with respect to the first task --.
Appropriate correction is required.
Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The filing of a terminal disclaimer by itself is not a complete reply to a nonstatutory double patenting (NSDP) rejection. A complete reply requires that the terminal disclaimer be accompanied by a reply requesting reconsideration of the prior Office action. Even where the NSDP rejection is provisional, the reply must be complete. See MPEP § 804, subsection I.B.1. For a reply to a non-final Office action, see 37 CFR 1.111(a). For a reply to a final Office action, see 37 CFR 1.113(c). A request for reconsideration, while not provided for in 37 CFR 1.113(c), may be filed after final for consideration. See MPEP §§ 706.07(e) and 714.13.
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The actual filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/apply/applying-online/eterminal-disclaimer.
Claims 1, 2, 9, 10, 17, and 18 are rejected on the ground of nonstatutory obviousness-type double patenting as being unpatentable over Claims 1, 7, 8, 14, and 15 of co-pending Application No. 18/652,302 (hereinafter “‘302”) in view of US 2024/0028312 (hereinafter “Gillman”) and US 2025/0258723 (hereinafter “Liu”).
It is noted that the instant application is an earlier filed co-pending application and ‘302 is a later filed co-pending application. It is also noted that both ‘302 and the instant application were filed by the same applicant. Claims 1, 7, 8, 14, and 15 of ‘302 recite some of the limitations of Claims 1, 2, 9, 10, 17, and 18 of the instant application. Although the conflicting claims are not identical, they are not patentably distinct from each other because Claims 1, 2, 9, 10, 17, and 18 of the instant application define an obvious variation of the invention claimed in ‘302. For illustrative purposes, a detailed analysis for Claim 1 of the instant application is provided.
The “receiving” steps and “providing” step recited within Claim 1 of ‘302 are an obvious variant of some of the limitations within the “receiving” steps and “providing” step of Claim 1 of the instant application. Receiving/providing a first request for performing a task reasonably includes instructions and is therefore not patentably distinct from receiving/providing a first set of instructions for performing a first task. Moreover, the “performing” step recited within Claim 1 of ‘302 is an obvious variant of some of the limitations recited within the “executing” and “evaluating” steps of Claim 1 of the instant application, since executing the first set of executable code and checking whether the first task has been successfully completed is not patentably distinct from evaluating a quality of the first set of executable code; checking whether the task was successfully completed includes evaluating the quality of the code. Additionally, Claim 7 of ‘302 explicitly recites evaluating a quality of the first set of executable code. However, Claim 1 of the instant application recites further limitations, within the aforementioned steps, of “a first set of instructions for performing a first task and generating a first output,” “providing a list of available application programming interfaces (APIs) together with a submission of a request to the first LLM to select one API and to generate a first set of executable code based on the first set of instructions,” “receiving a selection of the one API,” and “executing the first set of executable code in order to […] generate the first output.”
As per Claim 1 of the instant application, Gillman discloses:
a first set of instructions for generating a first output (paragraph [0082], “[…] the method 600 includes receiving, by a machine learning engine, a user-specified data set and a natural language description of a user-requested data transformation task [first set of instructions] for execution with a subset of the user-specified data set (602) (emphasis added).”; paragraph [0081], “The method 600 includes executing, by the machine learning engine, the at least one candidate executable computer code to generate a transformation result [first output] (608) (emphasis added).”; paragraph [0075], “By way of example, the system 100 may include functionality for receiving a natural language description of a data manipulation for data in a spreadsheet […] generate executable code using the natural language description and automatically (e.g., without human intervention) execute the code to complete the described data manipulation (emphasis added).”);
providing [instructions] together with a submission of a request to the first LLM to generate a first set of executable code based on the first set of instructions (paragraph [0081], “The method 600 includes directing, by the machine learning engine, a large language model to generate at least one candidate executable computer code for performing the user-requested data transformation task (604) (emphasis added).”; paragraph [0078], “The method may include generating, by the large language model, executable computer code in a computer programming language specified in the natural language description of the user-requested data transformation task (emphasis added).”) [Examiner’s Remarks: Note that Gillman discloses the machine learning engine directing a large language model to generate executable computer code for performing the user-requested task. One of ordinary skill in the art would readily comprehend that in order for the large language model to generate the executable code that performs the user-requested task, the machine learning engine (processor) had to have provided the large language model with the user-requested task (first set of instructions) as an input together with a request to generate executable code based on the task (first set of instructions) while directing the large language model.];
executing the first set of executable code in order to […] generate the first output (paragraph [0081], “The method 600 includes directing, by the machine learning engine, a large language model to generate at least one candidate executable computer code for performing the user-requested data transformation task (604) […] The method 600 includes executing, by the machine learning engine, the at least one candidate executable computer code to generate a transformation result (608) (emphasis added).”).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of ‘302 to incorporate the teaching of Gillman into ‘302 to include “receiving, by the at least one processor, a first set of instructions for performing a first task and generating a first output; providing [instructions] together with a submission of a request to the first LLM to generate a first set of executable code based on the first set of instructions; executing the first set of executable code in order to perform the first task and generate the first output.” The modification would be obvious because one of ordinary skill in the art would be motivated to utilize a “machine learning engine with functionality for interacting with a large language model to generate and execute validated computer code to perform user-specified task described in a natural language (in contrast to, for example, a computer language)” since it “provides a technical improvement over conventional systems” that do not combine the functionality of a machine learning engine and LLM and “do not typically wait to generate models until after they have received the data and encoded it, and which do not typically select the encoders and the machine learning models to generate and train based on characteristics of both tasks and data” (Gillman, paragraphs [0063 & 0080]).
The combination of ‘302 and Gillman discloses “providing, by the at least one processor as an input to a first large language model (LLM), the first set of instructions, together with a submission of a request to the first LLM to generate a first set of executable code based on the first set of instructions” and “receiving, by the at least one processor from the first LLM, the first set of executable code,” but does not explicitly disclose:
providing, by the at least one processor as an input to a first large language model (LLM), a list of available application programming interfaces (APIs) and the first set of instructions, together with a submission of a request to the first LLM to select one API and to generate a first set of executable code based on the first set of instructions;
receiving, by the at least one processor from the first LLM, a selection of the one API and the first set of executable code.
However, Liu discloses:
providing, as an input to a first large language model (LLM), a list of available application programming interfaces (APIs), together with a submission of a request to the first LLM to select one API (paragraph [0006], “The first textual prompt for synthesizing user instructions can include, for instance, the list of APIs, the API documents for the list of APIs (or content of the API documents), and a request (in natural language) to generate a user instruction for performing a task using [selecting] a single tool/API from the list of APIs. The request, for instance, can be “generate a user instruction for performing a task using a single API from the provided APIs” (emphasis added).”; paragraph [0011], “During each of the one or more iterations, the first textual prompt for synthesizing user instructions can be processed as input, using the first trained LLM, to generate a corresponding synthetic natural language user instruction that describes a corresponding single-tool (e.g., single-API) task to be performed (emphasis added).”; paragraph [0087], “[…] processing a first textual prompt as input, using the first LLM, to generate a first synthetic natural language user instruction that describes to perform a first task. The first synthetic natural language user instruction may or may not identify a particular API which is the only API to be used to perform the first task (emphasis added).”);
receiving a selection of the one API (paragraph [0020], “Optionally, in some implementations, the m.sup.th synthetic natural language user instruction (A.sub.m) that describes the m.sup.th single-API task to be performed, can be processed iteratively using the second trained LLM, to generate the list of execution steps (emphasis added).”; paragraph [0068], “[…] in some cases, a synthetic natural language user instruction that describes a corresponding single-API task to be performed can explicitly identify the single API via which the single-API task is to be performed (emphasis added).”).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of ‘302 to incorporate the teaching of Liu into the combined teachings of ‘302 and Gillman to include “providing, by the at least one processor as an input to a first large language model (LLM), a list of available application programming interfaces (APIs) and the first set of instructions, together with a submission of a request to the first LLM to select one API and to generate a first set of executable code based on the first set of instructions; receiving, by the at least one processor from the first LLM, a selection of the one API and the first set of executable code.” The modification would be obvious because one of ordinary skill in the art would be motivated to have an LLM select an API from a list of APIs to ensure that the correct API is used for performing the task, and to use the selection in generating user instructions for performing API-based tasks (synthetic training data) that can be used to ensure “the diversity of the training data to train an LLM in possessing or improving its capability in handling tasks that utilize external tools or APIs” (Liu, paragraphs [0003, 0077, & 0087]).
A claim chart follows with the corresponding limitations between Claims 1 and 7 of ‘302 and Claim 1 of the instant application in bold.
Co-Pending Application No. 18/652,302
Instant Application No. 18/612,079
1. A method for automatically generating software code, the method being implemented by at least one processor, the method comprising: receiving, by the at least one processor, a first request for performing a first task and a first prompt that includes at least one from among application programming interface (API) information and at least one pre-defined helper function; providing, by the at least one processor as an input to a first large language model (LLM), the first request and a response to the first prompt that is received via a user interface (UI); receiving, by the at least one processor from the first LLM, a first set of executable code that implements a first function that corresponds to a skill that is usable for performing the first task; generating, by the at least one processor, a first test that relates to the first task; performing the first test by executing the first set of executable code and checking whether the first task has been successfully completed; and when the first task has been successfully completed, storing the first set of executable code as a skill in a skills library.
7. The method of claim 1, further comprising evaluating a quality of the first set of executable code with respect to at least one from among an accuracy of the first set of executable code, a robustness of the first set of executable code, and a consistency of the first set of executable code.
1. A method for evaluating code quality, the method being implemented by at least one processor, the method comprising: receiving, by the at least one processor, a first set of instructions for performing a first task and generating a first output; providing, by the at least one processor as an input to a first large language model (LLM), a list of available application programming interfaces (APIs) and the first set of instructions, together with a submission of a request to the first LLM to select one API and to generate a first set of executable code based on the first set of instructions; receiving, by the at least one processor from the first LLM, a selection of the one API and the first set of executable code; executing, by the at least one processor, the first set of executable code in order to perform the first task and generate the first output; and evaluating, by the at least one processor, a quality of the first set of executable code.
A claim chart follows with the corresponding limitations between Claims 2, 9, 10, 17, and 18 of the instant application and Claims 7, 8, 14, and 15 of ‘302 in bold. A detailed analysis is not shown for the purpose of brevity.
Co-Pending Application No. 18/652,302
Instant Application No. 18/612,079
7. The method of claim 1, further comprising evaluating a quality of the first set of executable code with respect to at least one from among an accuracy of the first set of executable code, a robustness of the first set of executable code, and a consistency of the first set of executable code.
2. The method of claim 1, wherein the evaluating of the quality of the first set of executable code comprises evaluating at least one from among an accuracy of the first set of executable code, a robustness of the first set of executable code, and a consistency of the first set of executable code.
8. A computing apparatus for automatically generating software code, the computing apparatus comprising: a processor; a memory; a display; and a communication interface coupled to each of the processor, the memory, and the display, wherein the processor is configured to: receive, via the communication interface, a first request for performing a first task and a first prompt that includes at least one from among application programming interface (API) information and at least one pre-defined helper function; provide, as an input to a first large language model (LLM), the first request and a response to the first prompt that is received via a user interface (UI); receive, via the communication interface from the first LLM, a first set of executable code that implements a first function that corresponds to a skill that is usable for performing the first task; generate a first test that relates to the first task; perform the first test by executing the first set of executable code and checking whether the first task has been successfully completed; and when the first task has been successfully completed, store the first set of executable code as a skill in a skills library.
14. The computing apparatus of claim 8, wherein the processor is further configured to evaluate a quality of the first set of executable code with respect to at least one from among an accuracy of the first set of executable code, a robustness of the first set of executable code, and a consistency of the first set of executable code.
9. A computing apparatus for evaluating code quality, the computing apparatus comprising: a processor; a memory; and a communication interface coupled to each of the processor and the memory, wherein the processor is configured to: receive, via the communication interface, a first set of instructions for performing a first task and generating a first output; provide, as an input to a first large language model (LLM), a list of available application programming interfaces (APIs) and the first set of instructions, together with a submission of a request to the first LLM to select one API and to generate a first set of executable code based on the first set of instructions; receive, from the first LLM, a selection of the one API and the first set of executable code; execute the first set of executable code in order to perform the first task and generate the first output; and evaluate a quality of the first set of executable code.
14. The computing apparatus of claim 8, wherein the processor is further configured to evaluate a quality of the first set of executable code with respect to at least one from among an accuracy of the first set of executable code, a robustness of the first set of executable code, and a consistency of the first set of executable code.
10. The computing apparatus of claim 9, wherein the processor is further configured to evaluate the quality of the first set of executable code by performing at least one from among an evaluation of an accuracy of the first set of executable code, an evaluation of a robustness of the first set of executable code, and an evaluation of a consistency of the first set of executable code.
15. A non-transitory computer readable storage medium storing instructions for automatically generating software code, the storage medium comprising a first set of executable code which, when executed by a processor, causes the processor to: receive a first request for performing a first task and a first prompt that includes at least one from among application programming interface (API) information and at least one pre-defined helper function; provide, as an input to a first large language model (LLM), the first request and a response to the first prompt that is received via a user interface (UI); receive, from the first LLM, a second set of executable code that implements a first function that corresponds to a skill that is usable for performing the first task; generate a first test that relates to the first task; perform the first test by executing the second set of executable code and checking whether the first task has been successfully completed; and when the first task has been successfully completed, store the second set of executable code as a skill in a skills library.
17. A non-transitory computer readable storage medium storing instructions for evaluating code quality, the storage medium comprising a first set of executable code which, when executed by a processor, causes the processor to: receive a first set of instructions for performing a first task and generating a first output; provide, as an input to a first large language model (LLM), a list of available application programming interfaces (APIs) and the first set of instructions, together with a submission of a request to the first LLM to select one API and to generate a second set of executable code based on the first set of instructions; receive, from the first LLM, a selection of the one API and the second set of executable code; execute the second set of executable code in order to perform the first task and generate the first output; and evaluate a quality of the second set of executable code.
14. The computing apparatus of claim 8, wherein the processor is further configured to evaluate a quality of the first set of executable code with respect to at least one from among an accuracy of the first set of executable code, a robustness of the first set of executable code, and a consistency of the first set of executable code.
18. The storage medium of claim 17, wherein when executed by the processor, the first set of executable code further causes the processor to evaluate the quality of the second set of executable code by evaluating at least one from among an accuracy of the second set of executable code, a robustness of the second set of executable code, and a consistency of the second set of executable code.
Thus, Claims 1, 2, 9, 10, 17, and 18 of the instant application are obvious over Claims 1, 7, 8, 14, and 15 of ‘302.
This is a provisional nonstatutory double patenting rejection.
Claims 1, 9, and 17 are rejected on the ground of nonstatutory obviousness-type double patenting as being unpatentable over Claims 1, 10, and 19 of co-pending Application No. 18/675,688 (hereinafter “‘688”) in view of US 2024/0028312 (hereinafter “Gillman”) and US 2025/0258723 (hereinafter “Liu”).
It is noted that the instant application is an earlier filed co-pending application and ‘688 is a later filed co-pending application. It is also noted that both ‘688 and the instant application were filed by the same applicant. Claims 1, 10, and 19 of ‘688 recite some of the limitations of Claims 1, 9, and 17 of the instant application. Although the conflicting claims are not identical, they are not patentably distinct from each other because Claims 1, 9, and 17 of the instant application define an obvious variation of the invention claimed in ‘688. For illustrative purposes, a detailed analysis for Claim 1 of the instant application is provided.
The “receiving” steps and “providing” step recited within Claim 1 of ‘688 are an obvious variant of some of the limitations within the “receiving” steps and “providing” step of Claim 1 of the instant application. Receiving/providing a first request for performing a task reasonably includes instructions and is therefore not patentably distinct from receiving/providing a first set of instructions for performing a first task. However, Claim 1 of the instant application recites further limitations of “a first set of instructions for performing a first task and generating a first output,” “providing a list of available application programming interfaces (APIs) together with a submission of a request to the first LLM to select one API and to generate a first set of executable code based on the first set of instructions,” “receiving a selection of the one API,” “executing, by the at least one processor, the first set of executable code in order to perform the first task and generate the first output,” and “evaluating, by the at least one processor, a quality of the first set of executable code.”
As per Claim 1 of the instant application, Gillman discloses:
a first set of instructions for generating a first output (paragraph [0082], “[…] the method 600 includes receiving, by a machine learning engine, a user-specified data set and a natural language description of a user-requested data transformation task [first set of instructions] for execution with a subset of the user-specified data set (602) (emphasis added).”; paragraph [0081], “The method 600 includes executing, by the machine learning engine, the at least one candidate executable computer code to generate a transformation result [first output] (608) (emphasis added).”; paragraph [0075], “By way of example, the system 100 may include functionality for receiving a natural language description of a data manipulation for data in a spreadsheet […] generate executable code using the natural language description and automatically (e.g., without human intervention) execute the code to complete the described data manipulation (emphasis added).”);
providing [instructions] together with a submission of a request to the first LLM to generate a first set of executable code based on the first set of instructions (paragraph [0081], “The method 600 includes directing, by the machine learning engine, a large language model to generate at least one candidate executable computer code for performing the user-requested data transformation task (604) (emphasis added).”; paragraph [0078], “The method may include generating, by the large language model, executable computer code in a computer programming language specified in the natural language description of the user-requested data transformation task (emphasis added).”) [Examiner’s Remarks: Note that Gillman discloses the machine learning engine directing a large language model to generate executable computer code for performing the user-requested task. One of ordinary skill in the art would readily comprehend that in order for the large language model to generate the executable code that performs the user-requested task, the machine learning engine (processor) had to have provided the large language model with the user-requested task (first set of instructions) as an input together with a request to generate executable code based on the task (first set of instructions) while directing the large language model.];
executing, by the at least one processor, the first set of executable code in order to perform the first task and generate the first output (paragraph [0081], “The method 600 includes directing, by the machine learning engine, a large language model to generate at least one candidate executable computer code for performing the user-requested data transformation task (604) […] The method 600 includes executing, by the machine learning engine [processor], the at least one candidate executable computer code to generate a transformation result (608) (emphasis added).”); and
evaluating, by the at least one processor, a quality of the first set of executable code (paragraph [0081], “The method 600 includes performing, by the machine learning engine [processor], at least one validation [quality] check on the at least one candidate executable computer code (606) (emphasis added).”).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of ‘688 to incorporate the teaching of Gillman into ‘688 to include “receiving, by the at least one processor, a first set of instructions for performing a first task and generating a first output; providing [instructions] together with a submission of a request to the first LLM to generate a first set of executable code based on the first set of instructions; executing, by the at least one processor, the first set of executable code in order to perform the first task and generate the first output; and evaluating, by the at least one processor, a quality of the first set of executable code.” The modification would be obvious because one of ordinary skill in the art would be motivated to utilize a “machine learning engine with functionality for interacting with a large language model to generate and execute validated computer code to perform user-specified task described in a natural language (in contrast to, for example, a computer language)” since it “provides a technical improvement over conventional systems” that do not combine the functionality of a machine learning engine and LLM and “do not typically wait to generate models until after they have received the data and encoded it, and which do not typically select the encoders and the machine learning models to generate and train based on characteristics of both tasks and data” (Gillman, paragraph [0063 & 0080]).
The combination of ‘688 and Gillman discloses “providing, by the at least one processor as an input to a first large language model (LLM), the first set of instructions, together with a submission of a request to the first LLM to generate a first set of executable code based on the first set of instructions” and “receiving, by the at least one processor from the first LLM, the first set of executable code,” but does not explicitly disclose:
providing, by the at least one processor as an input to a first large language model (LLM), a list of available application programming interfaces (APIs) and the first set of instructions, together with a submission of a request to the first LLM to select one API and to generate a first set of executable code based on the first set of instructions;
receiving, by the at least one processor from the first LLM, a selection of the one API and the first set of executable code.
However, Liu discloses:
providing, as an input to a first large language model (LLM), a list of available application programming interfaces (APIs), together with a submission of a request to the first LLM to select one API (paragraph [0006], “The first textual prompt for synthesizing user instructions can include, for instance, the list of APIs, the API documents for the list of APIs (or content of the API documents), and a request (in natural language) to generate a user instruction for performing a task using [selecting] a single tool/API from the list of APIs. The request, for instance, can be “generate a user instruction for performing a task using a single API from the provided APIs” (emphasis added).”; paragraph [0011], “During each of the one or more iterations, the first textual prompt for synthesizing user instructions can be processed as input, using the first trained LLM, to generate a corresponding synthetic natural language user instruction that describes a corresponding single-tool (e.g., single-API) task to be performed (emphasis added).”; paragraph [0087], “[…] processing a first textual prompt as input, using the first LLM, to generate a first synthetic natural language user instruction that describes to perform a first task. The first synthetic natural language user instruction may or may not identify a particular API which is the only API to be used to perform the first task (emphasis added).”);
receiving a selection of the one API (paragraph [0020], “Optionally, in some implementations, the m.sup.th synthetic natural language user instruction (A.sub.m) that describes the m.sup.th single-API task to be performed, can be processed iteratively using the second trained LLM, to generate the list of execution steps (emphasis added).”; paragraph [0068], “[…] in some cases, a synthetic natural language user instruction that describes a corresponding single-API task to be performed can explicitly identify the single API via which the single-API task is to be performed (emphasis added).”).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of ‘688 to incorporate the teaching of Liu into the combined teachings of ‘688 and Gillman to include “providing, by the at least one processor as an input to a first large language model (LLM), a list of available application programming interfaces (APIs) and the first set of instructions, together with a submission of a request to the first LLM to select one API and to generate a first set of executable code based on the first set of instructions; receiving, by the at least one processor from the first LLM, a selection of the one API and the first set of executable code.” The modification would be obvious because one of ordinary skill in the art would be motivated to have an LLM select an API from a list of APIs to ensure that the correct API is used for performing the task, and to use the selection in generating user instructions for performing API-based tasks (synthetic training data) that can be used to ensure “the diversity of the training data to train an LLM in possessing or improving its capability in handling tasks that utilize external tools or APIs” (Liu, paragraphs [0003, 0077, & 0087]).
A claim chart follows with the corresponding limitations between Claim 1 of ‘688 and Claim 1 of the instant application in bold.
Co-Pending Application No. 18/675,688
Instant Application No. 18/612,079
1. A method for improving a quality of software code, the method being implemented by at least one processor, the method comprising: receiving, by the at least one processor, a first request for performing a first task; providing, by the at least one processor as an input to a first large language model (LLM), the first request; receiving, by the at least one processor from the first LLM, a first set of executable code that is intended to be usable for performing the first task; automatically executing the first set of executable code in an environment that includes at least one guardrail component that is configured to detect at least one type of error; detecting at least one error based on a result of the executing; determining at least one feedback item based on the at least one error; and prompting the first LLM to generate a second set of executable code based on the first request, the first set of executable code, and the at least one feedback item.
1. A method for evaluating code quality, the method being implemented by at least one processor, the method comprising: receiving, by the at least one processor, a first set of instructions for performing a first task and generating a first output; providing, by the at least one processor as an input to a first large language model (LLM), a list of available application programming interfaces (APIs) and the first set of instructions, together with a submission of a request to the first LLM to select one API and to generate a first set of executable code based on the first set of instructions; receiving, by the at least one processor from the first LLM, a selection of the one API and the first set of executable code; executing, by the at least one processor, the first set of executable code in order to perform the first task and generate the first output; and evaluating, by the at least one processor, a quality of the first set of executable code.
A claim chart follows with the corresponding limitations between Claims 9 and 17 of the instant application and Claims 10 and 19 of ‘688 in bold. A detailed analysis is not shown for the purpose of brevity.
Co-Pending Application No. 18/675,688
Instant Application No. 18/612,079
10. A computing apparatus for improving a quality of software code, the computing apparatus comprising: a processor; a memory; and a communication interface coupled to each of the processor, the memory, and the display, wherein the processor is configured to: receive, via the communication interface, a first request for performing a first task; provide, as an input to a first large language model (LLM), the first request; receive, from the first LLM via the communication interface, a first set of executable code that is intended to be usable for performing the first task; automatically execute the first set of executable code in an environment that includes at least one guardrail component that is configured to detect at least one type of error; detect at least one error based on a result of the executing; determine at least one feedback item based on the at least one error; and prompt the first LLM to generate a second set of executable code based on the first request, the first set of executable code, and the at least one feedback item.
9. A computing apparatus for evaluating code quality, the computing apparatus comprising: a processor; a memory; and a communication interface coupled to each of the processor and the memory, wherein the processor is configured to: receive, via the communication interface, a first set of instructions for performing a first task and generating a first output; provide, as an input to a first large language model (LLM), a list of available application programming interfaces (APIs) and the first set of instructions, together with a submission of a request to the first LLM to select one API and to generate a first set of executable code based on the first set of instructions; receive, from the first LLM, a selection of the one API and the first set of executable code; execute the first set of executable code in order to perform the first task and generate the first output; and evaluate a quality of the first set of executable code.
19. A non-transitory computer readable storage medium storing instructions for improving a quality of software code, the storage medium comprising a first set of executable code which, when executed by a processor, causes the processor to: receive a first request for performing a first task; provide, as an input to a first large language model (LLM), the first request; receive, from the first LLM, a second set of executable code that is intended to be usable for performing the first task; automatically execute the second set of executable code in an environment that includes at least one guardrail component that is configured to detect at least one type of error; detect at least one error based on a result of the execution; determine at least one feedback item based on the at least one error; and prompt the first LLM to generate a third set of executable code based on the first request, the second set of executable code, and the at least one feedback item.
17. A non-transitory computer readable storage medium storing instructions for evaluating code quality, the storage medium comprising a first set of executable code which, when executed by a processor, causes the processor to: receive a first set of instructions for performing a first task and generating a first output; provide, as an input to a first large language model (LLM), a list of available application programming interfaces (APIs) and the first set of instructions, together with a submission of a request to the first LLM to select one API and to generate a second set of executable code based on the first set of instructions; receive, from the first LLM, a selection of the one API and the second set of executable code; execute the second set of executable code in order to perform the first task and generate the first output; and evaluate a quality of the second set of executable code.
Thus, Claims 1, 9, and 17 of the instant application are obvious over Claims 1, 10, and 19 of ‘688.
This is a provisional nonstatutory double patenting rejection.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 2, 9, 10, 17, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over US 2024/0028312 (hereinafter “Gillman”) in view of US 2025/0258723 (hereinafter “Liu”).
As per Claim 1, Gillman discloses:
A method for evaluating code quality, the method being implemented by at least one processor (paragraph [0096], “The systems and methods described above may be implemented as a method, apparatus, or article of manufacture […] The techniques described above may be implemented in one or more computer programs executing on a programmable computer including a processor (emphasis added).”), the method comprising:
receiving, by the at least one processor, a first set of instructions for performing a first task and generating a first output (paragraph [0082], “[…] the method 600 includes receiving, by a machine learning engine [processor], a user-specified data set and a natural language description of a user-requested data transformation task [first set of instructions] for execution with a subset of the user-specified data set (602) (emphasis added).”; paragraph [0081], “The method 600 includes executing, by the machine learning engine [processor], the at least one candidate executable computer code to generate a transformation result [first output] (608) (emphasis added).”; paragraph [0075], “By way of example, the system 100 may include functionality for receiving a natural language description of a data manipulation for data in a spreadsheet […] generate executable code using the natural language description and automatically (e.g., without human intervention) execute the code to complete the described data manipulation (emphasis added).”; paragraph [0017], “The machine learning engine 103 may be provided as a hardware component.”);
providing, by the at least one processor as an input to a first large language model (LLM), […] the first set of instructions, together with a submission of a request to the first LLM to […] generate a first set of executable code based on the first set of instructions (paragraph [0081], “The method 600 includes directing, by the machine learning engine [processor], a large language model to generate at least one candidate executable computer code for performing the user-requested data transformation task (604) (emphasis added).”; paragraph [0078], “The method may include generating, by the large language model, executable computer code in a computer programming language specified in the natural language description of the user-requested data transformation task (emphasis added).”) [Examiner’s Remarks: Note that Gillman discloses the machine learning engine directing a large language model to generate executable computer code for performing the user-requested task. One of ordinary skill in the art would readily comprehend that in order for the large language model to generate the executable code that performs the user-requested task, the machine learning engine (processor) had to have provided the large language model with the user-requested task (first set of instructions) as an input together with a request to generate executable code based on the task (first set of instructions) while directing the large language model.];
receiving, by the at least one processor from the first LLM, […] the first set of executable code (paragraph [0080], “[…] the functionality of a large learning model trained to generate executable code for use in performing additional tasks on the output of the generated machine learning models and with the functionality of the machine learning engine [processor] to evaluate and validate the generated code and then execute the generated code (emphasis added).”) [Examiner’s Remarks: Note that Gillman discloses the large language model generating executable code and the machine learning engine validating/executing the generated code. One of ordinary skill in the art would readily comprehend that the machine learning engine (processor) must have received the generated executable code from the large language model in order to validate and execute the generated code.];
executing, by the at least one processor, the first set of executable code in order to perform the first task and generate the first output (paragraph [0081], “The method 600 includes directing, by the machine learning engine, a large language model to generate at least one candidate executable computer code for performing the user-requested data transformation task (604) […] The method 600 includes executing, by the machine learning engine [processor], the at least one candidate executable computer code to generate a transformation result (608) (emphasis added).”); and
evaluating, by the at least one processor, a quality of the first set of executable code (paragraph [0081], “The method 600 includes performing, by the machine learning engine [processor], at least one validation [quality] check on the at least one candidate executable computer code (606) (emphasis added).”).
Gillman discloses “providing, by the at least one processor as an input to a first large language model (LLM), […] the first set of instructions, together with a submission of a request to the first LLM to […] generate a first set of executable code based on the first set of instructions” and “receiving, by the at least one processor from the first LLM, […] the first set of executable code” but does not explicitly disclose:
providing, by the at least one processor as an input to a first large language model (LLM), a list of available application programming interfaces (APIs) and the first set of instructions, together with a submission of a request to the first LLM to select one API and to generate a first set of executable code based on the first set of instructions;
receiving, by the at least one processor from the first LLM, a selection of the one API and the first set of executable code.
However, Liu discloses:
providing, as an input to a first large language model (LLM), a list of available application programming interfaces (APIs), together with a submission of a request to the first LLM to select one API (paragraph [0006], “The first textual prompt for synthesizing user instructions can include, for instance, the list of APIs, the API documents for the list of APIs (or content of the API documents), and a request (in natural language) to generate a user instruction for performing a task using [selecting] a single tool/API from the list of APIs. The request, for instance, can be “generate a user instruction for performing a task using a single API from the provided APIs” (emphasis added).”; paragraph [0011], “During each of the one or more iterations, the first textual prompt for synthesizing user instructions can be processed as input, using the first trained LLM, to generate a corresponding synthetic natural language user instruction that describes a corresponding single-tool (e.g., single-API) task to be performed (emphasis added).”; paragraph [0087], “[…] processing a first textual prompt as input, using the first LLM, to generate a first synthetic natural language user instruction that describes to perform a first task. The first synthetic natural language user instruction may or may not identify a particular API which is the only API to be used to perform the first task (emphasis added).”);
receiving a selection of the one API (paragraph [0020], “Optionally, in some implementations, the m.sup.th synthetic natural language user instruction (A.sub.m) that describes the m.sup.th single-API task to be performed, can be processed iteratively using the second trained LLM, to generate the list of execution steps (emphasis added).”; paragraph [0068], “[…] in some cases, a synthetic natural language user instruction that describes a corresponding single-API task to be performed can explicitly identify the single API via which the single-API task is to be performed (emphasis added).”).
Gillman is within the same field of endeavor as the claimed invention regarding the utilization of an LLM for code generation. Liu is also within the same field of endeavor as the claimed invention regarding the utilization of an LLM for API selection.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teaching of Liu into the teaching of Gillman to include “providing, by the at least one processor as an input to a first large language model (LLM), a list of available application programming interfaces (APIs) and the first set of instructions, together with a submission of a request to the first LLM to select one API and to generate a first set of executable code based on the first set of instructions; receiving, by the at least one processor from the first LLM, a selection of the one API and the first set of executable code.” The modification would be obvious because one of ordinary skill in the art would be motivated to have an LLM select an API from a list of APIs to ensure that the correct API is used for performing the task, and to use the selection in generating user instructions for performing API-based tasks (synthetic training data) that can be used to ensure “the diversity of the training data to train an LLM in possessing or improving its capability in handling tasks that utilize external tools or APIs” (Liu, paragraphs [0003, 0077, & 0087]).
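For illustration only, the following Python sketch shows one way the workflow addressed in the rejection of Claim 1 could be realized: a set of instructions and a list of available APIs are provided to an LLM with a request to select one API and generate executable code, the returned code is executed to produce an output, and a quality evaluation is performed. The sketch is not drawn from Gillman, Liu, or the instant application; the call_llm and run_sandboxed helpers are stubbed, hypothetical placeholders.

# Illustrative sketch only; not part of the claims or the cited references.
# The LLM call and the execution sandbox are stubbed placeholders; a real
# system would invoke an actual model and isolate execution of generated code.
import json

def call_llm(prompt: str) -> str:
    """Stub standing in for a hypothetical LLM invocation."""
    return json.dumps({"selected_api": "weather_api",
                       "code": "output = 'sunny'"})

def run_sandboxed(code: str):
    """Stub executor; returns (output, ran_ok)."""
    scope = {}
    try:
        exec(code, scope)  # execute the generated code
        return scope.get("output"), True
    except Exception:
        return None, False

def evaluate_generated_code(instructions: str, available_apis: list) -> dict:
    # (1) Provide the instructions and the list of available APIs to the LLM,
    #     together with a request to select one API and generate executable code.
    prompt = ("Available APIs:\n" + "\n".join(available_apis) + "\n\n"
              "Task instructions:\n" + instructions + "\n\n"
              "Select exactly one API and return JSON with keys "
              "'selected_api' and 'code'.")
    # (2) Receive the API selection and the generated code from the LLM.
    response = json.loads(call_llm(prompt))
    selected_api, code = response["selected_api"], response["code"]
    # (3) Execute the generated code to perform the task and produce the output.
    output, ran_ok = run_sandboxed(code)
    # (4) Evaluate the quality of the generated code (minimal checks shown here).
    quality = {"runs": ran_ok, "uses_selected_api": selected_api in code}
    return {"selected_api": selected_api, "output": output, "quality": quality}

print(evaluate_generated_code("Report today's weather.",
                              ["weather_api", "stock_api"]))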
As per Claim 2, the rejection of Claim 1 is incorporated; and Gillman further discloses:
wherein the evaluating of the quality of the first set of executable code comprises evaluating at least one from among an accuracy of the first set of executable code, a robustness of the first set of executable code, and a consistency of the first set of executable code (paragraph [0084], “The method 600 includes performing, by the machine learning engine, at least one validation [quality] check on the at least one candidate executable computer code (606) […] (the validation step is important because not all generated programs will be correct or viable; many will not run) [evaluating an accuracy of the first set of executable code] […] (emphasis added).”).
As per Claim 9, Gillman discloses:
A computing apparatus (paragraph [0096], “The systems and methods described above may be implemented as a method, apparatus, or article of manufacture […] (emphasis added).”) for evaluating code quality, the computing apparatus comprising:
a processor (paragraph [0096], “The techniques described above may be implemented in one or more computer programs executing on a programmable computer including a processor […].”);
a memory (paragraph [0108], “In the embodiment shown in FIG. 4B, the processor 421 communicates with main memory 422 via a system bus 450.”); and
a communication interface coupled to each of the processor and the memory (paragraph [0108], “In the embodiment shown in FIG. 4B, the processor 421 communicates with main memory 422 via a system bus 450.”), wherein the processor is configured to: […].
Claim 9 is an apparatus claim corresponding to method Claim 1 and the remainder of Claim 9 is rejected for the same reasons as given in the rejection of Claim 1.
Claim 10 is an apparatus claim corresponding to method Claim 2 and is rejected for the same reasons as given in the rejection of that claim.
As per Claim 17, Gillman discloses:
A non-transitory computer readable storage medium storing instructions (paragraph [0098], “Storage devices suitable for tangibly embodying computer program instructions include, for example, all forms of computer-readable devices, firmware, programmable logic, hardware (e.g., integrated circuit chip; electronic devices; a computer-readable non-volatile storage unit; non-volatile memory, such as semiconductor memory devices, including EPROM, EEPROM, and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROMs).”) for evaluating code quality, the storage medium comprising a first set of executable code which, when executed by a processor (paragraph [0098], “Method steps may be performed by a computer processor executing a program tangibly embodied on a computer-readable medium to perform functions of the methods and systems described herein by operating on input and generating output.”), causes the processor to: […].
Claim 17 is a non-transitory computer readable storage medium claim corresponding to method Claim 1 and the remainder of Claim 17 is rejected for the same reasons as given in the rejection of Claim 1.
Claim 18 is a non-transitory computer readable storage medium claim corresponding to method Claim 2 and is rejected for the same reasons as given in the rejection of that claim.
Claims 3 and 11 are rejected under 35 U.S.C. 103 as being unpatentable over Gillman in view of Liu as applied to Claims 2 and 10 above, and further in view of US 2010/0229151 (hereinafter “Yuan”) and “De-Hallucinator: Iterative Grounding for LLM-Based Code Completion” (hereinafter “Eghbali”).
As per Claim 3, the rejection of Claim 2 is incorporated; and Gillman discloses “the evaluating of the accuracy of the first set of executable code (paragraph [0084], “The method 600 includes performing, by the machine learning engine, at least one validation check on the at least one candidate executable computer code (606) […] (the validation step is important because not all generated programs will be correct or viable; many will not run) [evaluating an accuracy of the first set of executable code] (emphasis added).”)” and “the first set of executable code (paragraph [0081], “The method 600 includes directing, by the machine learning engine, a large language model to generate at least one candidate executable computer code for performing the user-requested data transformation task (604) (emphasis added).),” but the combination of Gillman and Liu does not explicitly disclose:
checking whether the first set of executable code runs;
checking whether the first set of executable code calls a correct API with correct parameters; and
checking whether the first output matches with an expected output.
However, Yuan discloses:
checking whether the first set of executable code runs (paragraph [0031], “At step 110, a quality control check can be made to the implantation code generated at step 108 in order to verify that such code will run properly when executed […] A local controls engineer or other personnel or device can manually or automatically compare the implementation code to a standard, can run an off-line test, or perform whatever other steps are needed to properly verify the accuracy of the code generated at step 110 (emphasis added).”).
Yuan is within the same field of endeavor as the claimed invention regarding the evaluation of code accuracy.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teaching of Yuan into the combined teachings of Gillman and Liu to include “checking whether the first set of executable code runs.” The modification would be obvious because one of ordinary skill in the art would be motivated to verify code accuracy by checking whether the generated code runs in order to ensure that the code functions properly when executed (Yuan, paragraph [0031]).
However, Eghbali discloses:
checking whether the first set of executable code calls a correct API with correct parameters (Section 5.3 RQ2: Correct Retrieval of API References page 15, Figure 8; Section 5.1 Experimental Setup page 12, “Exact API match. Since the goal of De-Hallucinator is to predict better API usages, we measure how many of all desired API usages are predicted exactly as in the ground truth. To identify the API usages in the lines of code to complete, we extract function calls, including the access path to the function, and the parameters. For example, given a line of code docs = ds.find_by_keyword(keyword) the corresponding API usage is ds.find_by_keyword(keyword). The exact API match then is the percentage of exact matches between the prediction and the ground truth API usages (emphasis added).”); and
checking whether the first output matches with an expected output (Section 5.1 Experimental Setup page 12, “Exact API match. Since the goal of De-Hallucinator is to predict better API usages, we measure how many of all desired API usages are predicted exactly as in the ground truth [expected output]. To identify the API usages in the lines of code to complete, we extract function calls, including the access path to the function, and the parameters. For example, given a line of code docs = ds.find_by_keyword(keyword) the corresponding API usage is ds.find_by_keyword(keyword). The exact API match then is the percentage of exact matches between the prediction [first output] and the ground truth API usages [expected output] (emphasis added).”).
Eghbali is within the same field of endeavor as the claimed invention regarding checking the correctness of LLM generated API calls.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teaching of Eghbali into the combined teachings of Gillman, Liu, and Yuan to include “checking whether the first set of executable code calls a correct API with correct parameters; and checking whether the first output matches with an expected output.” The modification would be obvious because one of ordinary skill in the art would be motivated to use a De-Hallucinator that improves the input given to an LLM and to check whether the generated code makes correct API calls and matches the expected output in order to “improve the quality of code completions over the state-of-the-art baseline” (Eghbali, Section 7 Related Work page 19 & Section 8 Conclusion page 20).
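By way of a hedged illustration only (not drawn from Gillman, Liu, Yuan, Eghbali, or the application under examination; every function name, path, and parameter below is hypothetical), checks of the kind discussed above, i.e., whether generated code runs, whether it calls an expected API with expected parameters, and whether its output matches an expected output, might be sketched in Python as follows:

# Minimal illustrative sketch; all names here are hypothetical.
import ast
import subprocess

def check_code_runs(code_path):
    """Check whether the generated code executes without raising an error."""
    result = subprocess.run(["python", code_path], capture_output=True, text=True)
    return result.returncode == 0

def check_api_call(code_text, expected_call, expected_params):
    """Check whether the code calls the expected API with the expected parameters."""
    for node in ast.walk(ast.parse(code_text)):
        if isinstance(node, ast.Call):
            name = ast.unparse(node.func)
            params = [ast.unparse(arg) for arg in node.args]
            if name == expected_call and params == expected_params:
                return True
    return False

def check_output_matches(code_path, expected_output):
    """Check whether the output produced by running the code matches the expected output."""
    result = subprocess.run(["python", code_path], capture_output=True, text=True)
    return result.stdout.strip() == expected_output.strip()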
Claim 11 is an apparatus claim corresponding to method Claim 3 and is rejected for the same reasons as given in the rejection of that claim.
Claims 4, 6, 7, 12, 14, and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Gillman in view of Liu as applied to Claims 2 and 10 above, and further in view of “Can Large Language Models Identify And Reason About Security Vulnerabilities? Not Yet” (hereinafter “Ullah”).
As per Claim 4, the rejection of Claim 2 is incorporated [Examiner’s Remarks: Since Claim 2 stated only evaluating at least one from among an accuracy, robustness, and consistency of the first set of executable code and the Examiner chose evaluating an accuracy of the first set of executable code (the rejection of Claim 3), the rejection of Claim 4 is optional. However, in order to promote compact prosecution, the Examiner is including this rejection.]; and Gillman discloses “the first set of executable code (paragraph [0081], “The method 600 includes directing, by the machine learning engine, a large language model to generate at least one candidate executable computer code for performing the user-requested data transformation task (604) (emphasis added).”)” and “the first set of instructions (paragraph [0082], “[…] the method 600 includes receiving, by a machine learning engine [processor], a user-specified data set and a natural language description of a user-requested data transformation task [first set of instructions] for execution with a subset of the user-specified data set (602) (emphasis added).”),” but does not explicitly disclose:
wherein the evaluating of the robustness of the first set of executable code comprises:
determining a difficulty level of the first set of instructions; and
assessing the selection of the one API and an ability to execute the first set of instructions based on the determined difficulty level.
However, Liu discloses:
the selection of the one API (paragraph [0006], “The first textual prompt for synthesizing user instructions can include, for instance, the list of APIs, the API documents for the list of APIs (or content of the API documents), and a request […] The request, for instance, can be “generate a user instruction for performing a task using [selecting] a single API from the provided APIs” (emphasis added).”; paragraph [0087], “[…] processing a first textual prompt as input, using the first LLM, to generate a first synthetic natural language user instruction that describes to perform a first task. The first synthetic natural language user instruction may or may not identify a particular API which is the only API to be used to perform the first task (emphasis added).”; paragraph [0020], “Optionally, in some implementations, the m.sup.th synthetic natural language user instruction (A.sub.m) that describes the m.sup.th single-API task to be performed, can be processed iteratively using the second trained LLM, to generate the list of execution steps (emphasis added).”).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teaching of Liu into the teaching of Gillman to include “the selection of the one API.” The modification would be obvious because one of ordinary skill in the art would be motivated to have an LLM select one API from a list of APIs to ensure that the correct API is used and processed for performing the task, and to use the selection in training data “to train an LLM in possessing or improving its capability in handling tasks that utilize external tools or APIs” (Liu, paragraphs [0003, 0077, & 0087]).
However, Ullah discloses:
wherein the evaluating of the robustness of the [LLM] comprises:
determining a difficulty level of the first set of instructions (Section 4.6. Code Difficulty Levels page 11, “In this section, we investigate the capabilities of LLMs to handle different complexities of code. Similar to the previous sections, we find the best performing prompts for each difficulty level using Scorediff, with equal weight to all factors, from four prompting categories (emphasis added).”; Section 3.4. Datasets page 4, “Moreover, we design our code scenarios with three difficulty levels, (1) easy, (2) medium, (3) hard […] The difficulty levels assess how LLMs interact with code of increasing complexity (emphasis added).”); and
assessing an ability to execute the first set of instructions based on the determined difficulty level (Section 4.6. Code Difficulty Levels page 11, “In this section, we investigate the capabilities of LLMs to handle different complexities of code. Similar to the previous sections, we find the best performing prompts for each difficulty level using Scorediff, with equal weight to all factors, from four prompting categories (emphasis added).”; Section 1. Introduction page 1, “Our framework tests the capabilities of a given LLM as a security assistant across eight distinct dimensions: […] (6) assessment of various code difficulty levels, (7) robustness to code augmentations […] (emphasis added).”; Section 3.4. Datasets page 4, “We design 228 code scenarios (48 hand-crafted, 30 real world, and 150 with code augmentations) to test various aspects of the capabilities of LLMs to detect software vulnerabilities in code. We use these scenarios to craft prompts by including code, examples, definitions, and step-by-step reasoning as shown in Table 3.”) [Examiner’s Remarks: Note that Ullah discloses testing the capabilities of a LLM to detect software vulnerabilities in code and handle different complexities of code, finding the best performing prompt for each difficulty level, and prompts (instructions) that include code. One of ordinary skill in the art would readily comprehend that testing the capabilities of LLMs to detect software vulnerabilities in code and handle different complexities of code is assessing its ability to execute the instructions in the prompt (finding the software vulnerability) based on the determined difficulty level.].
Ullah is within the same field of endeavor as the claimed invention regarding LLMs and assigning difficulty levels to instructions.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teaching of Ullah into the combined teachings of Gillman and Liu to include “wherein the evaluating of the robustness of the first set of executable code comprises: determining a difficulty level of the first set of instructions; and assessing the selection of the one API and an ability to execute the first set of instructions based on the determined difficulty level.” The modification would be obvious because one of ordinary skill in the art would be motivated to determine a difficulty level of instructions and assess an LLM’s ability to execute them using an automated framework, such as in the case of software vulnerability detection, in order to effectively identify the best-performing LLMs based on the difficulty level and to utilize the framework as a “useful tool […] to evaluate the progress of future LLM versions in vulnerability detection” (Ullah, Section 4.6. Code Difficulty Levels: Observations page 11 & Section 6. Conclusion page 14).
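As a hedged illustration only (the scenario format and field names are hypothetical and are not taken from Ullah, Gillman, Liu, or the claims), an assessment that groups instructions by a determined difficulty level and scores the ability to execute them at each level might be sketched as:

# Minimal illustrative sketch; the scenario dictionaries and their keys are hypothetical.
from collections import defaultdict

def accuracy_by_difficulty(scenarios):
    """Report, for each difficulty level (e.g., easy/medium/hard), the fraction
    of scenarios for which the produced answer matched the ground truth."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for scenario in scenarios:
        level = scenario["difficulty"]
        totals[level] += 1
        if scenario["answer"] == scenario["ground_truth"]:
            correct[level] += 1
    return {level: correct[level] / totals[level] for level in totals}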
As per Claim 6, the rejection of Claim 2 is incorporated [Examiner’s Remarks: Since Claim 2 stated only evaluating at least one from among an accuracy, robustness, and consistency of the first set of executable code and the Examiner chose evaluating an accuracy of the first set of executable code (the rejection of Claim 3), the rejection of Claim 6 is optional. However, in order to promote compact prosecution, the Examiner is including this rejection.]; and Gillman discloses “the first set of executable code (paragraph [0081], “The method 600 includes directing, by the machine learning engine, a large language model to generate at least one candidate executable computer code for performing the user-requested data transformation task (604) (emphasis added).”)” and “executing of the first set of executable code (paragraph [0081], “The method 600 includes executing, by the machine learning engine, the at least one candidate executable computer code to generate a transformation result (608) (emphasis added).”),” but the combination of Gillman and Liu does not explicitly disclose:
wherein the evaluating of the consistency of the first set of executable code comprises:
testing results of the executing of the first set of executable code across multiple runs; and
determining whether the results provide different answers for a same input.
However, Ullah discloses:
wherein the evaluating of the consistency of the [LLM] comprises:
testing results of the [LLM] across multiple runs (Section 4.1. Evaluation for Deterministic Responses page 6, “To perform a rigorous comparison between LLMs and assess their capabilities, it is of critical importance that their responses are consistent, meaning that running the same test multiple times under identical parameters should provide the same final verdict (emphasis added).”); and
determining whether the results provide different answers for a same input (Section 4.1. Evaluation for Deterministic Responses page 6, “To perform a rigorous comparison between LLMs and assess their capabilities, it is of critical importance that their responses are consistent, meaning that running the same test multiple times under identical parameters should provide the same final verdict (emphasis added).”; Section 4.1. Evaluation for Deterministic Responses page 7, “We run each experiment ten times, and record how many times the model provides the same answer (emphasis added).”).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teaching of Ullah into the combined teachings of Gillman and Liu to include “wherein the evaluating of the consistency of the first set of executable code comprises: testing results of the executing of the first set of executable code across multiple runs; and determining whether the results provide different answers for a same input.” The modification would be obvious because one of ordinary skill in the art would be motivated to test the results of an LLM across multiple runs and determine whether it provides different answers for the same input in order to improve the reliability and consistency of LLM responses by determining which “parameters deliver the most consistent results” (Ullah, Section 4.1. Evaluation for Deterministic Responses page 6).
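As a hedged illustration only (query_model is a hypothetical stand-in for whatever routine invokes the model or executes the generated code; nothing below is taken from Ullah or the application), a consistency check that runs the same input multiple times and determines whether the results differ might be sketched as:

# Minimal illustrative sketch; query_model and its interface are hypothetical.
from collections import Counter

def consistency_check(query_model, prompt, runs=10):
    """Run the same input the requested number of times and report whether
    the results ever differ for that same input."""
    answers = [query_model(prompt) for _ in range(runs)]
    counts = Counter(answers)
    _, most_common_count = counts.most_common(1)[0]
    return {
        "distinct_answers": len(counts),
        "agreement_rate": most_common_count / runs,
        "consistent": len(counts) == 1,
    }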
As per Claim 7, the rejection of Claim 6 is incorporated [Examiner’s Remarks: Since Claim 2 stated only evaluating at least one from among an accuracy, robustness, and consistency of the first set of executable code and the Examiner chose evaluating an accuracy of the first set of executable code (the rejection of Claim 3), the rejection of Claim 7 is optional since it depends on Claim 6. However, in order to promote compact prosecution, the Examiner is including this rejection.]; and the combination of Gillman and Liu does not explicitly disclose:
wherein the testing of the results is performed for at least three runs and for at most ten runs.
However, Ullah discloses:
wherein the testing of the results is performed for at least three runs and for at most ten runs (Section 4.1. Evaluation for Deterministic Responses page 6, “To perform a rigorous comparison between LLMs and assess their capabilities, it is of critical importance that their responses are consistent, meaning that running the same test multiple times under identical parameters should provide the same final verdict (emphasis added).”; Section 4.1. Evaluation for Deterministic Responses page 7, “We run each experiment ten times, and record how many times the model provides the same answer (emphasis added).”).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teaching of Ullah into the combined teachings of Gillman and Liu to include “wherein the testing of the results is performed for at least three runs and for at most ten runs.” The modification would be obvious because one of ordinary skill in the art would be motivated to test the results of an LLM across multiple runs and determine whether it provides different answers for the same input in order to improve the reliability and consistency of LLM responses by determining which “parameters deliver the most consistent results” (Ullah, Section 4.1. Evaluation for Deterministic Responses page 6).
Claim 12 is an apparatus claim corresponding to method Claim 4 and is rejected for the same reasons as given in the rejection of that claim.
Claim 14 is an apparatus claim corresponding to method Claim 6 and is rejected for the same reasons as given in the rejection of that claim.
Claim 15 is an apparatus claim corresponding to method Claim 7 and is rejected for the same reasons as given in the rejection of that claim.
Claims 5 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Gillman in view of Liu and Ullah as applied to Claims 4 and 12 above, and further in view of US 2024/0104308 (hereinafter “Francis”).
As per Claim 5, the rejection of Claim 4 is incorporated [Examiner’s Remarks: Since Claim 2 stated only evaluating at least one from among an accuracy, robustness, and consistency of the first set of executable code and the Examiner chose evaluating an accuracy of the first set of executable code (the rejection of Claim 3), the rejection of Claim 5 is optional since it depends on Claim 4. However, in order to promote compact prosecution, the Examiner is including this rejection.]; and the combination of Gillman, Liu and Ullah does not explicitly disclose:
determining a degree of implicitness of information included in the first set of instructions with respect to the first task.
However, Francis discloses:
determining a degree of implicitness of information included in the first set of instructions with respect to the first task (paragraph [0028], “In some embodiments, the systems and methods described herein may be configured to, when given a high-level instruction or question, use the agent to the leverage commonsense and spatial knowledge to decompose the implicit navigation and/or manipulation task into tractable subtasks. For example, if the agent is initialized in the living room of a home then asked the question, “What color is the car?”, the agent may generate a plan for, first, searching for the car in the garage or outside the house on the driveway (emphasis added).”) [Examiner’s Remarks: Note that Francis discloses using an agent to leverage commonsense and spatial knowledge and decompose the implicit task into tractable subtasks when given a high-level instruction. One of ordinary skill in the art would readily comprehend that decomposing the implicit task within a high-level instruction into subtasks includes determining a degree of implicitness of the information within the instruction with respect to the task in order to leverage commonsense and spatial knowledge to decompose the task.].
Francis is within the same field of endeavor as the claimed invention regarding the determination of implicitness within task instructions.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teaching of Francis into the combined teachings of Gillman, Liu, and Ullah to include “determining a degree of implicitness of information included in the first set of instructions with respect to the first task.” The modification would be obvious because one of ordinary skill in the art would be motivated to determine a degree of implicitness within a high-level instruction and use an agent to decompose the implicit task into tractable subtasks by leveraging commonsense and spatial knowledge (domain knowledge), since agents that are “encouraged to learn reasoning strategies, on top of this domain knowledge, perform better than those that simply perform statistical pattern-matching” (Francis, paragraphs [0021] and [0028]).
Claim 13 is an apparatus claim corresponding to method Claim 5 and is rejected for the same reasons as given in the rejection of that claim.
Claims 8 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Gillman in view of Liu as applied to Claims 2 and 10 above, and further in view of Eghbali.
As per Claim 8, the rejection of Claim 2 is incorporated; and Gillman discloses “wherein the evaluating of the quality of the first set of executable code is performed by using [static analysis] (paragraph [0084], “The method 600 includes performing, by the machine learning engine, at least one validation [quality] check on the at least one candidate executable computer code (606) […] (the validation step is important because not all generated programs will be correct or viable; many will not run); performing security and validation checks such as: evaluating the code using static analysis […] (emphasis added).”),” but the combination of Gillman and Liu does not explicitly disclose:
wherein the evaluating of the quality of the first set of executable code is performed by using an evaluation dataset that is API-based.
However, Eghbali discloses:
an evaluation dataset that is API-based (Section 5.1 Experimental Setup pages 11-12, “We construct a dataset of API-related code completion tasks by removing API usages from the benchmark projects and by considering the removed code as the ground truth to be predicted by a model […] Overall, the evaluation dataset consists of 11 projects × 10 × 4 models = 440 code completion tasks (emphasis added).”).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teaching of Eghbali into the combined teachings of Gillman and Liu to include “wherein the evaluating of the quality of the first set of executable code is performed by using an evaluation dataset that is API-based.” The modification would be obvious because one of ordinary skill in the art would be motivated to utilize an evaluation dataset that is API-based in order to effectively evaluate API-based code completions of models based on the tasks within the dataset (Eghbali, Section 5.1 Experimental Setup pages 11-12). Moreover, one of ordinary skill in the art would be motivated to utilize both an API-based evaluation dataset of code completion tasks and a De-Hallucinator that improves the input given to an LLM to check whether model-generated code makes correct API calls in order to “improve the quality of code completions over the state-of-the-art baseline” (Eghbali, Section 7 Related Work page 19 & Section 8 Conclusion page 20).
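As a hedged illustration only (the task format and helper names are hypothetical and are not taken from Eghbali’s published implementation), an API-based evaluation of the kind cited above, which removes API usages, treats them as ground truth, and scores exact matches against predictions, might be sketched as:

# Minimal illustrative sketch; the task dictionaries and helper names are hypothetical.
import ast

def extract_api_usages(line_of_code):
    """Extract call expressions (access path plus parameters) from a line of code."""
    try:
        tree = ast.parse(line_of_code)
    except SyntaxError:
        return set()
    return {ast.unparse(node) for node in ast.walk(tree) if isinstance(node, ast.Call)}

def exact_api_match(tasks):
    """Percentage of ground-truth API usages that appear verbatim in the predictions."""
    matched = 0
    total = 0
    for task in tasks:
        truth = extract_api_usages(task["ground_truth"])
        predicted = extract_api_usages(task["prediction"])
        total += len(truth)
        matched += len(truth & predicted)
    return 100.0 * matched / total if total else 0.0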
Claim 16 is an apparatus claim corresponding to method Claim 8 and is rejected for the same reasons as given in the rejection of that claim.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
US 2025/0078822 (hereinafter “Perkins”) discloses a language model being prompted to select API calls and being provided with a request that includes a list of API calls and instructions to indicate an API call to be performed.
US 2025/0181332 (hereinafter “Madhur”) discloses a validation algorithm that checks if a code block only makes white-listed API calls and testing code to see if it matches expected values.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to FEVEN H HURUY whose telephone number is (571) 272-3826. The examiner can normally be reached Mon-Fri. 7:30am-3:45pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Wei Mui can be reached at (571) 272-3708. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/F.H.H./Examiner, Art Unit 2191 /WEI Y MUI/Supervisory Patent Examiner, Art Unit 2191