DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The six information disclosure statement (IDS) documents submitted on April 21, 2023; August 17, 2023; August 14, 2024; April 29, 2025; July 17, 2025; and November 25, 2025 are in compliance with the provisions of 37 CFR 1.97 and have been considered by the examiner.
Claim Objections
In claim 17, line 2, “(ii) an parallel layers selector search space” should read “(ii) a parallel layers selector search space.”
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea (mental process) without significantly more.
Claim 1:
Regarding claim 1, in step 1 of the 101-analysis set forth in MPEP 2106, the claim recites
“a system for generating one or more task-specific machine learning models for use in conjunction with one or more accelerated machine learning models configured for execution on one or more hardware accelerators, the system comprising: one or more processors…”, and a system or machine is one of the four statutory categories of invention.
In step 2A prong 1 of the 101-analysis set forth in the MPEP 2106, the examiner has determined that the following limitations recite a process that, under the broadest reasonable interpretation, covers a mental process but for recitation of generic computer components:
“identify a selected search space, the selected search space being selected from a plurality of pre-defined search spaces,” (this is considered a mental process, since a person can mentally evaluate and select a search space, see MPEP 2106.04(a)(2)(III)),
“determine a set of candidate model architectures from the selected search space utilizing model architecture search,” (mental process, a person can mentally evaluate and determine a set of model architectures from the selected search space, see MPEP 2106.04(a)(2)(III)),
“select one or more task-specific machine learning models from the trained set of task-specific machine learning models based upon an evaluation of performance of each trained task-specific machine learning model of the trained set of task-specific machine learning models,” (mental process, a person can mentally evaluate and select one or more task-specific machine learning models from the trained set of models, see MPEP 2106.04(a)(2)(III)),
If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation as a mental process but for the recitation of generic computer components, then it falls within the mental process grouping of abstract ideas. Accordingly, the claim “recites” an abstract idea.
In step 2A prong 2 of the 101-analysis set forth in MPEP 2106, the examiner has determined that the following additional elements do not integrate this judicial exception into a practical application:
“A system for generating one or more task-specific machine learning models for use in conjunction with one or more accelerated machine learning models configured for execution on one or more hardware accelerators, the system comprising: one or more processors,” (Using processors is considered mere instructions to apply an exception using generic computer – see MPEP 2106.05(f)),
“one or more hardware storage devices that store instructions that are executable by the one or more processors to configure the system…,” (Using hardware storage devices is considered mere instructions to apply an exception using generic computer – see MPEP 2106.05(f)),
“train a set of task-specific machine learning models adapted for performance of one or more particular machine learning tasks, wherein each task-specific machine learning model of the set of task-specific machine learning models comprises a model architecture from the set of candidate model architectures determined from the selected search space utilizing model architecture search, and wherein each task-specific machine learning model is trained using a set of training data comprising (i) input data comprising at least a set of embeddings generated by one or more accelerated machine learning models in response to input, and (ii) task-specific ground truth output comprising one or more ground truth labels associated with the one or more particular machine learning tasks,” (Training a set of models is considered mere instructions to apply an exception using generic computer – see MPEP 2106.05(f))
Since the claim as a whole, looking at the additional elements individually and in combination, does not contain any other additional elements that are indicative of integration into a practical application, the claim is “directed” to an abstract idea.
In step 2B of the 101-analysis set forth in the 2019 PEG, the examiner has determined that the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.
As discussed above, the additional elements reciting the one or more processors, the one or more hardware storage devices, and the training step recite mere instructions to apply the judicial exception using generic computer components, which are not indicative of significantly more.
Considering the additional elements individually and in combination, and the claim as a whole, the additional elements do not provide significantly more than the abstract idea. Therefore, the claim is not patent eligible.
Claim 2:
Regarding claim 2, it is dependent upon claim 1, and thereby incorporates the limitations of, and corresponding analysis applied to claim 1. Further, claim 2 recites the following additional elements:
“The system of claim 1, wherein the instructions are executable by the one or more processors to further configure the system to: receive a set of input embeddings generated by the one or more accelerated machine learning models,” (In step 2A, prong 2, this recites mere data gathering, which is considered insignificant extra-solution activity – see MPEP 2106.05(g)). In step 2B, this insignificant extra-solution activity is well-understood, routine, and conventional activity, which includes receiving or transmitting data over a network – see Symantec, 838 F.3d at 1321, 120 USPQ2d at 1362 (utilizing an intermediary computer to forward information); TLI Communications LLC v. AV Auto. LLC, 823 F.3d 607, 610, 118 USPQ2d 1744, 1745 (Fed. Cir. 2016); MPEP 2106.05(d)(II)(i),
“generate task-specific output by utilizing the set of input embeddings as input to the one or more task-specific machine learning models,” (In step 2A, prong 2, this is considered mere instructions to apply an exception using generic computer, with the generation of task-specific output performed by any generic computer, see MPEP 2106.05(f)). (In step 2B, this is also considered mere instructions to apply an exception using generic computer – see MPEP 2106.05(f)).
Since the claim does not recite additional elements that either integrate the judicial exception into a practical application or provide significantly more than the judicial exception, the claim is not patent eligible.
Claim 3:
Regarding claim 3, it is dependent upon claim 1, and thereby incorporates the limitations of, and corresponding analysis applied to claim 1. Further, claim 3 recites the following additional element:
“The system of claim 1, wherein the one or more accelerated machine learning models are configured to be executed on one or more hardware accelerators,” (In step 2A, prong 2, this is considered mere instructions to apply an exception using generic computer – see MPEP 2106.05(f)). (In step 2B, this is also considered mere instructions to apply an exception using generic computer – see MPEP 2106.05(f)).
Since the claim does not recite additional elements that either integrate the judicial exception into a practical application or provide significantly more than the judicial exception, the claim is not patent eligible.
Claim 4:
Regarding claim 4, it is dependent upon claim 3, and thereby incorporates the limitations of, and corresponding analysis applied to claim 3. Further, claim 4 recites the following additional element:
“The system of claim 3, wherein the one or more hardware accelerators comprise one or more field-programmable gate arrays (FPGAs), graphics processing units (GPUs), tensor processing units (TPUs), or application-specific integrated circuits (ASICs),” (In step 2A, prong 2, this is considered mere instructions to apply an exception using generic computer – see MPEP 2106.05(f)). (In step 2B, this is also considered mere instructions to apply an exception using generic computer – see MPEP 2106.05(f)).
Since the claim does not recite additional elements that either integrate the judicial exception into a practical application or provide significantly more than the judicial exception, the claim is not patent eligible.
Claim 5:
Regarding claim 5, it is dependent upon claim 1, and thereby incorporates the limitations of, and corresponding analysis applied to claim 1. Further, claim 5 recites the following additional element:
“The system of claim 1, wherein the plurality of pre-defined search spaces comprises at least (i) a parallel layers search space and (ii) a parallel layers selector search space,” (In step 2A, prong 2, this is considered mere instructions to apply an exception using generic computer – see MPEP 2106.05(f)). (In step 2B, this is also considered mere instructions to apply an exception using generic computer – see MPEP 2106.05(f)).
Since the claim does not recite additional elements that either integrate the judicial exception into a practical application or provide significantly more than the judicial exception, the claim is not patent eligible.
Claim 6:
Regarding claim 6, it is dependent upon claim 1, and thereby incorporates the limitations of, and corresponding analysis applied to claim 1. Further, claim 6 recites the following abstract idea:
“The system of claim 1, wherein the selected search space is selected based upon one or more computational constraints,” (this is considered a mental process, since a person can mentally evaluate and select a search space from evaluating computational constraints, see MPEP 2106.04(a)(2)(III)),
If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation as a mental process but for the recitation of generic computer components, then it falls within the mental process grouping of abstract ideas. Accordingly, the claim “recites” an abstract idea.
Since the claim does not recite additional elements that either integrate the judicial exception into a practical application or provide significantly more than the judicial exception, the claim is not patent eligible.
Claim 7:
Regarding claim 7, it is dependent upon claim 5, and thereby incorporates the limitations of, and corresponding analysis applied to claim 5. Further, claim 7 recites the following additional element:
“The system of claim 5, wherein, when the selected search space comprises the parallel layers selector search space, the input data further comprises intermediate output generated by the one or more accelerated machine learning models,” (In step 2A, prong 2, this is considered mere instructions to apply an exception using generic computer, with the generation of intermediate output by the one or more accelerated machine learning models performed by any generic computer, see MPEP 2106.05(f)). (In step 2B, this is also considered mere instructions to apply an exception using generic computer – see MPEP 2106.05(f)).
Since the claim does not recite additional elements that either integrate the judicial exception into a practical application or provide significantly more than the judicial exception, the claim is not patent eligible.
Claim 8:
Regarding claim 8, it is dependent upon claim 1, and thereby incorporates the limitations of, and corresponding analysis applied to claim 1. Further, claim 8 recites the following additional element:
“The system of claim 1, wherein determining the set of candidate model architectures comprises utilizing a neural architecture search framework,” (In step 2A, prong 2, this is considered mere instructions to apply an exception using generic computer, with the utilization of a neural architecture search framework performed by any generic computer, see MPEP 2106.05(f)). (In step 2B, this is also considered mere instructions to apply an exception using generic computer – see MPEP 2106.05(f)).
Since the claim does not recite additional elements that either integrate the judicial exception into a practical application or provide significantly more than the judicial exception, the claim is not patent eligible.
Claim 9:
Regarding claim 9, it is dependent upon claim 8, and thereby incorporates the limitations of, and corresponding analysis applied to claim 8.
Further, claim 9 recites the following abstract ideas:
“The system of claim 8, wherein determining the set of candidate model architectures comprises: generating a set of initial candidate model architectures by sampling from the selected search space,” (this is considered a mental process, a person can mentally evaluate a search space, and generate a set of models by sampling from the search space, see MPEP 2106.04(a)(2)(III)),
“evaluating whether each of the initial candidate model architectures of the set of initial candidate model architectures satisfies one or more performance metrics,” (this is considered a mental process, since a person can mentally evaluate and judge to see if a set of models satisfies performance metrics, see MPEP 2106.04(a)(2)(III)),
“defining the set of candidate model architectures as the initial candidate model architectures of the set of initial candidate model architectures that satisfy the one or more performance metrics,” (this is considered a mental process, since a person can mentally evaluate, judge, then define a set of models to see if they satisfy performance metrics, see MPEP 2106.04(a)(2)(III)),
If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation as a mental process but for the recitation of generic computer components, then it falls within the mental process grouping of abstract ideas. Accordingly, the claim “recites” an abstract idea.
Further, claim 9 recites the following additional elements:
“training initial candidate model architectures of the set of initial candidate model architectures using a set of NAS training data,” (In step 2A, prong 2, training models is considered mere instructions to implement an abstract idea using generic computer – see MPEP 2106.05(f)). (In step 2B, this is also considered mere instructions to implement an abstract idea using generic computer – see MPEP 2106.05(f)),
Since the claim does not recite additional elements that either integrate the judicial exception into a practical application or provide significantly more than the judicial exception, the claim is not patent eligible.
Claim 10:
Regarding claim 10, it is dependent upon claim 9, and thereby incorporates the limitations of, and corresponding analysis applied to claim 9. Further, claim 10 recites the following additional element:
“The system of claim 9, wherein the set of NAS training data also comprises (i) input data generated by the one or more accelerated machine learning models and (ii) task-specific ground truth output,” (In step 2A, prong 2, this is considered mere instructions to apply an exception using generic computer, with the generation of input data and task-specific ground truth output performed by any generic computer, see MPEP 2106.05(f)). (In step 2B, this is also considered mere instructions to apply an exception using generic computer – see MPEP 2106.05(f)).
Since the claim does not recite additional elements that either integrate the judicial exception into a practical application or provide significantly more than the judicial exception, the claim is not patent eligible.
Claim 11:
Regarding claim 11, it is dependent upon claim 10, and thereby incorporates the limitations of, and corresponding analysis applied to claim 10. Further, claim 11 recites the following additional element:
“The system of claim 10, wherein the input data of the set of NAS training data comprises intermediate output generated by the one or more accelerated machine learning models when the selected search space comprises a parallel layers selector search space,” (In step 2A, prong 2, this is considered mere instructions to apply an exception using generic computer, with the generation of intermediate output by the one or more accelerated machine learning models performed by any generic computer, see MPEP 2106.05(f)). (In step 2B, this is also considered mere instructions to apply an exception using generic computer – see MPEP 2106.05(f)),
Since the claim does not recite additional elements that either integrate the judicial exception into a practical application or provide significantly more than the judicial exception, the claim is not patent eligible.
Claim 12:
Regarding claim 12, it is dependent upon claim 9, and thereby incorporates the limitations of, and corresponding analysis applied to claim 9. Further, claim 12 recites the following abstract idea:
“The system of claim 9, wherein determining the set of candidate model architectures comprises generating a set of weights for each candidate model architecture of the set of candidate model architectures,” (this is considered a mental process, since a person can mentally evaluate and generate a set of weights (seen as numeric quantities) for each candidate model, see MPEP 2106.04(a)(2)(III)),
If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation as a mental process but for the recitation of generic computer components, then it falls within the mental process grouping of abstract ideas. Accordingly, the claim “recites” an abstract idea.
Since the claim does not recite additional elements that either integrate the judicial exception into a practical application or provide significantly more than the judicial exception, the claim is not patent eligible.
Claim 13:
Regarding claim 13, it is dependent upon claim 12, and thereby incorporates the limitations of, and corresponding analysis applied to claim 12. Further, claim 13 recites the following additional element:
“The system of claim 12, wherein training the set of task-specific machine learning models based upon the set of candidate model architectures, comprises refraining from using the set of weights for each candidate model architecture of the set of candidate model architectures” (In step 2A, prong 2, this is considered mere instructions to apply an exception using generic computer, with the training of models based upon candidate model architectures, while refraining from using the set of weights, performed by any generic computer, see MPEP 2106.05(f)). (In step 2B, this is also considered mere instructions to apply an exception using generic computer – see MPEP 2106.05(f)),
Since the claim does not recite additional elements that either integrate the judicial exception into a practical application or provide significantly more than the judicial exception, the claim is not patent eligible.
Claim 14:
Regarding claim 14, it is dependent upon claim 1, and thereby incorporates the limitations of, and corresponding analysis applied to claim 1. Further, claim 14 recites the following additional element:
“The system of claim 1, wherein the evaluation of performance of each task-specific machine learning model of the set of task-specific machine learning models utilizes a set of validation data, wherein the set of validation data also comprises (i) input data generated by the one or more accelerated machine learning models and (ii) task-specific ground truth output,” (In step 2A, prong 2, this is considered mere instructions to apply an exception using generic computer, with the evaluation of model performance utilizing validation data performed by any generic computer, see MPEP 2106.05(f)). (In step 2B, this is also considered mere instructions to apply an exception using generic computer – see MPEP 2106.05(f)),
Since the claim does not recite additional elements that either integrate the judicial exception into a practical application or provide significantly more than the judicial exception, the claim is not patent eligible.
Claim 15:
Regarding claim 15, in step 1 of the 101-analysis set forth in MPEP 2106, the claim recites
“a system for generating a set of model architectures for a task-specific machine learning model for use in conjunction with an accelerated machine learning model, the system comprising: one or more processors;…” and a system or machine is one of the four statutory categories of invention.
In step 2A prong 1 of the 101-analysis set forth in the MPEP 2106, the examiner has determined that the following limitations recite a process that, under the broadest reasonable interpretation, covers a mental process but for recitation of generic computer components:
“identify a selected search space, the selected search space being selected from a plurality of pre-defined search spaces,” (this is considered a mental process, a person can mentally evaluate and identify a search space, see MPEP 2106.04(a)(2)(III)),
“determine a set of candidate model architectures from the selected search space utilizing model architecture search, wherein determining the set of candidate model architectures,” (mental process, a person can mentally evaluate and determine a set of models from a selected search space, see MPEP 2106.04(a)(2)(III)),
“evaluating whether each of the initial candidate model architectures of the set of initial candidate model architectures satisfies one or more performance metrics,” (this is considered a mental process, since a person can mentally evaluate and judge to see if a set of models satisfies performance metrics, see MPEP 2106.04(a)(2)(III)),
“defining the set of candidate model architectures as the initial candidate model architectures of the set of initial candidate model architectures that satisfy the one or more performance metrics…” (this is considered a mental process, since a person can mentally evaluate, judge, then define a set of models to see if they satisfy one or more performance metrics, see MPEP 2106.04(a)(2)(III)),
If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation as a mental process but for the recitation of generic computer components, then it falls within the mental process grouping of abstract ideas. Accordingly, the claim “recites” an abstract idea.
In step 2A prong 2 of the 101-analysis set forth in MPEP 2106, the examiner has determined that the following additional elements do not integrate this judicial exception into a practical application:
“A system for generating a set of model architectures for a task-specific machine learning model for use in conjunction with an accelerated machine learning model, the system comprising: one or more processors,” (Mere instructions to apply an exception using generic computer – see MPEP 2106.05(f))
“one or more hardware storage devices that store instructions that are executable by the one or more processors to configure the system,” (Mere instructions to apply an exception using generic computer – see MPEP 2106.05(f))
“…comprises: generating a set of initial candidate model architectures by sampling from the selected search space,” (Generating model architectures is similar to outputting models and is considered mere instructions to apply an exception using generic computer – see MPEP 2106.05(f))
“training initial candidate model architectures of the set of initial candidate model architectures using a set of model architecture search training data, wherein the set of model architecture search training data comprises (i) input data generated by one or more accelerated machine learning models and (ii) task-specific ground truth output, wherein the input data comprises at least a set of embeddings generated by the one or more accelerated machine learning models,” (Training models is considered mere instructions to apply an exception using generic computer – see MPEP 2106.05(f)),
“and output the set of candidate model architectures,” (In step 2A, prong 2, outputting models recites mere data outputting, which is considered insignificant extra-solution activity – see MPEP 2106.05(g); see also Mayo, 566 U.S. at 79, 101 USPQ2d at 1968; OIP Techs., Inc. v. Amazon.com, Inc., 788 F.3d 1359, 1363, 115 USPQ2d 1090, 1092-93 (Fed. Cir. 2015)),
Since the claim as a whole, looking at the additional elements individually and in combination, does not contain any other additional elements that are indicative of integration into a practical application, the claim is “directed” to an abstract idea.
In step 2B of the 101-analysis set forth in the 2019 PEG, the examiner has determined that the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.
As discussed above, the additional elements reciting the one or more processors, the one or more hardware storage devices, the generating step, and the training step recite mere instructions to apply the judicial exception using generic computer components, which are not indicative of significantly more. The additional element of outputting the set of candidate model architectures recites mere data outputting and is considered insignificant extra-solution activity. In step 2B, this insignificant extra-solution activity is well-understood, routine, and conventional activity, which includes receiving or transmitting data over a network – see Symantec, 838 F.3d at 1321, 120 USPQ2d at 1362 (utilizing an intermediary computer to forward information); TLI Communications LLC v. AV Auto. LLC, 823 F.3d 607, 610, 118 USPQ2d 1744, 1745 (Fed. Cir. 2016); MPEP 2106.05(d)(II)(i).
Considering the additional elements individually and in combination, and the claim as a whole, the additional elements do not provide significantly more than the abstract idea. Therefore, the claim is not patent eligible.
Claim 16:
Regarding claim 16, it is dependent upon claim 15, and thereby incorporates the limitations of, and corresponding analysis applied to claim 15. Further, claim 16 recites the following additional element:
“The system of claim 15, wherein the one or more accelerated machine learning models are configured to be executed on one or more hardware accelerators,” (In step 2A, prong 2, this is considered mere instructions to apply an exception using generic computer – see MPEP 2106.05(f)). (In step 2B, this is also considered mere instructions to apply an exception using generic computer – see MPEP 2106.05(f)),
Since the claim does not recite additional elements that either integrate the judicial exception into a practical application or provide significantly more than the judicial exception, the claim is not patent eligible.
Claim 17:
Regarding claim 17, it is dependent upon claim 15, and thereby incorporates the limitations of, and corresponding analysis applied to claim 15. Further, claim 17 recites the following additional element:
“The system of claim 15, wherein the plurality of pre-defined search spaces comprises at least (i) a parallel layers search space and (ii) an parallel layers selector search space,” (In step 2A, prong 2, this is considered mere instructions to apply an exception using generic computer – see MPEP 2106.05(f)). (In step 2B, this is also considered mere instructions to apply an exception using generic computer – see MPEP 2106.05(f)).
Since the claim does not recite additional elements that either integrate the judicial exception into a practical application or provide significantly more than the judicial exception, the claim is not patent eligible.
Claim 18:
Regarding claim 18, it is dependent upon claim 15, and thereby incorporates the limitations of, and corresponding analysis applied to claim 15. Further, claim 18 recites the following abstract idea:
“The system of claim 15, wherein determining the set of candidate model architectures further comprises generating a set of weights for each candidate model architecture of the set of candidate model architectures,” (this is considered a mental process, since a person can mentally evaluate and generate a set of weights (seen as numeric quantities) for each candidate model, see MPEP 2106.04(a)(2)(III)),
If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation as a mental process but for the recitation of generic computer components, then it falls within the mental process grouping of abstract ideas. Accordingly, the claim “recites” an abstract idea.
Since the claim does not recite additional elements that either integrate the judicial exception into a practical application or provide significantly more than the judicial exception, the claim is not patent eligible.
Claim 19:
Regarding claim 19, it is dependent upon claim 18, and thereby incorporates the limitations of, and corresponding analysis applied to claim 18. Further, claim 19 recites the following additional element:
“The system of claim 18, wherein the instructions are executable by the one or more processors, to further configure the system to discard the set of weights for each candidate model architecture of the set of candidate model architectures,” (In step 2A, prong 2, this is considered mere instructions to apply an exception using generic computer – see MPEP 2106.05(f)). (In step 2B, this is also considered mere instructions to apply an exception using generic computer – see MPEP 2106.05(f)),
Since the claim does not recite additional elements that either integrate the judicial exception into a practical application or provide significantly more than the judicial exception, the claim is not patent eligible.
Claim 20:
Regarding claim 20, in step 1 of the 101-analysis set forth in MPEP 2106, the claim recites
“a system for generating one or more task-specific machine learning models for use in conjunction with one or more accelerated machine learning models, the system comprising: one or more processors …” and a system or machine is one of the four statutory categories of invention.
In step 2A prong 1 of the 101-analysis set forth in the MPEP 2106, the examiner has determined that the following limitations recite a process that, under the broadest reasonable interpretation, covers a mental process but for recitation of generic computer components:
“the set of candidate model architectures being generated by: identifying a selected search space, the selected search space being selected from a plurality of pre-defined search spaces,” (this is considered a mental process, a person can mentally evaluate and identify a search space, see MPEP 2106.04(a)(2)(III)),
“and determining the set of candidate model architectures from the selected search space utilizing model architecture search,” (this is considered a mental process, a person can mentally evaluate and determine a set of candidate model architectures from the selected search space, see MPEP 2106.04(a)(2)(III)),
“an evaluation of performance of each task-specific machine learning model of the set of task-specific machine learning models,” (this is considered a mental process, a person can mentally evaluate performance of each task-specific machine learning model of a set of models, see MPEP 2106.04(a)(2)(III)),
If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation as a mental process but for the recitation of generic computer components, then it falls within the mental process grouping of abstract ideas. Accordingly, the claim “recites” an abstract idea.
In step 2A prong 2 of the 101-analysis set forth in MPEP 2106, the examiner has determined that the following additional elements do not integrate this judicial exception into a practical application:
“A system for generating one or more task-specific machine learning models for use in conjunction with one or more accelerated machine learning models, the system comprising: one or more processors,” (Using processors is considered mere instructions to apply an exception using a generic computer – see MPEP 2106.05(f)),
“and one or more hardware storage devices that store instructions that are executable by the one or more processors to configure the system to: access a set of candidate model architectures,” (This is considered mere instructions to apply an exception using a generic computer – see MPEP 2106.05(f)),
“train a set of task-specific machine learning models based upon the set of candidate model architectures, wherein each task-specific machine learning model comprises a model architecture from the set of candidate model architectures,” (Training a set of models is considered mere instructions to apply an exception using a generic computer – see MPEP 2106.05(f)),
“and wherein each task-specific machine learning model is trained using a set of training data comprising (i) input data generated by one or more accelerated machine learning models and (ii) task-specific ground truth output, wherein the input data comprises at least a set of embeddings generated by the one or more accelerated machine learning models,” (This is considered mere instructions to apply an exception using a generic computer – see MPEP 2106.05(f)),
“and output one or more task-specific machine learning models from the set of task-specific machine learning models,” (In step 2A, prong 2, this recites mere data outputting, which is considered insignificant extra-solution activity – see MPEP 2106.05(g); see also Mayo, 566 U.S. at 79, 101 USPQ2d at 1968; OIP Techs., Inc. v. Amazon.com, Inc., 788 F.3d 1359, 1363, 115 USPQ2d 1090, 1092-93 (Fed. Cir. 2015)),
Since the claim as a whole, looking at the additional elements individually and in combination, does not contain any other additional elements that are indicative of integration into a practical application, the claim is “directed” to an abstract idea.
In step 2B of the 101-analysis set forth in the 2019 PEG, the examiner has determined that the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.
As discussed above, additional elements iv, v, vi, and vii recite mere instructions to apply the judicial exception using generic computer components, which are not indicative of significantly more. Additional element viii recites mere data outputting and is considered insignificant extra-solution activity. In step 2B, this insignificant extra-solution activity is well-understood, routine, and conventional activity, which includes receiving or transmitting data over a network; see Symantec, 838 F.3d at 1321, 120 USPQ2d at 1362 (utilizing an intermediary computer to forward information); TLI Communications LLC v. AV Auto. LLC, 823 F.3d 607, 610, 118 USPQ2d 1744, 1745 (Fed. Cir. 2016); see MPEP 2106.05(d)(II)(i).
Considering the additional elements individually and in combination, and the claim as a whole, the additional elements do not provide significantly more than the abstract idea. Therefore, the claim is not patent eligible.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-12, 14-18, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Kokiopoulou E. et al. (U.S. PG Pub. 20220121906 A1), published on April 21, 2022, cited in the IDS on August 14, 2024 (hereafter, Kokiopoulou), in view of Hong S. et al., “DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation”, available at https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9923883, published on October 1, 2022 (hereafter, Hong).
Claim 1:
Regarding claim 1, Kokiopoulou teaches “a system for generating one or more task-specific machine learning models for use in conjunction with one or more accelerated machine learning models configured for execution on one or more hardware accelerators, the system comprising: one or more processors;”
See Kokiopoulou in paragraph [0068], where “the term ‘data processing apparatus’ refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can also be, or further include, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). The apparatus can optionally include, in addition to hardware, code that creates an execution environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.” Here, Kokiopoulou shows running models on one or more processors.
Further, Kokiopoulou teaches “and one or more hardware storage devices that store instructions that are executable by the one or more processors to configure the system to: identify a selected search space, the selected search space being selected from a plurality of pre-defined search spaces;”
See Kokiopoulou in paragraph [0076] describing a "data processing apparatus for implementing machine learning models can also include, for example, special-purpose hardware accelerator units for processing common and compute-intensive parts of machine learning training or production, i.e., inference, workloads." Here, Kokiopoulou shows hardware accelerator units, which relate to hardware storage devices used with the processors. Further, see Kokiopoulou in paragraphs [0034, 0036] describing "to determine the final architecture for the task neural network, the system 100 generates, from the continuous search space, a candidate architecture (e.g., candidate architecture 106 ) for the task neural network for performing the target machine learning task. The search space is represented by the above set of continuous architecture parameters.... In particular, to generate a candidate architecture (e.g., candidate architecture 106 ) from the search space, the system 100 generates new values for the set of architecture parameters from current values of the set of architecture parameters. The system 100 can generate the new values by performing gradient ascent search or random search (or another approximate optimization method) from the current values of the set of architecture parameters." Here, Kokiopoulou shows that the selected search space is selected by a system that generates new values for the set of architecture parameters that are part of the search space from current values of the set of architecture parameters, where the current values correspond to a number of pre-defined search spaces.
Further, Kokiopoulou teaches “determine a set of candidate model architectures from the selected search space utilizing model architecture search;”
See Kokiopoulou in paragraph [0034], which mentions that "to determine the final architecture for the task neural network, the system 100 generates, from the continuous search space, a candidate architecture (e.g., candidate architecture 106 ) for the task neural network for performing the target machine learning task. The search space is represented by the above set of continuous architecture parameters." Further, see Kokiopoulou in paragraph [0049] describing "the system generates, from a search space defining a plurality of architectures, a candidate architecture for the task neural network for performing the target machine learning task (step 204 ). The search space is represented by a set of continuous architecture parameters." Here, Kokiopoulou describes determining a set of candidate model architectures from the selected search space utilizing model architecture search, where paragraph [0049] shows the system generates from a search space defining a plurality of architectures, which represent a set of candidate model architectures.
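For illustration, the candidate-generation step Kokiopoulou describes (producing new values of the continuous architecture parameters from current values by random search or gradient ascent) can be sketched as follows. This is a minimal sketch under assumptions, not Kokiopoulou's implementation; the parameter names alpha, beta, and gamma loosely mirror the parametrization, activation, and embedding weights, and the perturbation scheme is assumed.

```python
import random

def generate_candidates(current_params, num_candidates=3, noise=0.1):
    """Propose candidate architectures by randomly perturbing the
    current continuous architecture parameters (random search)."""
    return [
        {name: value + random.uniform(-noise, noise)
         for name, value in current_params.items()}
        for _ in range(num_candidates)
    ]

# Hypothetical current values of the continuous architecture parameters.
current = {"alpha": 0.5, "beta": 0.2, "gamma": 0.8}
candidates = generate_candidates(current)
```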
Further, Kokiopoulou teaches “select one or more task-specific machine learning models from the trained set of task-specific machine learning models based upon an evaluation of performance of each trained task-specific machine learning model of the trained set of task-specific machine learning models.”
See Kokiopoulou in paragraph [0007] describing "the described techniques identify an effective architecture for performing the new task by selecting, among candidate architectures, an architecture that has a maximum performance estimated by an evaluator neural network. The described techniques further use a continuous parametrization of model architecture which allows for efficient gradient-based optimization of the estimated performance. In particular, the best candidate architecture can be efficiently identified, i.e. identified in a manner that makes efficient use of computational resources, by maximizing the estimated performance with respect to the continuous architecture parameters with simple gradient ascent. In addition, by training the evaluator neural network to estimate the performance of input architectures on a task using meta-features and the previous model training experiments performed on related tasks, the techniques can leverage transfer learning across different training datasets associated with different tasks, thus significantly reducing the computational costs of neural network search that conventional neural network search systems would require." Here, Kokiopoulou teaches selecting an architecture that has a maximum performance estimated by an evaluator neural network, identified by maximizing the estimated performance. The term "architecture" is used synonymously with "model," and paragraph [0007] corresponds to selecting one or more task-specific machine learning models from a trained set of models based upon an evaluation of performance of each trained model of the set.
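The selection step quoted above, choosing the candidate whose estimated performance is maximal, reduces to an argmax over candidates. A minimal sketch, assuming a toy scoring function in place of Kokiopoulou's evaluator neural network:

```python
def select_best(candidates, evaluator):
    """Keep the candidate architecture with the maximum estimated
    performance, as scored by the evaluator."""
    return max(candidates, key=evaluator)

# Hypothetical candidates; the lambda is a stand-in evaluator that
# simply rewards a larger "alpha" value.
candidates = [{"alpha": 0.2}, {"alpha": 0.9}, {"alpha": 0.5}]
best = select_best(candidates, lambda arch: arch["alpha"])
# best is {"alpha": 0.9}
```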
Further, Kokiopoulou teaches “train a set of task-specific machine learning models adapted for performance of one or more particular machine learning tasks, wherein each task-specific machine learning model of the set of task-specific machine learning models comprises a model architecture from the set of candidate model architectures determined from the selected search space utilizing model architecture search,”
See Kokiopoulou in paragraphs [0023-0025] describing "FIG. 1 shows an example neural architecture search system 100 configured to determine a final architecture for a task neural network that is configured to perform a target machine learning task. ... The system 100 receives a target training dataset 102 that is associated with the target machine learning task, i.e., that is a dataset on which a neural network should be trained in order to be able to perform the target task. The system 100 can receive the target training dataset 102 in any of a variety of ways. For example, the system 100 can receive the target training dataset 102 as an upload from a remote user of the system over a data communication network, e.g., using an application programming interface (API) made available by the system 100 . ... The system 100 then generates a target meta-features tensor 104 for the target training dataset 102. The target training dataset 102 includes a plurality of samples and a respective label for each of the samples. For example, if the target machine learning task is an image classification or recognition task, a sample in the dataset 102 can be an image and its respective label can be a ground-truth output that includes scores for each of a set of object classes, with each score representing the likelihood that the image contains an image of an object belonging to the object class. The target meta-features tensor 104 represents features (e.g., characteristics and statistics) of the target training dataset 102."
[Kokiopoulou, FIGS. 1 and 2 reproduced here]
Further, Kokiopoulou notes in paragraph [0046] that "FIG. 2 is a flow diagram of an example process 200 for determining a final architecture for a task neural network to perform a target machine learning task. For convenience, the process 200 will be described as being performed by a system of one or more computers located in one or more locations. For example, a neural architecture search system, e.g., the neural architecture search system 100 of FIG. 1, appropriately programmed, can perform the process 200." Here, Kokiopoulou teaches training a set of models to perform target machine learning tasks (i.e., particular machine learning tasks), where each task-specific machine learning model of the set of task-specific machine learning models comprises a model architecture from the set of candidate model architectures determined from the selected search space utilizing model architecture search. In FIGS. 1 and 2, Kokiopoulou shows that the candidate models were determined from the selected search space using neural architecture search (i.e., model architecture search). See Kokiopoulou for more information in paragraphs [0018-0022] regarding the various types of machine learning tasks performed by the models.
Further, Kokiopoulou teaches “… and (ii) task-specific ground truth output comprising one or more ground truth labels associated with the one or more particular machine learning tasks;”
See Kokiopoulou in paragraph [0025] describing “the system 100 then generates a target meta-features tensor 104 for the target training dataset 102. The target training dataset 102 includes a plurality of samples and a respective label for each of the samples. For example, if the target machine learning task is an image classification or recognition task, a sample in the dataset 102 can be an image and its respective label can be a ground-truth output that includes scores for each of a set of object classes, with each score representing the likelihood that the image contains an image of an object belonging to the object class.” Here, Kokiopoulou teaches training data in which each sample’s label is a ground-truth output comprising scores representing the likelihood that an object belongs to an object class associated with a machine learning task (i.e., task-specific ground truth output comprising one or more ground truth labels associated with the machine learning tasks).
However, Kokiopoulou did not explicitly teach “and wherein each task-specific machine learning model is trained using a set of training data comprising (i) input data comprising at least a set of embeddings generated by one or more accelerated machine learning models in response to input,”
In an analogous system, Hong teaches “and wherein each task-specific machine learning model is trained using a set of training data comprising (i) input data comprising at least a set of embeddings generated by one or more accelerated machine learning models in response to input,”
See Hong in the abstract, page 616, describing “a multi-FPGA acceleration appliance that executes GPT-2 model inference end-to-end with low latency and high throughput in both summarization and generation stages.” See Hong in the Introduction section, page 617, describing “DFX, a multi-FPGA acceleration appliance that specializes in text generation workloads covering end-to-end inference of variously sized GPT models… The FPGA based accelerator provides fully reprogrammable hardware to support new operations and larger dimensions of the evolving transformer with minimum cost.” See page 617, section II. Background, part A. GPT Language Model, where Hong shows “GPT is able to remove the encoder by using an alternate method called token embedding, a process that uses pre-trained matrices in place of the encoder.” This illustrates that the GPT-2 model Hong describes is an accelerator-based or accelerated model, and is pre-trained.
Further, Hong in section II Background, part A, pages 617-618, describes “GPT-2 Structure [where] the token embedding, located at the beginning of the decoder, is responsible for converting an input word(s) into an embedding vector. The input word is converted to the numeric token ID based on a dictionary. Then, the pre-trained matrices, word token embedding (WTE) and word position embedding (WPE), are indexed with the token ID to obtain the corresponding vectors. WTE contains token-related encoding, and WPE contains position-related encoding. The two vectors are added to get the embedding vector. LM head, located at the end of the decoder, has the opposite role to the token embedding. It converts the output embedding vector into the token ID. ... The selected token ID represents the generated word.” Here, Hong teaches that the model embeddings include the token ID, which represents the set of embeddings generated by one or more accelerated machine learning models in response to input. Since pre-training is part of the training process, Hong overall teaches a model that is trained using input data generated by one or more accelerated machine learning models, where that input data comprises a set of embeddings generated by the one or more accelerated machine learning models.
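For illustration, the token-embedding arithmetic Hong describes (index the pre-trained WTE matrix by token ID, index the WPE matrix, then add the two vectors to obtain the embedding vector) can be sketched as follows. This is a minimal sketch assuming GPT-2's usual convention that WPE is indexed by token position; the reduced model dimension, the random matrices standing in for the pre-trained ones, and the sample token IDs are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_positions, d_model = 50257, 1024, 8  # d_model shrunk for illustration

# Stand-ins for GPT-2's pre-trained embedding matrices.
WTE = rng.standard_normal((vocab_size, d_model))      # word token embedding
WPE = rng.standard_normal((max_positions, d_model))   # word position embedding

def embed(token_ids):
    """Index WTE by token ID and WPE by position, then add the two
    vectors to obtain each embedding vector."""
    positions = np.arange(len(token_ids))
    return WTE[token_ids] + WPE[positions]

vectors = embed(np.array([15496, 995]))  # two hypothetical token IDs
```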
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the base reference of Kokiopoulou with the teachings of Hong by taking Kokiopoulou’s method of performing neural architecture search to generate candidate model architectures for task-specific models using accelerators, and incorporating Hong’s teaching that accelerated machine learning models generate training input data comprising at least a set of embeddings.
One of ordinary skill in the art would be motivated to do so because integrating Hong’s framework into the methods of Kokiopoulou would bring an “FPGA-based accelerator [that] provides fully reprogrammable hardware to support new operations and larger dimensions of the evolving transformer with minimum cost” (Hong, Introduction, page 617), and “a multi-device system that adopts model parallelism and efficient network is necessary to maximize the amount of parallel computation with minimal additional latency” (Hong, section III Motivation, part C. Parallel Computing, page 619).
Claim 2:
Regarding claim 2, Kokiopoulou in view of Hong teaches the limitations in claim 1.
Further, Kokiopoulou teaches “generate task-specific output by utilizing the set of input embeddings as input to the one or more task-specific machine learning models,”
See Kokiopoulou in paragraphs [0017-0018], describing "for example, the image processing task may be image classification and the output generated by the neural network for a given image may be scores for each of a set of object categories, with each score representing an estimated likelihood that the image contains an image of an object belonging to the category. As another example, the image processing task may be image embedding generation and the output generated by the neural network can be a numeric embedding of the input image." Here, Kokiopoulou teaches creating output, such as scores for a set of object categories in image classification, which relates to generating task-specific output by utilizing the set of input embeddings as input to one or more task-specific machine learning models.
Further, Kokiopoulou in view of Hong teaches “the system of claim 1, wherein the instructions are executable by the one or more processors to further configure the system to: receive a set of input embeddings generated by the one or more accelerated machine learning models,”
See Hong in the abstract, page 616, describing “a multi-FPGA acceleration appliance that executes GPT-2 model inference end-to-end with low latency and high throughput in both summarization and generation stages.” See Hong in the Introduction section, page 617, describing “DFX, a multi-FPGA acceleration appliance that specializes in text generation workloads covering end-to-end inference of variously sized GPT models… The FPGA based accelerator provides fully reprogrammable hardware to support new operations and larger dimensions of the evolving transformer with minimum cost.” This illustrates that the GPT-2 model Hong describes is an accelerator-based or accelerated model. Further, Hong in section II Background, part A, pages 617-618, describes “GPT-2 Structure [where] the token embedding, located at the beginning of the decoder, is responsible for converting an input word(s) into an embedding vector. The input word is converted to the numeric token ID based on a dictionary. Then, the pre-trained matrices, word token embedding (WTE) and word position embedding (WPE), are indexed with the token ID to obtain the corresponding vectors. WTE contains token-related encoding, and WPE contains position-related encoding. The two vectors are added to get the embedding vector. LM head, located at the end of the decoder, has the opposite role to the token embedding. It converts the output embedding vector into the token ID. ... The selected token ID represents the generated word.” Here, the model embeddings include the token ID, which represents the set of embeddings generated by one or more accelerated machine learning models in response to input. Hong teaches an accelerated machine learning model where instructions executable by the one or more processors program the system to receive a set of input embeddings.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the base reference of Kokiopoulou with the teachings of Hong by taking Kokiopoulou’s method of performing neural architecture search to generate candidate model architectures for task-specific models using accelerators, and incorporating Hong’s teaching that accelerated machine learning models generate training input data comprising at least a set of embeddings.
One of ordinary skill in the art would be motivated to do so because integrating Hong’s framework into the methods of Kokiopoulou would bring an “FPGA-based accelerator [that] provides fully reprogrammable hardware to support new operations and larger dimensions of the evolving transformer with minimum cost” (Hong, Introduction, page 617), and “a multi-device system that adopts model parallelism and efficient network is necessary to maximize the amount of parallel computation with minimal additional latency” (Hong, section III Motivation, part C. Parallel Computing, page 619).
Claim 3:
Regarding claim 3, Kokiopoulou in view of Hong teaches the limitations in claim 1.
Further, Kokiopoulou teaches “the system of claim 1, wherein the one or more accelerated machine learning models are configured to be executed on one or more hardware accelerators,”
See Kokiopoulou in paragraph [0076] describing "data processing apparatus for implementing machine learning models can also include, for example, special-purpose hardware accelerator units for processing common and compute-intensive parts of machine learning training or production, i.e., inference, workloads." Here, Kokiopoulou describes that hardware accelerator units are used for running machine learning models, and thereby teaches machine learning models that are configured to be executed on one or more hardware accelerators.
Claim 4:
Regarding claim 4, Kokiopoulou in view of Hong teaches the limitations in claim 3.
Further, Kokiopoulou teaches "the system of claim 3, wherein the one or more hardware accelerators comprise one or more field-programmable gate arrays (FPGAs), graphics processing units (GPUs), tensor processing units (TPUs), or application-specific integrated circuits (ASICs)."
See Kokiopoulou in paragraphs [0068, 0072] describing "the term 'data processing apparatus' refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can also be, or further include, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit)....The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by special purpose logic circuitry, e.g., an FPGA or an ASIC, or by a combination of special purpose logic circuitry and one or more programmed computers." Here, Kokiopoulou teaches using hardware accelerators that comprise either an FPGA or an ASIC.
Claim 5:
Regarding claim 5, Kokiopoulou in view of Hong teaches the limitations in claim 1.
Further, Kokiopoulou teaches "the system of claim 1, wherein the plurality of pre-defined search spaces comprises at least (i) a parallel layers search space and (ii) a parallel layers selector search space,"
See Kokiopoulou in paragraph [0033] describing "the set of continuous architecture parameters (including all possible values of the parametrization weights, activation weights, and embedding weights) defines a continuous search space for searching for a final architecture for the task neural network. Searching for the final architecture includes learning continuous parameters, for example, learning u:={{α}}, {β}, {γ}}, where u represents an encoding of the final architecture." Further, see Kokiopoulou in paragraph [0081] describing "similarly, while operations are depicted in the drawings and recited in the claims in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous." Here, Kokiopoulou describes that parallel processing is potentially used in the operations involving search spaces, which relates to a parallel layers search space. Further, Kokiopoulou teaches that the plurality of pre-defined search spaces comprises at least a parallel layers search space and a parallel layers selector search space.
Claim 6:
Regarding claim 6, Kokiopoulou in view of Hong teaches the limitations in claim 1.
Further, Kokiopoulou teaches “the system of claim 1, wherein the selected search space is selected based upon one or more computational constraints,”
See Kokiopoulou in paragraph [0033] describing “the set of continuous architecture parameters (including all possible values of the parametrization weights, activation weights, and embedding weights) defines a continuous search space for searching for a final architecture for the task neural network. Searching for the final architecture includes learning continuous parameters, for example, learning u:={{α}}, {β}, {γ}}, where u represents an encoding of the final architecture”, and further in paragraph [0007] describing “the described techniques identify an effective architecture for performing the new task by selecting, among candidate architectures, an architecture that has a maximum performance estimated by an evaluator neural network. The described techniques further use a continuous parametrization of model architecture which allows for efficient gradient-based optimization of the estimated performance. In particular, the best candidate architecture can be efficiently identified, i.e. identified in a manner that makes efficient use of computational resources, by maximizing the estimated performance with respect to the continuous architecture parameters with simple gradient ascent. In addition, by training the evaluator neural network to estimate the performance of input architectures on a task using meta-features and the previous model training experiments performed on related tasks, the techniques can leverage transfer learning across different training datasets associated with different tasks, thus significantly reducing the computational costs of neural network search that conventional neural network search systems would require.” Here, Kokiopoulou teaches identifying, within a search space, an architecture that has maximum performance while making efficient use of computational resources, by maximizing the estimated performance with respect to the continuous architecture parameters.
The computational constraints upon which the selected search space is selected are based on maximizing the performance of the model’s continuous parameters. Further, see Kokiopoulou in paragraphs [0036-0037, 0049-0050, 0062] for more information on model parameters and model performance metrics.
Claim 7:
Regarding claim 7, Kokiopoulou in view of Hong teaches the limitations in claim 5.
Further, Kokiopoulou teaches “the system of claim 5, wherein, when the selected search space comprises the parallel layers selector search space, the input data further comprises intermediate output generated by the one or more accelerated machine learning models,”
See Kokiopoulou in paragraph [0003] describing "neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input. Some neural networks include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer. Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters." Further, see Kokiopoulou describing in paragraph [0076] that "data processing apparatus for implementing machine learning models can also include, for example, special-purpose hardware accelerator units for processing common and compute-intensive parts of machine learning training or production, i.e., inference, workloads." Here, the examiner interprets intermediate output to be output from the middle or hidden layers of a model, and an accelerated machine learning model to be a model that uses hardware accelerator units. In paragraphs [0003] and [0076], Kokiopoulou describes that the input data comprises intermediate output generated by one or more of the accelerated machine learning models.
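The examiner's reading of "intermediate output" as the output of a model's hidden (middle) layers can be illustrated with a toy forward pass. This is a minimal sketch under assumptions; the layer sizes and tanh activation are illustrative, not Kokiopoulou's network.

```python
import numpy as np

def forward_with_intermediates(x, layers):
    """Run a stack of layers where each layer's output feeds the next
    layer, collecting every layer's output along the way. The outputs
    of the hidden (non-final) layers are the intermediate output."""
    outputs = []
    for weight in layers:
        x = np.tanh(x @ weight)
        outputs.append(x)
    return x, outputs[:-1]  # final output, plus hidden-layer outputs

rng = np.random.default_rng(1)
layers = [rng.standard_normal((4, 4)) for _ in range(3)]
final, intermediates = forward_with_intermediates(rng.standard_normal(4), layers)
```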
Claim 8:
Regarding claim 8, Kokiopoulou in view of Hong teaches the limitations in claim 1.
Further, Kokiopoulou teaches "the system of claim 1, wherein determining the set of candidate model architectures comprises utilizing a neural architecture search framework."
See Kokiopoulou in paragraph [0009] "FIG. 1 shows an example neural architecture search system for determining a final architecture for a task neural network to perform a target machine learning task." Here, Kokiopoulou shows in figure 1 that determining the set of candidate model architectures involves using a neural architecture search system, which corresponds to determining the set of candidate model architectures by utilizing a neural architecture search framework.
[Kokiopoulou, FIG. 1 reproduced (media_image3.png)]
Claim 9:
Regarding claim 9, Kokiopoulou in view of Hong teaches the limitations in claim 8.
Further, Kokiopoulou teaches “the system of claim 8, wherein determining the set of candidate model architectures comprises: generating a set of initial candidate model architectures by sampling from the selected search space;”
See Kokiopoulou in paragraph [0025] describing "the system 100 then generates a target meta-features tensor 104 for the target training dataset 102 . The target training dataset 102 includes a plurality of samples and a respective label for each of the samples." Further, in paragraphs [0035] and [0038] "the system 100 repeatedly generates candidate architectures (e.g., candidate architectures 106, 108, and 110 ) from the search space and evaluates performance of each of the generated candidate architectures... In some other implementations, the system 100 performs a random search from the current values of the set of architecture parameters in the search space, and returns a result of the random search as the new values of the set of architecture parameters." Note that the examiner interprets the initial model architectures to be those automatically generated by a system upon finding the model architectures in the search space, and interprets sampling as finding and obtaining a random set of values. Here, Kokiopoulou describes the system 100 creating candidate model architectures 106, 108, and 110 by sampling from the plurality of samples of paragraph [0025], and in paragraph [0038] describes performing a random search, which is similar to sampling from the selected search space.
Further, Kokiopoulou teaches “training initial candidate model architectures of the set of initial candidate model architectures using a set of NAS training data,”
See Kokiopoulou in paragraph [0056] describing "FIG. 3 is a flow diagram of an example process for training an evaluator neural network. For convenience, the process 300 will be described as being performed by a system of one or more computers located in one or more locations. For example, a neural architecture search system, e.g., the neural architecture search system 100 of FIG. 1, appropriately programmed, can perform the process 300." Here, Kokiopoulou teaches in figures 1 and 3 the training process for the initial candidate model architectures of the set of initial candidate model architectures using the neural architecture search system 100 of figure 1 (i.e., a set of NAS, or neural architecture search, training data).
[Kokiopoulou, FIG. 3 reproduced (media_image4.png)]
Further, see Kokiopoulou mention in paragraph [0062] "in particular, the system can train an instance of neural network having the sample architecture on the sample machine learning task to determine values of parameters of the instance of neural network having the sample architecture. The system can then determine an accuracy score of the trained instance of neural network based on the performance of the trained instance of neural network on the sample machine learning task. For example, the accuracy score can represent an accuracy of the trained instance on a validation set as measured by an appropriate accuracy measure." Kokiopoulou teaches in paragraph 0062 how training the initial candidate model architectures occurs.
Further, Kokiopoulou teaches “evaluating whether each of the initial candidate model architectures of the set of initial candidate model architectures satisfies one or more performance metrics,”
See Kokiopoulou describing in paragraph [0062] "in particular, the system can train an instance of neural network having the sample architecture on the sample machine learning task to determine values of parameters of the instance of neural network having the sample architecture. The system can then determine an accuracy score of the trained instance of neural network based on the performance of the trained instance of neural network on the sample machine learning task. For example, the accuracy score can represent an accuracy of the trained instance on a validation set as measured by an appropriate accuracy measure." Further, see Kokiopoulou in paragraph [0063] "the system adds an evaluator training example to an evaluator training dataset (step 310 ). The evaluator training example includes (i) the sample meta-features tensor associated with the sample training dataset, (ii) data specifying the at least one sample architecture, and (iii) the generated sample performance score." Here, Kokiopoulou in paragraph [0063] explicitly mentions an evaluator that evaluates data including the generated sample performance score, which corresponds to evaluating whether each of the initial candidate model architectures satisfies one or more performance metrics.
Further, Kokiopoulou teaches “and defining the set of candidate model architectures as the initial candidate model architectures of the set of initial candidate model architectures that satisfy the one or more performance metrics,”
See Kokiopoulou in paragraph [0062] "in particular, the system can train an instance of neural network having the sample architecture on the sample machine learning task to determine values of parameters of the instance of neural network having the sample architecture. The system can then determine an accuracy score of the trained instance of neural network based on the performance of the trained instance of neural network on the sample machine learning task. For example, the accuracy score can represent an accuracy of the trained instance on a validation set as measured by an appropriate accuracy measure." Here, Kokiopoulou teaches using a performance metric of an accuracy score, which relates to defining the set of candidate model architectures as the initial candidate model architectures that satisfy the one or more performance metrics.
Claim 10:
Regarding claim 10, Kokiopoulou in view of Hong teaches the limitations in claim 9.
Further, Kokiopoulou teaches “the system of claim 9, wherein the set of NAS training data also comprises (i) input data generated by the one or more accelerated machine learning models and (ii) task-specific ground truth output,”
See Kokiopoulou in paragraph [0039] describing "to evaluate performance of the candidate architecture 106 , the system 100 uses an evaluator neural network 120 . The evaluator neural network 120 has been trained to process an input including (i) a meta-features tensor of a given training dataset associated with a given machine learning task, and (ii) data specifying a given architecture to generate a performance score that estimates a performance of the given architecture on the given machine learning task. The evaluator neural network 120 can be trained using machine learning techniques such as stochastic gradient descent with momentum." Further, see Kokiopoulou in paragraph [0025] describing "system 100 then generates a target meta-features tensor 104 for the target training dataset 102 . The target training dataset 102 includes a plurality of samples and a respective label for each of the samples. For example, if the target machine learning task is an image classification or recognition task, a sample in the dataset 102 can be an image and its respective label can be a ground-truth output that includes scores for each of a set of object classes, with each score representing the likelihood that the image contains an image of an object belonging to the object class. The target meta-features tensor 104 represents features (e.g., characteristics and statistics) of the target training dataset 102." Here, Kokiopoulou teaches that the training data from a neural architecture system comprises a meta-features tensor of a given training dataset (i.e., input data generated by one or more accelerated machine learning models) and a ground-truth output that includes scores for each of a set of object classes (i.e., a task-specific ground truth output).
Claim 11:
Regarding claim 11, Kokiopoulou in view of Hong teaches the limitations in claim 10.
Further, Kokiopoulou teaches “the system of claim 10, wherein the input data of the set of NAS training data comprises intermediate output generated by the one or more accelerated machine learning models when the selected search space comprises a parallel layers selector search space,”
See Kokiopoulou in paragraph [0003] describing "neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input. Some neural networks include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer. Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters." Further, see Kokiopoulou describing in paragraph [0076] that "data processing apparatus for implementing machine learning models can also include, for example, special-purpose hardware accelerator units for processing common and compute-intensive parts of machine learning training or production, i.e., inference, workloads." See paragraph [0081] for more information. Here, the examiner interprets intermediate output to be output from the middle or hidden layers of a model, and an accelerated machine learning model to be a model that uses hardware accelerator units. In paragraphs [0003] and [0076], Kokiopoulou describes that the input data comprises intermediate output generated by one or more of the accelerated machine learning models, and comprises a parallel layers selector search space since neural networks have one or more hidden layers that work in parallel.
Claim 12:
Regarding claim 12, Kokiopoulou in view of Hong teaches the limitations in claim 9.
Further, Kokiopoulou teaches “the system of claim 9, wherein determining the set of candidate model architectures comprises generating a set of weights for each candidate model architecture of the set of candidate model architectures.”
See Kokiopoulou describe in paragraph [0033] "the set of continuous architecture parameters (including all possible values of the parametrization weights, activation weights, and embedding weights) defines a continuous search space for searching for a final architecture for the task neural network. Searching for the final architecture includes learning continuous parameters, for example, learning u:={{α}}, {β}, {γ}}, where u represents an encoding of the final architecture." Here, Kokiopoulou teaches that to determine a set of candidate model architectures, the system also generates a set of parameters that include a set of weights such as Kokiopoulou illustrates with parametrization weights, activation weights, and embedding weights.
Claim 14:
Regarding claim 14, Kokiopoulou in view of Hong teaches the limitations in claim 1.
Further, Kokiopoulou teaches “the system of claim 1, wherein the evaluation of performance of each task-specific machine learning model of the set of task-specific machine learning models utilizes a set of validation data, wherein the set of validation data also comprises (i) input data generated by the one or more accelerated machine learning models and (ii) task-specific ground truth output,”
See Kokiopoulou describing in paragraph [0039] "to evaluate performance of the candidate architecture 106, the system 100 uses an evaluator neural network 120 . The evaluator neural network 120 has been trained to process an input including (i) a meta-features tensor of a given training dataset associated with a given machine learning task, and (ii) data specifying a given architecture to generate a performance score that estimates a performance of the given architecture on the given machine learning task. The evaluator neural network 120 can be trained using machine learning techniques such as stochastic gradient descent with momentum." Further, see Kokiopoulou mentioning in paragraph [0025] "the system 100 then generates a target meta-features tensor 104 for the target training dataset 102 . The target training dataset 102 includes a plurality of samples and a respective label for each of the samples. For example, if the target machine learning task is an image classification or recognition task, a sample in the dataset 102 can be an image and its respective label can be a ground-truth output that includes scores for each of a set of object classes, with each score representing the likelihood that the image contains an image of an object belonging to the object class. The target meta-features tensor 104 represents features (e.g., characteristics and statistics) of the target training dataset 102." Here, Kokiopoulou teaches that the validation data from a neural architecture system comprises a meta-features tensor of a given training dataset (i.e., input data generated by one or more accelerated machine learning models) and a ground-truth output that includes scores for each of a set of object classes (i.e., a task-specific ground truth output).
Claim 15:
Regarding claim 15, Kokiopoulou teaches “a system for generating a set of model architectures for a task-specific machine learning model for use in conjunction with an accelerated machine learning model, the system comprising: one or more processors,”
See Kokiopoulou in paragraph [0068] where “the term “data processing apparatus” refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can also be, or further include, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). The apparatus can optionally include, in addition to hardware, code that creates an execution environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.” Here, Kokiopoulou shows that models are run on one or more processors.
Further, Kokiopoulou teaches “one or more hardware storage devices that store instructions that are executable by the one or more processors to configure the system to: identify a selected search space, the selected search space being selected from a plurality of pre-defined search spaces;”
See Kokiopoulou in paragraph [0076] describing a "data processing apparatus for implementing machine learning models can also include, for example, special-purpose hardware accelerator units for processing common and compute-intensive parts of machine learning training or production, i.e., inference, workloads." Further, Kokiopoulou describes in paragraph [0068] that "the term ‘data processing apparatus’ refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers." Here, Kokiopoulou teaches hardware accelerator units, which relate to the hardware storage devices used to run the processors, as well as the processors themselves.
Further, see Kokiopoulou in paragraphs [0034, 0036] describing "to determine the final architecture for the task neural network, the system 100 generates, from the continuous search space, a candidate architecture (e.g., candidate architecture 106 ) for the task neural network for performing the target machine learning task. The search space is represented by the above set of continuous architecture parameters.... In particular, to generate a candidate architecture (e.g., candidate architecture 106 ) from the search space, the system 100 generates new values for the set of architecture parameters from current values of the set of architecture parameters. The system 100 can generate the new values by performing gradient ascent search or random search (or another approximate optimization method) from the current values of the set of architecture parameters." Here, Kokiopoulou shows that the selected search space is selected by a system that generates new values for the set of architecture parameters, which define the search space, from current values of the set of architecture parameters, where the current values correspond to a plurality of pre-defined search spaces (i.e., identify a selected search space, the selected search space being selected from a plurality of pre-defined search spaces).
Further, Kokiopoulou teaches “determine a set of candidate model architectures from the selected search space utilizing model architecture search,”
See Kokiopoulou in paragraph [0034] mentioning that "to determine the final architecture for the task neural network, the system 100 generates, from the continuous search space, a candidate architecture (e.g., candidate architecture 106 ) for the task neural network for performing the target machine learning task. The search space is represented by the above set of continuous architecture parameters." Further, see Kokiopoulou in paragraph [0049] describing "the system generates, from a search space defining a plurality of architectures, a candidate architecture for the task neural network for performing the target machine learning task (step 204 ). The search space is represented by a set of continuous architecture parameters." Here, Kokiopoulou describes the system generating candidate architectures from a search space defining a plurality of architectures, which corresponds to determining a set of candidate model architectures from the selected search space utilizing model architecture search.
Further, Kokiopoulou teaches “wherein determining the set of candidate model architectures comprises: generating a set of initial candidate model architectures by sampling from the selected search space;”
See Kokiopoulou in paragraphs [0035] and [0038] that "the system 100 repeatedly generates candidate architectures (e.g., candidate architectures 106, 108, and 110 ) from the search space and evaluates performance of each of the generated candidate architectures... In some other implementations, the system 100 performs a random search from the current values of the set of architecture parameters in the search space, and returns a result of the random search as the new values of the set of architecture parameters."
Kokiopoulou mentions in paragraph [0060] “the system samples, from the search space, at least one sample architecture (step 306 ).” Note that the examiner interprets the initial model architectures to be those automatically generated by a system upon finding the model architectures in the search space, and interprets sampling as finding and obtaining a random set of values. Here, Kokiopoulou describes the system 100 creating a set of initial candidate model architectures 106, 108, and 110 and sampling at least one sample architecture from the selected search space. For more information, see paragraphs [0059-0063].
Further, Kokiopoulou teaches “training initial candidate model architectures of the set of initial candidate model architectures using a set of model architecture search training data,”
See Kokiopoulou in paragraph [0056] describing "FIG. 3 is a flow diagram of an example process for training an evaluator neural network. For convenience, the process 300 will be described as being performed by a system of one or more computers located in one or more locations. For example, a neural architecture search system, e.g., the neural architecture search system 100 of FIG. 1, appropriately programmed, can perform the process 300." Here, Kokiopoulou teaches in figures 1 and 3 the training process for the initial candidate model architectures of the set of initial candidate model architectures using the neural architecture search system 100 of figure 1 (i.e., a set of model architecture search training data).
[Kokiopoulou, FIG. 3 reproduced (media_image4.png)]
Further, see Kokiopoulou mention in paragraph [0062] "in particular, the system can train an instance of neural network having the sample architecture on the sample machine learning task to determine values of parameters of the instance of neural network having the sample architecture. The system can then determine an accuracy score of the trained instance of neural network based on the performance of the trained instance of neural network on the sample machine learning task. For example, the accuracy score can represent an accuracy of the trained instance on a validation set as measured by an appropriate accuracy measure." Kokiopoulou teaches in paragraph 0062 how training the initial candidate model architectures occurs.
Further, Kokiopoulou teaches “evaluating whether each of the initial candidate model architectures of the set of initial candidate model architectures satisfies one or more performance metrics;”
See Kokiopoulou describing in paragraph [0062] "in particular, the system can train an instance of neural network having the sample architecture on the sample machine learning task to determine values of parameters of the instance of neural network having the sample architecture. The system can then determine an accuracy score of the trained instance of neural network based on the performance of the trained instance of neural network on the sample machine learning task. For example, the accuracy score can represent an accuracy of the trained instance on a validation set as measured by an appropriate accuracy measure." Further, see Kokiopoulou in paragraph [0063] "the system adds an evaluator training example to an evaluator training dataset (step 310 ). The evaluator training example includes (i) the sample meta-features tensor associated with the sample training dataset, (ii) data specifying the at least one sample architecture, and (iii) the generated sample performance score." Here, Kokiopoulou in paragraph [0063] explicitly mentions an evaluator that evaluates data including the generated sample performance score, which corresponds to evaluating whether each of the initial candidate model architectures satisfies one or more performance metrics.
Further, Kokiopoulou teaches “defining the set of candidate model architectures as the initial candidate model architectures of the set of initial candidate model architectures that satisfy the one or more performance metrics; and output the set of candidate model architectures,”
See Kokiopoulou teaching in paragraphs [0034-0035] and in figure 1 that "the system 100 generates, from the continuous search space, a candidate architecture (e.g., candidate architecture 106 ) for the task neural network for performing the target machine learning task. The search space is represented by the above set of continuous architecture parameters. The system 100 repeatedly generates candidate architectures (e.g., candidate architectures 106 , 108 , and 110 ) from the search space and evaluates performance of each of the generated candidate architectures."
[Kokiopoulou, FIG. 1 reproduced (media_image5.png)]
In addition, Kokiopoulou mentions in paragraphs [0043-0044] “after generating the candidate architectures and determining their respective candidate performance scores, the system 100 identifies, as the final architecture 140 , a candidate architecture that has a maximum candidate performance score among the generated candidate architectures. The system 100 can then output architecture data 150 that specifies the final architecture 140 of the neural network, i.e., data specifying the layers that are part of the final architecture, the connectivity between the layers, and the operations performed by the layers. For example, the system 100 can output the architecture data 150 to the user who submitted the target training dataset.” Note that the examiner construes the term ‘set’ to mean a group or collection of things related to the same subject. Here, Kokiopoulou teaches a neural architecture search system 100 that outputs the set of candidate model architectures including candidate architectures 106, 108, and 110, as well as a final architecture 140 and its architecture data 150, as shown in figure 1. Since the system repeatedly generates candidate architectures such as candidate architectures 106, 108, and 110, this counts as outputting the set of candidate model architectures.
Further, see Kokiopoulou in paragraph [0062] "in particular, the system can train an instance of neural network having the sample architecture on the sample machine learning task to determine values of parameters of the instance of neural network having the sample architecture. The system can then determine an accuracy score of the trained instance of neural network based on the performance of the trained instance of neural network on the sample machine learning task. For example, the accuracy score can represent an accuracy of the trained instance on a validation set as measured by an appropriate accuracy measure." Here, Kokiopoulou teaches using a performance metric of an accuracy score, which relates to defining the set of candidate model architectures as the initial candidate model architectures that satisfy the one or more performance metrics. This performance metric is part of the performance scores in figure 1. Further, Kokiopoulou in figure 1 teaches that after the system evaluates which model performance satisfies the one or more performance metrics, the system outputs a set of candidate model architectures including the final architecture 140 along with its architecture data 150.
Further, Kokiopoulou teaches “training initial candidate model architectures of the set of initial candidate model architectures using a set of model architecture search training data, wherein the set of model architecture search training data comprises … (ii) task-specific ground truth output, …”
See Kokiopoulou mentioning in paragraph [0025] "the system 100 then generates a target meta-features tensor 104 for the target training dataset 102. The target training dataset 102 includes a plurality of samples and a respective label for each of the samples. For example, if the target machine learning task is an image classification or recognition task, a sample in the dataset 102 can be an image and its respective label can be a ground-truth output that includes scores for each of a set of object classes, with each score representing the likelihood that the image contains an image of an object belonging to the object class. The target meta-features tensor 104 represents features (e.g., characteristics and statistics) of the target training dataset 102." Here, Kokiopoulou teaches that one type of training data from a neural architecture system comprises a ground-truth output that includes scores for each of a set of object classes (i.e., a task-specific ground truth output).
However, Kokiopoulou did not explicitly teach “training initial candidate model architectures of the set of initial candidate model architectures using a set of model architecture search training data, wherein the set of model architecture search training data comprises (i) input data generated by one or more accelerated machine learning models, and wherein the input data comprises at least a set of embeddings generated by the one or more accelerated machine learning models,”
In an analogous system, Hong teaches “wherein the set of model architecture search training data comprises (i) input data generated by one or more accelerated machine learning models,” and “wherein the input data comprises at least a set of embeddings generated by the one or more accelerated machine learning models,”
See Hong in abstract, page 616, describing “a multi-FPGA acceleration appliance that executes GPT-2 model inference end-to-end with low latency and high throughput in both summarization and generation stages.” Further, see Hong in Introduction section, page 617 describing “DFX, a multi-FPGA acceleration appliance that specializes in text generation workloads covering end-to-end inference of variously sized GPT models… The FPGA based accelerator provides fully reprogrammable hardware to support new operations and larger dimensions of the evolving transformer with minimum cost.” See further in page 617, section II. Background, part A. GPT Language Model, where Hong shows “GPT is able to remove the encoder by using an alternate method called token embedding, a process that uses pre-trained matrices in place of the encoder.” This illustrates that the GPT-2 model Hong describes is an accelerator-based, or accelerated, model and is pre-trained.
Further, Hong in section II Background, part A, pages 617-618, describes “GPT-2 Structure The token embedding, located at the beginning of the decoder, is responsible for converting an input word(s) into an embedding vector. The input word is converted to the numeric token ID based on a dictionary. Then, the pre-trained matrices, word token embedding (WTE) and word position embedding (WPE), are indexed with the token ID to obtain the corresponding vectors. WTE contains token-related encoding, and WPE contains position-related encoding. The two vectors are added to get the embedding vector. LM head, located at the end of the decoder, has the opposite role to the token embedding. It converts the output embedding vector into the token ID. ... The selected token ID represents the generated word.” Here, Hong teaches that the model embeddings include the token ID, which represents the set of embeddings generated by one or more accelerated machine learning models in response to input. Since pre-training is part of the training process, Hong overall teaches a model that is also trained using input data generated by one or more accelerated machine learning models, where that input data comprises a set of embeddings generated by the one or more accelerated machine learning models.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the base reference of Kokiopoulou with the teachings of Hong by taking Kokiopoulou’s teaching of a method for performing neural architecture search to generate candidate model architectures for task-specific models using accelerators and incorporating Hong’s teaching that accelerator-associated, or accelerated, machine learning models generate training input data comprising at least a set of embeddings.
One of ordinary skill in the art would be motivated to do so because, by integrating Hong’s framework into the methods of Kokiopoulou, one of ordinary skill in the art would gain an “FPGA-based accelerator [that] provides fully reprogrammable hardware to support new operations and larger dimensions of the evolving transformer with minimum cost” (Hong, Introduction, page 617), and “a multi-device system that adopts model parallelism and efficient network is necessary to maximize the amount of parallel computation with minimal additional latency” (Hong, section III Motivation, part C. Parallel Computing, page 619).
Claim 16:
Regarding claim 16, Kokiopoulou in view of Hong teaches the limitations in claim 15.
Further, Kokiopoulou teaches “the system of claim 15, wherein the one or more accelerated machine learning models are configured to be executed on one or more hardware accelerators,”
See Kokiopoulou in paragraph [0076], describing "data processing apparatus for implementing machine learning models can also include, for example, special-purpose hardware accelerator units for processing common and compute-intensive parts of machine learning training or production, i.e., inference, workloads." Here, Kokiopoulou describes that hardware accelerator systems are used for running machine learning models, and thus teaches that the machine learning models are configured to be executed on one or more hardware accelerators.
Claim 17:
Regarding claim 17, Kokiopoulou in view of Hong teaches the limitations in claim 15.
Further, Kokiopoulou teaches "the system of claim 15, wherein the plurality of pre-defined search spaces comprises at least (i) a parallel layers search space and (ii) an parallel layers selector search space,"
See Kokiopoulou in paragraph [0033], describing "the set of continuous architecture parameters (including all possible values of the parametrization weights, activation weights, and embedding weights) defines a continuous search space for searching for a final architecture for the task neural network. Searching for the final architecture includes learning continuous parameters, for example, learning u:={{α}}, {β}, {γ}}, where u represents an encoding of the final architecture." Further, see Kokiopoulou in paragraph [0081], describing "similarly, while operations are depicted in the drawings and recited in the claims in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous." Here, Kokiopoulou describes that parallel processing may be used in the operations involving search spaces, which relates to a parallel layers search space. Thus, Kokiopoulou teaches that the plurality of pre-defined search spaces comprises at least a parallel layers search space and a parallel layers selector search space.
Claim 18:
Regarding claim 18, Kokiopoulou in view of Hong teaches the limitations in claim 15.
Further, Kokiopoulou teaches “the system of claim 15, wherein determining the set of candidate model architectures further comprises generating a set of weights for each candidate model architecture of the set of candidate model architectures,”
See Kokiopoulou in paragraph [0033], describing "the set of continuous architecture parameters (including all possible values of the parametrization weights, activation weights, and embedding weights) defines a continuous search space for searching for a final architecture for the task neural network. Searching for the final architecture includes learning continuous parameters, for example, learning u:={{α}}, {β}, {γ}}, where u represents an encoding of the final architecture." Here, Kokiopoulou teaches that, to determine the set of candidate model architectures, the system also generates a set of parameters that includes a set of weights, as Kokiopoulou illustrates with the parametrization weights, activation weights, and embedding weights.
Claim 20:
Regarding claim 20, Kokiopoulou teaches “a system for generating one or more task-specific machine learning models for use in conjunction with one or more accelerated machine learning models, the system comprising: one or more processors;”
See Kokiopoulou in paragraph [0068], where “the term “data processing apparatus” refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can also be, or further include, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). The apparatus can optionally include, in addition to hardware, code that creates an execution environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.” Here, Kokiopoulou shows running models on one or more processors.
Further, Kokiopoulou teaches “and one or more hardware storage devices that store instructions that are executable by the one or more processors to configure the system to: access a set of candidate model architectures, the set of candidate model architectures being generated by: identifying a selected search space, the selected search space being selected from a plurality of pre-defined search spaces;”
See Kokiopoulou in paragraph [0076], describing a "data processing apparatus for implementing machine learning models can also include, for example, special-purpose hardware accelerator units for processing common and compute-intensive parts of machine learning training or production, i.e., inference, workloads." Here, Kokiopoulou shows hardware accelerator units, which relate to hardware storage devices, that are used together with the one or more processors.
Further, see Kokiopoulou in paragraphs [0034] and [0036], describing "to determine the final architecture for the task neural network, the system 100 generates, from the continuous search space, a candidate architecture (e.g., candidate architecture 106 ) for the task neural network for performing the target machine learning task. The search space is represented by the above set of continuous architecture parameters.... In particular, to generate a candidate architecture (e.g., candidate architecture 106 ) from the search space, the system 100 generates new values for the set of architecture parameters from current values of the set of architecture parameters. The system 100 can generate the new values by performing gradient ascent search or random search (or another approximate optimization method) from the current values of the set of architecture parameters." Here, Kokiopoulou shows that the selected search space is selected by a system that generates new values for the set of architecture parameters that are part of the search space from current values of the set of architecture parameters, where the current values correspond to a plurality of pre-defined search spaces.
Further, Kokiopoulou teaches “and determining the set of candidate model architectures from the selected search space utilizing model architecture search;”
See Kokiopoulou in paragraph [0009]: "FIG. 1 shows an example neural architecture search system for determining a final architecture for a task neural network to perform a target machine learning task." Here, Kokiopoulou shows in FIG. 1 that determining the set of candidate model architectures involves using a neural architecture search system, which relates to determining the set of candidate model architectures utilizing a model architecture search.
[Image: media_image3.png (greyscale, 526 × 656)]
Further, see Kokiopoulou in paragraphs [0035] and [0038], describing "the system 100 repeatedly generates candidate architectures (e.g., candidate architectures 106, 108, and 110 ) from the search space and evaluates performance of each of the generated candidate architectures... In some other implementations, the system 100 performs a random search from the current values of the set of architecture parameters in the search space, and returns a result of the random search as the new values of the set of architecture parameters." Here, Kokiopoulou teaches determining the set of candidate model architectures, such as candidate architectures 106, 108, and 110, from the selected search space using model architecture search.
Further, Kokiopoulou teaches “and output one or more task-specific machine learning models from the set of task-specific machine learning models based upon an evaluation of performance of each task-specific machine learning model of the set of task-specific machine learning models,”
See Kokiopoulou in paragraphs [0043]-[0044], describing “after generating the candidate architectures and determining their respective candidate performance scores, the system 100 identifies, as the final architecture 140, a candidate architecture that has a maximum candidate performance score among the generated candidate architectures. The system 100 can then output architecture data 150 that specifies the final architecture 140 of the neural network, i.e., data specifying the layers that are part of the final architecture, the connectivity between the layers, and the operations performed by the layers. For example, the system 100 can output the architecture data 150 to the user who submitted the target training dataset.”
Here, in paragraph [0043], Kokiopoulou teaches that the system 100 identifies a final architecture 140 that has a maximum candidate performance score among the generated candidate architectures, which relates to outputting one or more task-specific machine learning models (i.e., output one or more task-specific machine learning models from the set of task-specific machine learning models based upon an evaluation of performance of each task-specific machine learning model of the set of task-specific machine learning models).
[Image: media_image5.png (greyscale, 948 × 784)]
Further, see Kokiopoulou in paragraph [0062] illustrating an example of a performance score, and paragraphs [0014] and [0025] for more information.
Further, Kokiopoulou teaches “train a set of task-specific machine learning models based upon the set of candidate model architectures, wherein each task-specific machine learning model comprises a model architecture from the set of candidate model architectures, and wherein each task-specific machine learning model is trained using a set of training data comprising … (ii) task-specific ground truth output,”
See Kokiopoulou in paragraph [0025], describing “the system 100 then generates a target meta-features tensor 104 for the target training dataset 102. The target training dataset 102 includes a plurality of samples and a respective label for each of the samples. For example, if the target machine learning task is an image classification or recognition task, a sample in the dataset 102 can be an image and its respective label can be a ground-truth output that includes scores for each of a set of object classes, with each score representing the likelihood that the image contains an image of an object belonging to the object class.” Kokiopoulou shows that the target training dataset 102 (i.e., training data) includes samples and their respective labels. Here, Kokiopoulou teaches training data in which each sample’s label is a ground-truth output comprising scores representing the likelihood that an object belongs to an object class associated with the machine learning task (i.e., each task-specific machine learning model is trained using a set of training data comprising (ii) task-specific ground truth output).
However, Kokiopoulou fails to teach “train a set of task-specific machine learning models …wherein each task-specific machine learning model is trained using a set of training data comprising (i) input data generated by one or more accelerated machine learning models,… wherein the input data comprises at least a set of embeddings generated by the one or more accelerated machine learning models;”
In an analogous system, Hong teaches “train a set of task-specific machine learning models …wherein each task-specific machine learning model is trained using a set of training data comprising (i) input data generated by one or more accelerated machine learning models,… wherein the input data comprises at least a set of embeddings generated by the one or more accelerated machine learning models;”
See Hong in abstract, page 616, describing “a multi-FPGA acceleration appliance that executes GPT-2 model inference end-to-end with low latency and high throughput in both summarization and generation stages.” Further, see Hong in Introduction section, page 617, describing “DFX, a multi-FPGA acceleration appliance that specializes in text generation workloads covering end-to-end inference of variously sized GPT models… The FPGA based accelerator provides fully reprogrammable hardware to support new operations and larger dimensions of the evolving transformer with minimum cost.” See further on page 617, section II. Background, part A. GPT Language Model, where Hong shows “GPT is able to remove the encoder by using an alternate method called token embedding, a process that uses pre-trained matrices in place of the encoder.” This illustrates that the GPT-2 model Hong describes is an accelerator-based model or accelerated model, and is pre-trained.
Further, Hong in section II Background, part A, pages 617-618, describes “GPT-2 Structure The token embedding, located at the beginning of the decoder, is responsible for converting an input word(s) into an embedding vector. The input word is converted to the numeric token ID based on a dictionary. Then, the pre-trained matrices, word token embedding (WTE) and word position embedding (WPE), are indexed with the token ID to obtain the corresponding vectors. WTE contains token-related encoding, and WPE contains position-related encoding. The two vectors are added to get the embedding vector. LM head, located at the end of the decoder, has the opposite role to the token embedding. It converts the output embedding vector into the token ID. ... The selected token ID represents the generated word.” Here, the model embeddings include the token ID, which represents the set of embeddings generated by the one or more accelerated machine learning models in response to input. Since pre-training is part of the training process, Hong overall teaches a model that is trained using input data generated by the one or more accelerated machine learning models, where that same input data comprises a set of embeddings generated by the one or more accelerated machine learning models.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the base reference of Kokiopoulou with the teachings of Hong by using Kokiopoulou’s teachings of a method for performing neural architecture search to generate candidate model architectures for task-specific models using accelerators, and incorporating Hong’s teaching that the accelerator-associated or accelerated machine learning models generate training input data comprising at least a set of embeddings.
One of ordinary skill in the art would be motivated to do so because, by integrating Hong’s framework into the methods of Kokiopoulou, one of ordinary skill in the art would obtain an FPGA-based accelerator that “provides fully reprogrammable hardware to support new operations and larger dimensions of the evolving transformer with minimum cost” (Hong, section Introduction, page 617), and because “a multi-device system that adopts model parallelism and efficient network is necessary to maximize the amount of parallel computation with minimal additional latency” (Hong, section III Motivation, part C. Parallel Computing, page 619).
Claims 13 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Kokiopoulou in view of Hong, and in further view of Lym S. et al., “PruneTrain: Fast Neural Network Training by Dynamic Sparse Model Reconfiguration”, available at https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10902313, presented at a conference on November 17-22, 2019 (hereafter, Lym).
Claim 13:
Regarding claim 13, Kokiopoulou in view of Hong teaches the limitations in claim 12.
However, Kokiopoulou in view of Hong fail to teach the limitation “the system of claim 12, wherein training the set of task-specific machine learning models based upon the set of candidate model architectures comprises refraining from using the set of weights for each candidate model architecture of the set of candidate model architectures,”
In an analogous art, Lym teaches “the system of claim 12, wherein training the set of task-specific machine learning models based upon the set of candidate model architectures comprises refraining from using the set of weights for each candidate model architecture of the set of candidate model architectures,”
See Lym in page 4, section 4.1 Model pruning mechanism, describing "this lasso regularization sparsifies groups of weights by forcing the weights in each group to very small values, when possible without incurring high error. After sparsification, we use a small threshold of 10^-4 to zero out these weights." Further, in page 2, section I. Introduction, Lym describes, "for efficient execution on data-parallel training accelerators (e.g., GPUs), we group parameters at channel granularity and prune those channels for which all parameters are below a threshold." Note the examiner construes refraining from using model weights to mean the same as pruning, sparsifying, or zeroing out weights. Here, in pages 2 and 4, Lym describes sparsifying and pruning groups of weights for an architecture model using GPU accelerators, which relates to refraining from using the set of weights for each candidate model architecture of the set of candidate model architectures. Here, the groups of weights can also refer to the set of weights.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the references of Kokiopoulou and Hong by using the teachings of Kokiopoulou and Hong in methods for candidate model architecture search and then training the set of task-specific machine learning models based upon the set of candidate model architectures, and incorporating Lym’s teaching of refraining from using the set of weights for each candidate model architecture of the set of models that were trained.
One of ordinary skill in the art would be motivated to do so because, by integrating Lym’s framework into the methods of Kokiopoulou and Hong, one of ordinary skill in the art would achieve “a more cost-efficient but still dense form. PruneTrain accelerates model training by reducing computation, memory access, and communication costs” (Lym, page 2, Section Introduction).
Claim 19:
Regarding claim 19, Kokiopoulou in view of Hong teaches the limitations in claim 18.
However, Kokiopoulou in view of Hong fail to teach the limitation “the system of claim 18, wherein the instructions are executable by the one or more processors to further configure the system to discard the set of weights for each candidate model architecture of the set of candidate model architectures,”
In an analogous art, Lym teaches “the system of claim 18, wherein the instructions are executable by the one or more processors to further configure the system to discard the set of weights for each candidate model architecture of the set of candidate model architectures,”
See Lym in page 4, section 4.1 Model pruning mechanism, describing "this lasso regularization sparsifies groups of weights by forcing the weights in each group to very small values, when possible without incurring high error. After sparsification, we use a small threshold of 10^-4 to zero out these weights." Further, in page 2, section I. Introduction, Lym describes, "for efficient execution on data-parallel training accelerators (e.g., GPUs), we group parameters at channel granularity and prune those channels for which all parameters are below a threshold." Note the examiner construes discarding the set of weights to mean the same as pruning, sparsifying, or zeroing out weights. Here, in pages 2 and 4, Lym describes sparsifying and pruning groups of weights for an architecture model using GPU accelerators, which relates to discarding the set of weights for each candidate model architecture of the set of candidate model architectures. Here, the groups of weights can also refer to the set of weights.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the references of Kokiopoulou and Hong by using the teachings of Kokiopoulou and Hong in methods for candidate model architecture search and then training the set of task-specific machine learning models based upon the set of candidate model architectures, and incorporating Lym’s teaching of discarding the set of weights for each candidate model architecture of the set of models that were trained.
One of ordinary skill in the art would be motivated to do so because, by integrating Lym’s framework into the methods of Kokiopoulou and Hong, one of ordinary skill in the art would achieve “a more cost-efficient but still dense form. PruneTrain accelerates model training by reducing computation, memory access, and communication costs” (Lym, page 2, Section Introduction).
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to WENWEI ZENG whose telephone number is (571)272-7111. The examiner can normally be reached Monday-Friday, 8am-5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Usmaan Saeed can be reached at (571) 272-4046. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/WenWei Zeng/Examiner, Art Unit 2146
/USMAAN SAEED/Supervisory Patent Examiner, Art Unit 2146