DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
The amendment filed 02/27/2026 has been entered. Claims 18-37 remain pending in the application. Applicant’s amendments to the claims have overcome the objection previously set forth in the Non-Final Office Action mailed 01/05/2026.
Response to Arguments
Applicant’s arguments, filed 02/27/2026, with respect to the rejections of claims 18 and 28 under 103 have been fully considered and are persuasive because of the amendments. Therefore, the rejection has been withdrawn. However, upon further consideration, a new ground(s) of rejection is made over Li et al. (US Pub. 2016/0078339) in view of Rosswog et al. (US Patent 9,514,414) and further in view of Paquet et al. (US Pub. 2012/0158620).
Applicant argues (Pages 10-11)
A. The Cited References, Individually and Collectively, Fail to Teach or Suggest the Independent Claims
The Office Action rejected the independent claims over the combination of Li and Rosswog. The Office Action additionally relied on Paquet to reject dependent claims 27 and 37. Li, Rosswog and Paquet, individually and collectively, fail to teach or suggest the elements of the amended independent claims. Nor do the other references cited in the Office Action.
1. The Cited References Do Not Teach or Suggest Probabilistic Selection of Classified Data from Multiple Classifiers under Control of a Transmitted Control Parameter
As amended, independent claim 18 expressly recites:
"a reference system ... wherein the reference system comprises two or more classifiers each configured to produce classifications for corresponding inputs,"
and further recites:
"the control parameter controls a probability at which classifications produced by each of the two or more classifiers of the reference system are selected for inclusion in the set of additional training data for the student ML system."
These limitations require a specific architecture in which:
• multiple classifiers collectively form a reference system that generates classified data;
• a separate learning experimentation system transmits a control parameter to that reference system; and
• the control parameter governs probabilistic selection of classifier outputs used as
additional training data to train a student ML system to imitate the reference system.
The Office Action relies on Li as disclosing a teacher-student training framework, Rosswog as disclosing evaluation or retraining based on performance, and Paquet as disclosing multiple classifiers or ensemble techniques. However, none of these references teaches or suggests probabilistic selection of classified data from multiple classifiers under control of a transmitted control parameter, as expressly recited in the amended claims.
Li discloses training a neural network based on outputs of another model but does not disclose selecting classifications from multiple classifiers based on a probability parameter transmitted by a separate experimentation system. Rosswog describes classifier evaluation and retraining but does not disclose probabilistic generation of training data from multiple classifiers. Paquet describes ensembles of classifiers for improving classification accuracy but does not disclose probabilistic selection of classifier outputs to form training data for a separate student system, nor does Paquet disclose control of such selection by a learning experimentation system observing the student system during training.
Accordingly, the cited references do not teach or suggest the expressly recited probabilistic classifier-selection architecture now set forth in independent claims 18 and 28.
In response
In response to Applicant's arguments against the references individually, one cannot show nonobviousness by attacking references individually where the rejections are based on combinations of references. See In re Keller, 642 F.2d 413, 208 USPQ 871 (CCPA 1981); In re Merck & Co., 800 F.2d 1091, 231 USPQ 375 (Fed. Cir. 1986).
The Applicant points at each individual reference and argues that the reference alone does not teach the claim limitations, whereas the rejection relies on the combination of references, which does read on the claim limitations.
The combination of references above teaches the argued limitations.
Li, in Fig. 3 and paragraphs 0042-0043, teaches a reference system for generating the training data and, in Fig. 1, teaches a learning experimentation system.
Rosswog, in Fig. 2 and Col. 12, teaches that the learning experimentation system transmits a control parameter to the reference system based on the observations.
Paquet in paragraph 0023 teaches a reference system comprising multiple automated classifiers 32 and human classifiers configured to produce classifications for content items 14.
Paquet in paragraph 0032 also teaches a control parameter (a classification confidence threshold 54).
Further, in paragraphs 0023 and 0032, Paquet teaches that the probability at which classifications generated by these classifiers are selected for inclusion in the set of additional training data is based on the control parameter (classification confidence threshold 54): the classification confidence threshold 54 is used to determine which content items (those whose generated classifications are below the classification confidence threshold 54) are selected to be provided to the human classifier to classify and are used in supplemental training.
The control parameter (classification confidence threshold 54) controls a probability that content items whose classifications were generated by the classifiers (automated and human classifiers) are selected for addition to the training data set by comparing the classification confidence with the threshold; a classification whose confidence is lower than the threshold is added to the training data set. It can be seen that if the classification confidence threshold 54 is set higher, the probability that a content item's classification score falls below the threshold is higher, resulting in more content items (those whose generated classifications are below the classification confidence threshold 54) being provided to the human classifier and later added to the supplemental training data set.
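The threshold-controlled selection mechanism described above can be illustrated with a short sketch. This is purely illustrative and is not code from any cited reference; all names, values, and data are hypothetical. It shows only the general principle that raising a confidence threshold increases the fraction of items routed to supplemental training.

```python
import random

def select_for_supplemental_training(items, confidence_threshold):
    """Illustrative only: select items whose classifier confidence falls
    below the threshold, to be routed to a human classifier and later
    used as supplemental training data."""
    return [item for item in items if item["confidence"] < confidence_threshold]

# Hypothetical content items with classifier confidence scores.
random.seed(0)
items = [{"id": i, "confidence": random.random()} for i in range(1000)]

low = select_for_supplemental_training(items, 0.3)
high = select_for_supplemental_training(items, 0.7)

# A higher threshold selects a larger fraction of items for
# supplemental training, consistent with the explanation above.
print(len(low), len(high))
```

Under this sketch's assumptions, the threshold acts as the control parameter governing how many classified items enter the supplemental training set.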
Therefore, the combination of Li, Rosswog and Paquet does teach the claim limitations.
Applicant argues (Pages 11-12)
The Cited References Do Not Teach or Suggest a Distinct Learning Experimentation System that Observes a Student System During Training and Dynamically Controls How a Separate Reference System Generates Training Data for that Student System
Independent claim 18 additionally requires a three-component feedback architecture including:
• a student ML system trained iteratively,
• a separate reference system that generates additional training data based on its classifications, and
• a learning experimentation system that receives observations from the student system during training and transmits a control parameter affecting how the reference system generates training data, with the additional training data used to train the student system to imitate the reference system.
The Office Action characterizes Li as providing a teacher-student training framework, Rosswog as providing performance-based evaluation or retraining, and Paquet as providing multiple classifiers. However, none of these references discloses or suggests a distinct learning experimentation system that observes a student system during training and dynamically controls how a separate reference system generates training data for that student system. Li's teacher-student arrangement transfers knowledge from a teacher model to a student model; it does not disclose a third system that observes student training and controls how training data are generated. Rosswog likewise does not disclose such a third system or a feedback mechanism affecting generation of training data by a separate reference system. Paquet's ensemble classifiers are directed toward improving classification performance of the ensemble itself, not toward generating training data for a separate student system under feedback control. Thus, even independent of the recent amendments, the cited references do not disclose the claimed three-system architecture.
In response
Again, the Applicant points at each individual reference and argues that the reference alone does not teach the claim limitations, whereas the rejection relies on the combination of references, which does read on the claim limitations.
As mentioned above and in the 103 rejections section below, Li, in Fig. 3 and paragraphs 0042-0043, teaches a reference system for generating the training data and, in Fig. 1, teaches a learning experimentation system; thus, Li teaches two distinct systems, the reference system and the learning experimentation system.
Li in paragraphs 0005, 0034 and 0043 also teaches how the reference system generates training data, with the additional training data used to train the student system to imitate the reference system [paragraph 0005, “The student DNN may be iteratively optimized until its output converges with the output of the teacher DNN. In this way, the student DNN approaches the behavior of the teacher, so that whatever the output of the teacher, the student will approximate”]. These paragraphs of Li also describe a concept in which the teacher DNN keeps training the student DNN with additional data until the student's output converges with the output of the teacher model.
Rosswog, in Fig. 2 and Col. 12, teaches that the learning experimentation system transmits a control parameter to the reference system, based on the received observations, to generate additional data (generating training data … under feedback control).
In paragraphs 0023 and 0032, Paquet teaches that the probability at which classifications generated by these classifiers are selected for inclusion in the set of additional training data is based on the control parameter (classification confidence threshold 54).
Therefore, the combination of Li, Rosswog and Paquet does teach the claim limitations, especially the claimed three-system architecture argued above.
Applicant argues (Pages 12-13)
The Office Action suggests that it would have been obvious to combine Li's teacher-student training approach with Rosswog's performance evaluation concepts and Paquet's multi-classifier teachings. However, the rejection does not articulate a persuasive rationale explaining why one of ordinary skill in the art would have modified Li in the specific manner required by the claims.
Li's disclosed system operates according to a teacher-student paradigm in which knowledge is transferred directly from a teacher model to a student model. The teacher produces outputs, such as classifications or probability distributions, and the student is trained to approximate those outputs. The core mechanism in Li is therefore direct model-to-model knowledge transfer, where supervision flows from a teacher to a student without alteration of the underlying training data generation mechanism.
In contrast, amended independent claims 18 and 28 require a materially different architecture. The claims do not merely require training a student based on outputs of another model. Instead, they require (i) a reference system comprising two or more classifiers, (ii) a distinct learning experimentation system that receives observations from the student system during training, and (iii) transmission of a control parameter that governs a probability at which classifications produced by each of the two or more classifiers are selected for inclusion in a set of additional training data used to train the student system.
This architecture is not a direct teacher-student supervision framework. It is a feedback-controlled training-data generation system in which the composition of training data is dynamically modulated based on observed student behavior. The claimed system alters which classifier outputs are selected, and at what probability, through a control parameter determined by a separate experimentation system. Li does not disclose or suggest such dynamic modulation of training data composition.
Modifying Li to incorporate this probabilistic classifier-selection mechanism would not constitute a routine or predictable variation. It would change the operative training mechanism from fixed teacher supervision to a multi-source, feedback-controlled data-generation architecture involving three distinct machine-learning components. Li relies on direct supervision from a teacher model; it does not contemplate dynamically varying the probability of selecting outputs from multiple classifiers based on observed training behavior of the student.
Rosswog and Paquet do not supply the missing rationale. Rosswog concerns classifier evaluation and retraining but does not suggest replacing Li's teacher supervision with probabilistically generated training data governed by a control parameter transmitted from a separate experimentation system. Paquet concerns ensemble aggregation for improving classification performance of the ensemble itself, not the generation of training data for a separate student system under dynamic feedback control.
The Office Action does not identify any teaching in the cited references that would have motivated one of ordinary skill in the art to restructure Li's teacher-student framework into the
claimed three-system, feedback-controlled probabilistic training-data generation architecture. Nor does the Office Action provide a reasoned explanation grounded in the prior art as to why such a modification would have been desirable.
Absent such articulated reasoning with rational underpinning, the proposed combination relies on hindsight reconstruction using Applicant's disclosure as a blueprint. The rejection therefore does not satisfy the requirements for establishing obviousness under 35 U.S.C. §103.
In response
The Applicant lists the elements recited in claims 18 and 28 and argues that “This architecture is not a direct teacher-student supervision framework”. The examiner respectfully disagrees.
The argued limitation clearly states in items (ii)-(iii) “receives observations from the student system during training … a set of additional training data used to train the student system”. These recitations indicate that the student system is trained using a teacher, coach, or other system, as recited throughout the specification of the current Application, such as in Figs. 1 and 12 and paragraphs 0004-0006 and 0026: “system that comprises one or more "student" machine learning systems along with at least one "coach" machine learning system. The coach machine learning system itself uses machine learning to help the student machine learning system(s). For example, by monitoring a student machine learning system, the coach machine learning system can learn (through machine learning techniques) "hyperparameters" for the student machine learning system that control the machine learning process for the student learning system … the machine learning coach observes the student learning system(s) while the student learning system(s) is/are in the learning process and the machine learning coach makes its changes to the student learning system(s) (e.g., hyperparameters, structural modifications, etc.) while the student learning system(s) is/are in the learning process”. In addition, claims 25 and 35 recite “a learning coach ML system that is in communication with the student ML system, wherein: the learning coach ML system has been trained through machine learning to determine one or more revised hyperparameter values for the student ML system”.
In addition, the references used comprise all of the argued elements and the communication between them, as follows:
Li, in Fig. 3, the abstract, and paragraphs 0005 and 0032-0043, teaches a reference system for generating the training data and an iterative process to train the student DNN, and, in Fig. 1, teaches a learning experimentation system. Li in paragraph 0026 teaches the student DNN and the teacher DNN, both of which are classifiers.
Rosswog, in Fig. 2 and Col. 12, teaches that the learning experimentation system transmits a control parameter to the reference system based on the observations.
Paquet in paragraph 0023 teaches a reference system comprising multiple automated classifiers 32 and human classifiers configured to produce classifications for content items 14.
Paquet in paragraph 0032 also teaches a control parameter (a classification confidence threshold 54).
Further, in paragraphs 0023 and 0032, Paquet teaches that the probability at which classifications generated by these classifiers are selected for inclusion in the set of additional training data is based on the control parameter (classification confidence threshold 54): the classification confidence threshold 54 is used to determine which content items (those whose generated classifications are below the classification confidence threshold 54) are selected to be provided to the human classifier to classify and are used in supplemental training.
Li teaches a process of iteratively training a student DNN from a trained teacher DNN. At each iteration, the student DNN is trained using the output from the teacher DNN, and the outputs from the student DNN and the teacher DNN are compared by the evaluating component to determine convergence. If the output has not converged, iteration may continue to further train the student to approximate the teacher, and the teacher DNN will process the input to generate another output (additional training data) that is used to train/retrain the student DNN.
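The iterative teacher-student convergence process characterized above can be sketched in simplified form. This is an illustrative sketch only, not code from Li or any other cited reference; the update rule, values, and tolerance are hypothetical stand-ins for actual gradient-based training, and the divergence check merely mirrors the convergence test described above.

```python
import math

def softmax(logits):
    """Convert logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL divergence between two discrete distributions (illustrative)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical fixed teacher output distribution for one input.
teacher_logits = [2.0, 0.5, -1.0]
teacher_out = softmax(teacher_logits)

# Untrained student; "training" here just nudges the student toward the
# teacher, as a stand-in for gradient descent on the divergence.
student_logits = [0.0, 0.0, 0.0]
lr, tol = 0.5, 1e-6

for iteration in range(1000):
    student_out = softmax(student_logits)
    error = kl_divergence(teacher_out, student_out)  # compare outputs
    if error < tol:                                  # converged: trained
        break
    # Not converged: continue iterating so the student approximates
    # the teacher's output distribution.
    student_logits = [s + lr * (t - s)
                      for s, t in zip(student_logits, teacher_logits)]

print(error < tol)
```

Under these assumptions, iteration continues until the student's output distribution converges with the teacher's, matching the convergence-driven loop described above.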
Li, however, is silent on the teacher DNN (reference system) receiving a signal (control parameter) from the evaluating component (the learning experimentation system) to complete another iteration to generate another output (additional training data) for training the student DNN.
Rosswog, on the other hand, teaches that a seed set generator 230 (reference system) receives a control parameter (to increase the training data) from the learning experimentation system to generate additional training data to retrain machine learning algorithm 252 to improve its categorization performance. Therefore, adding Rosswog to Li would help the system of Li generate additional data to train the student model upon receiving a signal (control parameter) that is generated based on observations (feedback).
Li (as modified) thus teaches that the student and teacher DNNs/classifiers, for each iteration, process the input to generate outputs, which are compared to determine whether further training is needed; if the output has not converged, additional data is generated by the reference system based on a control parameter to train the student DNN. Li (as modified), however, is silent on “the reference system comprises two or more classifiers for classifying input data” and “the control parameter controls a probability at which classifications produced by each of the two or more classifiers of the reference system are selected for inclusion in the set of additional training data for the student ML system”.
Paquet in paragraph 0023 teaches a reference system comprising multiple automated classifiers 32 and human classifiers configured to produce classifications for content items 14.
Paquet in paragraph 0032 also teaches a control parameter (a classification confidence threshold 54).
Further, in paragraphs 0023 and 0032, Paquet teaches that the probability at which classifications generated by these classifiers are selected for inclusion in the set of additional training data is based on the control parameter (classification confidence threshold 54): the classification confidence threshold 54 is used to determine which content items (those whose generated classifications are below the classification confidence threshold 54) are selected to be provided to the human classifier to classify and are used in supplemental training.
The control parameter (classification confidence threshold 54) controls a probability that content items whose classifications were generated by the classifiers (automated and human classifiers) are selected for addition to the training data set by comparing the classification confidence with the threshold; a classification whose confidence is lower than the threshold is added to the training data set. It can be seen that if the classification confidence threshold 54 is set higher, the probability that a content item's classification score falls below the threshold is higher, resulting in more content items (those whose generated classifications are below the classification confidence threshold 54) being provided to the human classifier and later added to the supplemental training data set.
Combining Paquet with Li (as modified) would help the system control the amount of classification data used in generating additional training data to train the student DNN, allowing the student DNN to produce more accurate output.
Therefore, the combination of Li, Rosswog and Paquet does teach the claim limitations.
Applicant argues (Pages 13-15)
The Dependent Claims Are Patentable for At Least the Same Reasons and Further Recite Additional Limitations Not Taught or Suggested by the Cited References
Claims 19-27 and 29-37 depend, directly or indirectly, from independent claims 18 and 28. Because the cited references fail to teach or suggest the limitations of the independent claims, the dependent claims are likewise patentable for at least the same reasons. See MPEP § 2143.03.
In addition, the dependent claims recite further structural and functional limitations that are not taught or suggested by the cited references.
In response
As explained above, the combination of Li, Rosswog and Paquet teaches the claim limitations of the independent claims.
Also, the argument does not provide any details or evidence as to why the cited references fail to teach the limitations. Because rationale and evidence have been provided in the section regarding 35 U.S.C. 103, for this argument to be persuasive, more argumentation is required beyond a simple assertion.
Applicant should point out disagreements with the examiner's contentions. Applicant must also discuss the references applied against the claims, explaining how the claims avoid the references or distinguish from them.
Applicant further argues
Claims 20 and 30 require that an architecture of the reference system is the same as an architecture of the student ML system. None of Li, Rosswog, or Paquet discloses or suggests configuring a multi-classifier reference system having the same architecture as the student system in the context of the claimed feedback-controlled probabilistic training-data generation framework.
In response
In the response above, the examiner has explained why the cited references teach the claimed feedback-controlled probabilistic training-data generation framework, and, in the 103 rejections section below, Paquet in paragraph 0023 teaches multiple automated classifiers having the same architecture; thus, the combination of the cited references teaches the limitations of claims 20 and 30.
Applicant further argues
Claims 21, 26, 31, and 36 require, inter alia, specific computational relationships between the student and reference systems, including that the student performs a superset of computations of the reference system and that the student is trained to imitate the reference system to transfer learning. The cited references do not disclose or suggest this claimed computational relationship within the context of the three-system architecture recited in the independent claims.
In response
In the response above, the examiner has explained why the cited references teach the claimed feedback-controlled probabilistic training-data generation framework. The examiner has further added the Aslan reference to teach the limitations of the argued claims. (Please see the 103 rejections below for details.)
Also, the argument does not provide any details or evidence why the cited references fail to teach the limitations. As rationale and evidence has been provided in the section regarding 35 U.S.C. 103, for this argument to be persuasive, some more argumentation is required beyond a simple assertion.
Applicant should point out disagreements with the examiner's contentions. Applicant must also discuss the references applied against the claims, explaining how the claims avoid the references or distinguish from them.
Applicant further argues
Claims 22-24 and 32-34 recite specific neural-network structural features, including input layers, inner layers, node activations, learned parameters, and iterative updating of parameters. While Bazrafkan and Chaudhari may disclose neural-network components in isolation, they do not disclose or suggest these structural features within the claimed architecture in which a learning experimentation system dynamically controls probabilistic selection of classified data from multiple classifiers to generate additional training data for a student system.
In response
As mentioned above, the limitation of “a learning experimentation system dynamically controls probabilistic selection of classified data” is taught by Paquet, while the limitations recited in the argued claims above are taught by Li, Rosswog, and Paquet in view of Bazrafkan et al. and further in view of Chaudhari et al. (Please see the 103 rejections below for details.)
Also, the argument does not provide any details or evidence why the cited references fail to teach the limitations. As rationale and evidence has been provided in the section regarding 35 U.S.C. 103, for this argument to be persuasive, some more argumentation is required beyond a simple assertion.
Applicant should point out disagreements with the examiner's contentions. Applicant must also discuss the references applied against the claims, explaining how the claims avoid the references or distinguish from them.
Applicant further argues
Claims 25 and 35 further recite a learning coach ML system configured to determine revised hyperparameter values for the student ML system based on internal state observations of the student system during training on the additional training data generated by the reference system. The Office Action cites Zoph and Li 2 for these limitations. However, neither reference discloses or suggests the claimed integration of a learning coach ML system with the three component feedback-controlled architecture of the independent claims. In particular, neither reference teaches determining revised hyperparameters for a student ML system based on internal state observations obtained during training on additional training data that are themselves probabilistically selected under control of a transmitted control parameter.
In response
As explained above, the limitations regarding “the three-component feedback-controlled architecture of the independent claims, and additional training data that are themselves probabilistically selected under control of a transmitted control parameter” are taught by the combination of Li, Rosswog, and Paquet. Zoph and Li 2 are added to read on the limitations of claims 25 and 35 only. (Please see the 103 rejections below for details.)
Applicant further argues
Claims 27 and 37 further specify that the control parameter is tunable and that the reference system randomly selects classifications from the two or more classifiers, wherein the probability controlled by the control parameter governs that random selection. As discussed above, the cited references do not disclose or suggest this probabilistic selection mechanism under control of a transmitted parameter. The additional references cited for these claims do not remedy that deficiency.
Accordingly, the dependent claims are not rendered obvious by the cited combinations for at least the reasons discussed above and further recite additional limitations not taught or suggested by the applied references. Withdrawal of the rejections of claims 19-37 is therefore respectfully requested.
In response
The limitations of claims 27 and 37 are similar to the limitations of the independent claims and are therefore rejected for the same reasons. In addition, the limitation of “the control parameter is tunable” is taught by Paquet in paragraphs 0030-0032. (Please see the 103 rejections below for details.)
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 18-20, 27-30, and 37 are rejected under 35 U.S.C. 103 as being unpatentable over Li et al. (US Pub. 2016/0078339) in view of Rosswog et al. (US Patent 9,514,414) and further in view of Paquet et al. (US Pub. 2012/0158620).
As per claim 18, Li teaches a machine learning (ML) computer system comprising [abstract, “Systems and methods are provided for generating a DNN classifier by "learning" a "student" DNN model from a larger more accurate "teacher" DNN model”]:
a student ML system that is iteratively trained through machine learning on training data to perform a machine learning task [Fig. 3, abstract, “an iterative process is applied to train the student DNN by minimize the divergence of the output distributions from the teacher and student DNN models”; paragraph 0005, “To learn a DNN with a smaller number of hidden nodes, a larger size (more accurate) "teacher" DNN is used to train the smaller "student" DNN … The student DNN may be trained from un-labeled (or un-transcribed) data … The student DNN may be iteratively optimized until its output converges with the output of the teacher DNN”];
a reference system for generating the training data for the student ML system [Fig. 3, paragraphs 0042-0043, “system 300 for learning a smaller student DNN from a larger teacher DNN … teacher DNN 302 comprises a trained DNN model … Initially, student DNN 301 is untrained or may be pre-trained, but has not yet been trained by the teacher DNN … for each iteration, a small piece of unlabeled (or un-transcribed) data 310 is provided to both student DNN 301 and teacher DNN 302. Using forward propagation, the posterior distribution (output distribution 351 and 352) is determined. An error signal 360 is then determined from the distribution 351 and 352. The error signal may be calculated by determining the KL divergence between distributions 351 and 352 … If the output distribution 351 of student DNN 301 has converged with the output distribution 352 of teacher DNN 302, then the student DNN is deemed to be trained. However, if the output has not converged, and in some embodiments the output still appears to be converging, then the student DNN 301 is trained based on the error”; paragraphs 0026-0027, “As shown in FIG. 1, storage 106 includes DNN models 107 and 109. DNN model 107 represents a teacher DNN model, and DNN model 109 represents a student DNN model having a smaller size than teacher DNN model 107 … DNN models (or DNN classifiers) … The DNN model generator 120, in general, is responsible for generating DNN models, such as the CD-DNN-HMM classifiers”; The examiner interprets the teacher DNN (classifier) as a reference system. It can be seen that Fig. 3 and the above cited paragraphs disclose a process of training a student DNN using the teacher DNN. At each iteration, the teacher DNN (reference system) processes input data to generate an output which is used to train the student DNN. 
The examiner interprets the output generated by the teacher DNN (reference system) at each iteration as the training data for the student DNN], wherein the reference system is for generating the training data for the ML system based on classifications that the reference system produces for corresponding inputs [Fig. 3, paragraphs 0042-0043, “system 300 for learning a smaller student DNN from a larger teacher DNN … teacher DNN 302 comprises a trained DNN model … Initially, student DNN 301 is untrained or may be pre-trained, but has not yet been trained by the teacher DNN … for each iteration, a small piece of unlabeled (or un-transcribed) data 310 is provided to both student DNN 301 and teacher DNN 302. Using forward propagation, the posterior distribution (output distribution 351 and 352) is determined. An error signal 360 is then determined from the distribution 351 and 352. The error signal may be calculated by determining the KL divergence between distributions 351 and 352 … If the output distribution 351 of student DNN 301 has converged with the output distribution 352 of teacher DNN 302, then the student DNN is deemed to be trained. However, if the output has not converged, and in some embodiments the output still appears to be converging, then the student DNN 301 is trained based on the error”; As explained above, Fig. 3 and the above cited paragraphs disclose a process of training a student DNN using the teacher DNN. At each iteration, the teacher DNN (reference system) process input data to generate an output which is used to train the student DNN, and since the teacher DNN (reference system) is a classifier, it generates a classification output for corresponding input]; and
a learning experimentation system [Fig. 1 shows the evaluating component 128], wherein:
the learning experimentation system comprises a computer [Fig. 7, paragraph 0021, “The components shown in FIG. 1 may be implemented on or using one or more computing devices, such as computing device 700 described in connection to FIG. 7”];
the learning experimentation system receives observations from the student ML system during training of the student ML system [paragraphs 0036-0037, “evaluating component 128 evaluates the output distributions of the student and teacher DNNs, determines the difference (which may be determined as an error signal) between the outputs and also determines whether the student is continuing to improve or whether the student is no longer improving … evaluating component 128 determine whether to complete another iteration … evaluating component 128 determines whether to continue iterating based on whether the student is continuing to show improvement … evaluating component 128 apply a threshold to determine convergence of the teacher DNN and student DNN output distributions. Where the threshold is not satisfied, iteration may continue, thereby further training the student to approximate the teacher. Where the threshold is satisfied, then convergence is determined (indicating the student output distribution is sufficiently close enough to the teacher DNN's output distribution) and the student DNN may be considered trained … evaluating component 128 evaluates the student DNN according to the methods 500 and 600”];
the reference system generates a set of additional training data for the student ML system [paragraph 0005, “The student DNN may be iteratively optimized until its output converges with the output of the teacher DNN”; Fig. 3, paragraphs 0034 and 0043, disclose a process of training the student DNN, where, at each iteration, unlabeled input data is provided to both student DNN and teacher DNN (reference system), the outputs from the DNNs are compared to determine the error, if the output of the student DNN has not converged with the output of the teacher DNN, then the student is trained again (complete another iteration) using the output generated by the teacher DNN (reference system) which examiner interprets as the additional training data for the student DNN]; and
the set of additional training data is input to the student ML system such that the set of additional training data trains the student ML system to imitate classifications by the reference system [Fig. 3, paragraphs 0034 and 0043, disclose a process of training the student DNN, where, at each iteration, unlabeled input data is provided to both student DNN and teacher DNN (reference system), the outputs from the DNNs are compared to determine the error, if the output of the student DNN has not converged with the output of the teacher DNN, then the student is trained again (for another iteration) using the output generated by the teacher DNN (reference system) as a training input; paragraph 0005, “The student DNN may be iteratively optimized until its output converges with the output of the teacher DNN. In this way, the student DNN approaches the behavior of the teacher, so that whatever the output of the teacher, the student will approximate”; It can be seen that the student DNN is iteratively trained using the outputs generated by the teacher DNN (additional training data) as input until its output converges with the output of the teacher DNN (reference system); paragraph 0027, “training an initialized "student" DNN model to approximate a trained teacher DNN model having a larger model size (e.g. number of parameters) than the student”].
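For clarity of record, the teacher-student training loop mapped above (Li, Fig. 3: error signal 360 computed as the KL divergence between output distributions 351 and 352, iterating until convergence) can be illustrated with a minimal sketch. This is an illustrative aid only and not part of the Li disclosure: the toy teacher distribution, the learning rate, the convergence threshold, and the gradient-step update rule are all assumptions.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two output (posterior) distributions."""
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy "teacher" (already-trained reference system): a fixed posterior.
teacher_dist = softmax(np.array([2.0, 0.5, -1.0]))

# Toy "student": logits iteratively updated toward the teacher's posterior.
student_logits = np.zeros(3)

threshold = 1e-4   # convergence threshold (assumption)
lr = 0.5           # learning rate (assumption)
for _ in range(1000):
    student_dist = softmax(student_logits)                # dist. 351
    error = kl_divergence(teacher_dist, student_dist)     # error signal 360
    if error < threshold:                                 # convergence test
        break
    # gradient of KL(teacher || student) w.r.t. the student logits
    student_logits -= lr * (student_dist - teacher_dist)
# After the loop, the student's output distribution has converged
# with the teacher's, and the student is deemed trained.
```

Each iteration nudges the student's logits toward the teacher's posterior, and training stops once the KL-divergence error signal falls below the threshold, mirroring the convergence test Li describes.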
Li does not teach
wherein the reference system comprises two or more classifiers each configured to produce classifications for corresponding inputs;
based on the observations, the learning experimentation system transmits a control parameter to the reference system;
based on the control parameter, the reference system generates a set of additional training data for the student ML system, wherein the control parameter controls a probability at which classifications produced by each of the two or more classifiers of the reference system are selected for inclusion in the set of additional training data for the student ML system;
Rosswog teaches
based on the observations, the learning experimentation system transmits a control parameter to the reference system [Fig. 2, Col. 12, lines 37-66, “document categorizer 250 may include a performance tracker 254 that tracks one or more metrics associated with the performance of document categorizer 250's categorizations. The metrics may include the number of electronic documents categorized in each category (e.g., relevant and not relevant), the confidence modifiers of all the categorized electronic documents … document categorizer 250 may send an indication to seed set generator 230 (included in an admin subsystem 111) that a second or subsequent seed set of electronic document classifications is needed (control parameter- increasing the training data) to retrain machine learning algorithm 252 to improve its categorization performance”; Examiner interprets the admin subsystem 111, which comprises a seed set generator 230, as a reference system that generates the training data];
based on the control parameter, the reference system generates a set of additional training data for the student ML system [Col. 12, line 37 – Col. 13, lines 1-2, “document categorizer 250 may include a performance tracker 254 that tracks one or more metrics associated with the performance of document categorizer 250's categorizations … document categorizer 250 may send an indication to seed set generator 230 (included in an admin subsystem 111) that a second or subsequent seed set of electronic document classifications is needed (control parameter- increasing the training data) to retrain machine learning algorithm 252 to improve its categorization performance … seed set generator 230 may generate additional seed sets based on the metrics tracked by performance tracker 254”];
Li teaches a process of iteratively training a student DNN from a trained teacher DNN: at each iteration, the student DNN is trained using the output from the teacher DNN, and the outputs from the student DNN and the teacher DNN are compared by the evaluating component to determine convergence. If the output has not converged, iteration may continue to further train the student to approximate the teacher, and the teacher DNN will process the input to generate another output (additional training data) that is used to train/retrain the student DNN.
Li, however, is silent as to the teacher DNN (reference system) receiving a signal (control parameter) from the evaluating component (the learning experimentation system) to complete another iteration and generate another output (additional training data) for training the student DNN.
Rosswog, in contrast, teaches a seed set generator 230 (reference system) that receives a control parameter (to increase the training data) from the learning experimentation system and generates additional training data to retrain machine learning algorithm 252 to improve its categorization performance.
Therefore, the combination of Li and Rosswog teaches the above claim limitation.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to modify the method of Li for generating a DNN classifier by learning a student DNN model from a larger teacher DNN model so that the reference system receives, from the learning experimentation system, a control parameter for generating a set of additional training data, as taught by Rosswog. Doing so would help in training or retraining a network with the additional training data generated by the reference system, improving the performance of the model (Rosswog, Col. 5, lines 49-53).
Li and Rosswog do not teach
the reference system comprises two or more classifiers each configured to produce classifications for corresponding inputs;
the control parameter controls a probability at which classifications produced by each of the two or more classifiers of the reference system are selected for inclusion in the set of additional training data for the student ML system;
Paquet teaches
the reference system comprises two or more classifiers each configured to produce classifications for corresponding inputs [paragraph 0023, “the automated classifier 32 is developed using a training set 34 comprising a set of content items 14 for which an authoritative and reliable classification 18 into one or more categories… The training set 34 may be generated … by utilizing a sample content set 12 prepared by another automated classifier 32”; paragraph 0006, “Content items having a low classification confidence may be selected and provided to human classifiers, who may identify one or more categories that are associated with the content item. These human-selected classifications of content items may therefore be utilized as a new training set”]; It can be seen that the system of Paquet comprises multiple automated classifiers 32 and human classifiers configured to produce classifications for content items 14;
the control parameter [paragraph 0032, “a classification confidence threshold 54 defined by the device 82”] controls a probability at which classifications produced by each of the two or more classifiers of the reference system are selected for inclusion in the set of additional training data for the student ML system [paragraph 0023, “the automated classifier 32 is developed using a training set 34 … The training set 34 may be generated … by utilizing a sample content set 12 prepared by another automated classifier 32”; Figs 2 and 4, paragraph 0030, “The automated classifier 32 may be invoked to perform a classification 18 of content items 14 of a content set 12 ( such as the three content items 14 identified in this exemplary scenario 50 as "A", "B", and "C"), and each classification 18 may result in an identified association with one or more categories 16, and also a classification confidence 52 (e.g., computed as a probability between 0.00, indicating no confidence, and 1.00, indicating absolute confidence). An embodiment of these techniques may compare the classification confidence 52 of each classification 18 with a classification confidence threshold 54 (e.g., a 0.50 probability) that distinguishes acceptably confident classifications 18 from unacceptably confident classifications 18. For example, the content item 14 identified as "B" may be classified with a classification confidence 52 of 0.96 that well exceeds a defined classification confidence threshold 54 of 0.50, while the content items 14 identified as "A" and "C" may be classified with unacceptably low classification confidences 52 of 0.24 and 0.03. Accordingly, an embodiment of these techniques may select these content items 14 for inclusion in a supplemental training set 34, and may provide this training set 34 to a human classifier 20 for classification 18. 
After the human classifier 20 identifies one or more categories 16 associated with each content item 14, these associations may be used in a supplemental training 36 in order to improve the proficiency of the automated classifier 32 in classifying these types of content items 14. (The supplemental training 36 … include … the content items 14 from the initial training set 34, and/or from previously generated supplemental training sets 34.) In this manner, the supplementally trained automated classifier 32 may therefore exhibit a wider range of acceptably accurate classifications 18”; Examiner interprets the classification confidence threshold 54 as a control parameter. Based on the citations above, it can be seen that the system of Paquet comprises at least two classifiers, including the automated classifier 32 and the human classifier 20, configured to produce classifications for corresponding inputs (items 14), and that the probability at which classifications generated by these classifiers are selected for inclusion in the set of additional training data is based on the control parameter (classification confidence threshold 54).
As recited in the paragraphs of Paquet cited above, the classification confidence threshold 54 is used to determine which content items (those whose generated classifications fall below the classification confidence threshold 54) are selected to be provided to the human classifier for classification and used in a supplemental training. It can be seen that the control parameter (classification confidence threshold 54) controls a probability that content items whose classifications are generated by the classifiers (automated and human classifiers) are selected for addition to the training data set: if the classification confidence threshold 54 is set higher, the probability that a content item's classification score falls below the threshold is higher, resulting in more content items (those with generated classifications below the classification confidence threshold 54) being provided to the human classifier and later added to the supplemental training data set].
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to modify the method of Li (as modified) for generating a DNN classifier by learning a student DNN model from a larger teacher DNN model to include two or more classifiers for classifying the input data to generate classified data, with the control parameter controlling a probability at which classifications produced by each of the two or more classifiers of the reference system are selected for inclusion in the set of additional training data for the student ML system, as taught by Paquet. Doing so would help classify content items with an acceptable classification confidence and accuracy (Paquet, 0029).
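The Paquet selection mechanism relied on above (classification confidence threshold 54 routing low-confidence items to the human classifier 20 before inclusion in the supplemental training set 34) can be sketched as follows. This is a hedged illustration only: the function names (`build_supplemental_set`, `human_classify`, `auto_classify`) and the uppercase re-labeling stand-in for human classification are assumptions; only the 0.50 threshold and the 0.24/0.96/0.03 confidences are taken from Paquet's "A"/"B"/"C" example.

```python
def human_classify(item):
    # Stand-in for human classifier 20 (assumption): returns the
    # human-selected classification for a difficult-to-classify item.
    return item.upper()

def build_supplemental_set(items, auto_classify, threshold=0.50):
    """Route each item by confidence: classifications at or above the
    classification confidence threshold keep the automated label; those
    below it are re-labeled by the human classifier before inclusion
    in the supplemental training set."""
    training_set = []
    for item in items:
        label, confidence = auto_classify(item)
        if confidence >= threshold:
            training_set.append((item, label))                  # confident
        else:
            training_set.append((item, human_classify(item)))   # re-labeled
    return training_set

# Mirrors Paquet's example: confidences 0.24 ("A"), 0.96 ("B"), 0.03 ("C").
confidences = {"a": 0.24, "b": 0.96, "c": 0.03}
auto = lambda item: (item, confidences[item])
result = build_supplemental_set(["a", "b", "c"], auto)
# "b" keeps its automated label; "a" and "c" are human-classified.
```

Raising the threshold routes more items to the human classifier, which is the sense in which the threshold acts as a control parameter over what enters the supplemental training set.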
As per claim 19, Li, Rosswog and Paquet teach the ML computer system of claim 18.
Li further teaches
the observations are a learning behavior and performance of the student ML system [paragraphs 0036-0037, “evaluating component 128 evaluates the output distributions of the student and teacher DNNs, determines the difference (which may be determined as an error signal) between the outputs and also determines whether the student is continuing to improve or whether the student is no longer improving … evaluating component 128 determine whether to complete another iteration … evaluating component 128 determines whether to continue iterating based on whether the student is continuing to show improvement … evaluating component 128 apply a threshold to determine convergence of the teacher DNN and student DNN output distributions. Where the threshold is not satisfied, iteration may continue, thereby further training the student to approximate the teacher. Where the threshold is satisfied, then convergence is determined (indicating the student output distribution is sufficiently close enough to the teacher DNN's output distribution) and the student DNN may be considered trained … evaluating component 128 evaluates the student DNN according to the methods 500 and 600”; paragraph 0005, “The student DNN may be iteratively optimized until its output converges with the output of the teacher DNN. In this way, the student DNN approaches the behavior of the teacher, so that whatever the output of the teacher, the student will approximate”].
As per claim 20, Li, Rosswog and Paquet teach the ML computer system of claim 18.
Paquet further teaches
an architecture of the reference system is the same as an architecture of the student ML system [paragraph 0023, “an automated classifier 32, such as an artificial neural network (student ML system) … the automated classifier 32 is developed using a training set 34 … (wherein) The training set 34 may be generated … by utilizing a sample content set 12 prepared by another automated classifier 32”; Examiner interprets the automated classifier 32 that “is developed using a training set 34” as the student ML system, and the “another automated classifier 32” that generated the training set as the classifier included in the reference system; the reference system (comprising another automated classifier 32) thus has an architecture that is the same as the architecture of the student ML system (an automated classifier 32)].
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to modify the method of Li for generating a DNN classifier by learning a student DNN model from a larger teacher DNN model to include an architecture of the reference system that is the same as an architecture of the student ML system, as taught by Paquet. Doing so would help determine the categories of the content items of the training data within an acceptable range (Paquet, 0023).
As per claim 27, Li, Rosswog and Paquet teach the ML computer system of claim 18.
Paquet further teaches
wherein the control parameter is tunable [paragraph 0032, “a classification confidence threshold 54 defined by the device 82”; paragraph 0030, “compare the classification confidence 52 of each classification 18 with a classification confidence threshold 54 (e.g., a 0.50 probability)”; Examiner interprets the classification confidence threshold 54 as a control parameter, and it can be seen that the control parameter can be adjusted], wherein:
the reference system randomly selects the classifications from the two or more classifiers as the set of additional training data for the student ML system [paragraph 0006, “These human-selected classifications of content items may therefore be utilized as a new training set to retrain the automated classifier in order to achieve an accurate classification of the difficult-to-classify content items”; Figs 2 and 4, paragraph 0030, “The automated classifier 32 may be invoked to perform a classification 18 of content items 14 of a content set 12 (such as the three content items 14 identified in this exemplary scenario 50 as "A", "B", and "C"), and each classification 18 may result in an identified association with one or more categories 16, and also a classification confidence 52 (e.g., computed as a probability between 0.00, indicating no confidence, and 1.00, indicating absolute confidence). An embodiment of these techniques may compare the classification confidence 52 of each classification 18 with a classification confidence threshold 54 (e.g., a 0.50 probability) that distinguishes acceptably confident classifications 18 from unacceptably confident classifications 18. For example, the content item 14 identified as "B" may be classified with a classification confidence 52 of 0.96 that well exceeds a defined classification confidence threshold 54 of 0.50, while the content items 14 identified as "A" and "C" may be classified with unacceptably low classification confidences 52 of 0.24 and 0.03. Accordingly, an embodiment of these techniques may select these content items 14 for inclusion in a supplemental training set 34, and may provide this training set 34 to a human classifier 20 for classification 18.
After the human classifier 20 identifies one or more categories 16 associated with each content item 14, these associations may be used in a supplemental training 36 in order to improve the proficiency of the automated classifier 32 in classifying these types of content items 14. (The supplemental training 36 … include … the content items 14 from the initial training set 34, and/or from previously generated supplemental training sets 34.) In this manner, the supplementally trained automated classifier 32 may therefore exhibit a wider range of acceptably accurate classifications 18”; It can be seen that classification is performed on the content items and each classification is assigned a confidence score (probability); any classification whose confidence is below the classification confidence threshold 54 defined by the device 82 is selected and sent to a human classifier so that a category associated with the content item is identified, and a new training set is generated including the content items from the initial training set. For example, when the automated classifier 32 performs classification on the content items, the content item identified as "B", which is classified with a classification confidence of 0.96 that well exceeds the defined classification confidence threshold of 0.50, is selected from the automated classifier 32, while the content items 14 identified as "A" and "C", which are classified with unacceptably low classification confidences of 0.24 and 0.03, are sent to the human classifier to identify the classes associated with the content items; the classified data from the human classifier is then combined with the previously selected classified data from the automated classifier 32 to generate a new training set], and
the probability controlled by the control parameter governs the random selection of the classifications from each of the two or more classifiers [paragraph 0023, “the automated classifier 32 is developed using a training set 34 … The training set 34 may be generated … by utilizing a sample content set 12 prepared by another automated classifier 32”; Figs 2 and 4, paragraph 0030, “The automated classifier 32 may be invoked to perform a classification 18 of content items 14 of a content set 12 ( such as the three content items 14 identified in this exemplary scenario 50 as "A", "B", and "C"), and each classification 18 may result in an identified association with one or more categories 16, and also a classification confidence 52 (e.g., computed as a probability between 0.00, indicating no confidence, and 1.00, indicating absolute confidence). An embodiment of these techniques may compare the classification confidence 52 of each classification 18 with a classification confidence threshold 54 (e.g., a 0.50 probability) that distinguishes acceptably confident classifications 18 from unacceptably confident classifications 18. For example, the content item 14 identified as "B" may be classified with a classification confidence 52 of 0.96 that well exceeds a defined classification confidence threshold 54 of 0.50, while the content items 14 identified as "A" and "C" may be classified with unacceptably low classification confidences 52 of 0.24 and 0.03. Accordingly, an embodiment of these techniques may select these content items 14 for inclusion in a supplemental training set 34, and may provide this training set 34 to a human classifier 20 for classification 18. After the human classifier 20 identifies one or more categories 16 associated with each content item 14, these associations may be used in a supplemental training 36 in order to improve the proficiency of the automated classifier 32 in classifying these types of content items 14. 
(The supplemental training 36 … include … the content items 14 from the initial training set 34, and/or from previously generated supplemental training sets 34.) In this manner, the supplementally trained automated classifier 32 may therefore exhibit a wider range of acceptably accurate classifications 18”; Examiner interprets the classification confidence threshold 54 as a control parameter. Based on the citations above, it can be seen that the system of Paquet comprises at least two classifiers, including the automated classifier 32 and the human classifier 20, configured to produce classifications for corresponding inputs (items 14), and that the probability at which classifications generated by these classifiers are selected for inclusion in the set of additional training data is based on the control parameter (classification confidence threshold 54). As recited in the paragraphs of Paquet cited above, the classification confidence threshold 54 is used to determine which content items (those whose generated classifications fall below the classification confidence threshold 54) are selected to be provided to the human classifier for classification and used in a supplemental training. It can be seen that the control parameter (classification confidence threshold 54) controls a probability that content items whose classifications are generated by the classifiers (automated and human classifiers) are selected for addition to the training data set: if the classification confidence threshold 54 is set higher, the probability that a content item's classification score falls below the threshold is higher, resulting in more content items (those with generated classifications below the classification confidence threshold 54) being provided to the human classifier and later added to the supplemental training data set].
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to modify the method of Li (as modified) for generating a DNN classifier by learning a student DNN model from a larger teacher DNN model to include randomly selecting the classifications from the two or more classifiers as the set of additional training data for the student ML system, with the probability controlled by the control parameter governing the random selection of the classifications from each of the two or more classifiers, as taught by Paquet. Doing so would help classify content items with an acceptable classification confidence and accuracy (Paquet, 0029).
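The probabilistic selection recited in claims 18 and 27, as read above, amounts to a per-input random draw governed by the transmitted control parameter. A minimal sketch under that reading follows; the function name `select_classifications`, the Bernoulli draw, and the two stand-in classifiers are assumptions for illustration, not disclosures of Li, Rosswog, or Paquet.

```python
import random

def select_classifications(inputs, classifier_a, classifier_b, p_a, seed=None):
    """For each input, randomly select which classifier's classification
    is included in the set of additional training data; p_a plays the
    role of the control parameter (the probability of selecting
    classifier A's classification, with 1 - p_a for classifier B's)."""
    rng = random.Random(seed)  # seeded for reproducibility
    return [
        (x, classifier_a(x) if rng.random() < p_a else classifier_b(x))
        for x in inputs
    ]

# Each input's classification is drawn from classifier A with
# probability p_a and from classifier B otherwise.
selected = select_classifications(
    ["x1", "x2", "x3"],
    classifier_a=lambda x: ("A", x),
    classifier_b=lambda x: ("B", x),
    p_a=0.7,
    seed=42,
)
```

Tuning `p_a` shifts the mix of the two classifiers' outputs in the resulting training set, which is the behavior the claim language attributes to the control parameter.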
As per claim 28, Li teaches a computerized method of improving operation of a machine learning (ML) system, the method comprising [abstract, “Systems and methods are provided for generating a DNN classifier by "learning" a "student" DNN model from a larger more accurate "teacher" DNN model”]:
initial training, iteratively through machine learning, by a computer system, a student ML system on training data to perform a machine learning task [Fig. 3, abstract, “an iterative process is applied to train the student DNN by minimizing the divergence of the output distributions from the teacher and student DNN models”; paragraph 0005, “To learn a DNN with a smaller number of hidden nodes, a larger size (more accurate) "teacher" DNN is used to train the smaller "student" DNN … The student DNN may be trained from un-labeled (or un-transcribed) data … The student DNN may be iteratively optimized until its output converges with the output of the teacher DNN”],
generating, by a reference system of the computer system, the training data for the student ML system [Fig. 3, paragraphs 0042-0043, “system 300 for learning a smaller student DNN from a larger teacher DNN … teacher DNN 302 comprises a trained DNN model … Initially, student DNN 301 is untrained or may be pre-trained, but has not yet been trained by the teacher DNN … for each iteration, a small piece of unlabeled (or un-transcribed) data 310 is provided to both student DNN 301 and teacher DNN 302. Using forward propagation, the posterior distribution (output distribution 351 and 352) is determined. An error signal 360 is then determined from the distribution 351 and 352. The error signal may be calculated by determining the KL divergence between distributions 351 and 352 … If the output distribution 351 of student DNN 301 has converged with the output distribution 352 of teacher DNN 302, then the student DNN is deemed to be trained. However, if the output has not converged, and in some embodiments the output still appears to be converging, then the student DNN 301 is trained based on the error”; paragraphs 0026-0027, “As shown in FIG. 1, storage 106 includes DNN models 107 and 109. DNN model 107 represents a teacher DNN model, and DNN model 109 represents a student DNN model having a smaller size than teacher DNN model 107 … DNN models (or DNN classifiers) … The DNN model generator 120, in general, is responsible for generating DNN models, such as the CD-DNN-HMM classifiers”; The examiner interprets the teacher DNN (classifier) as a reference system. It can be seen that Fig. 3 and the above cited paragraphs disclose a process of training a student DNN using the teacher DNN. At each iteration, the teacher DNN (reference system) processes input data to generate an output that is used to train the student DNN.
The examiner interprets the output generated by the teacher DNN (reference system) at each iteration as the training data for the student DNN], wherein the training data comprises classifications produced by the reference system for corresponding inputs [Fig. 3, paragraphs 0042-0043, “system 300 for learning a smaller student DNN from a larger teacher DNN … teacher DNN 302 comprises a trained DNN model … Initially, student DNN 301 is untrained or may be pre-trained, but has not yet been trained by the teacher DNN … for each iteration, a small piece of unlabeled (or un-transcribed) data 310 is provided to both student DNN 301 and teacher DNN 302. Using forward propagation, the posterior distribution (output distribution 351 and 352) is determined. An error signal 360 is then determined from the distribution 351 and 352. The error signal may be calculated by determining the KL divergence between distributions 351 and 352 … If the output distribution 351 of student DNN 301 has converged with the output distribution 352 of teacher DNN 302, then the student DNN is deemed to be trained. However, if the output has not converged, and in some embodiments the output still appears to be converging, then the student DNN 301 is trained based on the error”; As explained above, Fig. 3 and the above cited paragraphs disclose a process of training a student DNN using the teacher DNN. At each iteration, the teacher DNN (reference system) processes input data to generate an output that is used to train the student DNN, and, since the teacher DNN (reference system) is a classifier, it generates a classification output for each corresponding input]; and
following the initial training of the student ML system:
receiving, by a learning experimentation system of the computer system, observations from the student ML system from training of the student ML system [paragraphs 0036-0037, “evaluating component 128 evaluates the output distributions of the student and teacher DNNs, determines the difference (which may be determined as an error signal) between the outputs and also determines whether the student is continuing to improve or whether the student is no longer improving … evaluating component 128 determine whether to complete another iteration … evaluating component 128 determines whether to continue iterating based on whether the student is continuing to show improvement … evaluating component 128 apply a threshold to determine convergence of the teacher DNN and student DNN output distributions. Where the threshold is not satisfied, iteration may continue, thereby further training the student to approximate the teacher. Where the threshold is satisfied, then convergence is determined (indicating the student output distribution is sufficiently close enough to the teacher DNN's output distribution) and the student DNN may be considered trained … evaluating component 128 evaluates the student DNN according to the methods 500 and 600”];
generating, by the reference system, a set of additional training data for the student ML system [paragraph 0005, “The student DNN may be iteratively optimized until its output converges with the output of the teacher DNN”; Fig. 3, paragraphs 0034 and 0043, disclose a process of training the student DNN in which, at each iteration, unlabeled input data is provided to both the student DNN and the teacher DNN (reference system) and the outputs from the DNNs are compared to determine the error; if the output of the student DNN has not converged with the output of the teacher DNN, then the student is trained again (another iteration) using the output generated by the teacher DNN (reference system), which the examiner interprets as the additional training data for the student DNN];
inputting the set of additional training data to the student ML system; and training the student ML system with the additional training data to imitate classifications by the reference system [Fig. 3, paragraphs 0034 and 0043, disclose a process of training the student DNN in which, at each iteration, unlabeled input data is provided to both the student DNN and the teacher DNN (reference system) and the outputs from the DNNs are compared to determine the error; if the output of the student DNN has not converged with the output of the teacher DNN, then the student is trained again (for another iteration) using the output generated by the teacher DNN (reference system) as a training input; paragraph 0005, “The student DNN may be iteratively optimized until its output converges with the output of the teacher DNN. In this way, the student DNN approaches the behavior of the teacher, so that whatever the output of the teacher, the student will approximate”; It can be seen that the student DNN is iteratively trained using the outputs generated by the teacher DNN (additional training data) as input until its output converges with the output of the teacher DNN (reference system); paragraph 0027, “training an initialized "student" DNN model to approximate a trained teacher DNN model having a larger model size (e.g. number of parameters) than the student”].
Li does not teach
wherein the reference system comprises two or more classifiers each configured to produce classifications for corresponding inputs;
transmitting, by the learning experimentation system, a control parameter based on the observation to the reference system;
generating, by the reference system, a set of additional training data for the student ML system based on the control parameter, wherein the control parameter controls a probability at which classifications produced by each of the two or more classifiers of the reference system are selected for inclusion in the set of additional training data for the student ML system;
Rosswog teaches
transmitting, by the learning experimentation system, a control parameter based on the observation to the reference system [Fig. 2, Col. 12, lines 37-66, “document categorizer 250 may include a performance tracker 254 that tracks one or more metrics associated with the performance of document categorizer 250's categorizations. The metrics may include the number of electronic documents categorized in each category (e.g., relevant and not relevant), the confidence modifiers of all the categorized electronic documents … document categorizer 250 may send an indication to seed set generator 230 (included in an admin subsystem 111) that a second or subsequent seed set of electronic document classifications is needed (control parameter - increasing the training data) to retrain machine learning algorithm 252 to improve its categorization performance”; The examiner interprets admin subsystem 111, which comprises seed set generator 230, as a reference system that generates the training data];
generating, by the reference system, a set of additional training data for the student ML system based on the control parameter [Col. 12, line 37 – Col. 13, lines 1-2, “document categorizer 250 may include a performance tracker 254 that tracks one or more metrics associated with the performance of document categorizer 250's categorizations … document categorizer 250 may send an indication to seed set generator 230 (included in an admin subsystem 111) that a second or subsequent seed set of electronic document classifications is needed (control parameter- increasing the training data) to retrain machine learning algorithm 252 to improve its categorization performance … seed set generator 230 may generate additional seed sets based on the metrics tracked by performance tracker 254”];
Li teaches a process of iteratively training a student DNN from a trained teacher DNN. At each iteration, the student DNN is trained using the output from the teacher DNN, and the outputs of the student DNN and the teacher DNN are compared by the evaluating component to determine convergence. If the outputs have not converged, iteration may continue to further train the student to approximate the teacher, and the teacher DNN processes the input to generate another output (additional training data) that is used to train/retrain the student DNN.
Li, however, is silent regarding the teacher DNN (reference system) receiving a signal (control parameter) from the evaluating component (the learning experimentation system) to complete another iteration and generate another output (additional training data) for training the student DNN.
Rosswog, by contrast, teaches a seed set generator 230 (reference system) that receives a control parameter (to increase the training data) from the learning experimentation system and generates additional training data to retrain machine learning algorithm 252 to improve its categorization performance.
Therefore, the combination of Li and Rosswog teaches the above claim limitation.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to modify the method for generating a DNN classifier by learning a student DNN model from a larger teacher DNN model of Li to include a reference system that receives, from the learning experimentation system, a control parameter for generating a set of additional training data, as taught by Rosswog. Doing so would help train or retrain a network with the additional training data generated by the reference system to improve the performance of the model (Rosswog, Col. 5, lines 49-53).
Li and Rosswog do not teach
the reference system comprises two or more classifiers each configured to produce classifications for corresponding inputs;
the control parameter controls a probability at which classifications produced by each of the two or more classifiers of the reference system are selected for inclusion in the set of additional training data for the student ML system;
Paquet teaches
the reference system comprises two or more classifiers each configured to produce classifications for corresponding inputs [paragraph 0023, “the automated classifier 32 is developed using a training set 34 comprising a set of content items 14 for which an authoritative and reliable classification 18 into one or more categories… The training set 34 may be generated … by utilizing a sample content set 12 prepared by another automated classifier 32”; paragraph 0006, “Content items having a low classification confidence may be selected and provided to human classifiers, who may identify one or more categories that are associated with the content item. These human-selected classifications of content items may therefore be utilized as a new training set”]; It can be seen that the system of Paquet comprises multiple automated classifiers 32 and human classifiers configured to produce classifications for content items 14;
the control parameter [paragraph 0032, “a classification confidence threshold 54 defined by the device 82”] controls a probability at which classifications produced by each of the two or more classifiers of the reference system are selected for inclusion in the set of additional training data for the student ML system [paragraph 0023, “the automated classifier 32 is developed using a training set 34 … The training set 34 may be generated … by utilizing a sample content set 12 prepared by another automated classifier 32”; Figs 2 and 4, paragraph 0030, “The automated classifier 32 may be invoked to perform a classification 18 of content items 14 of a content set 12 ( such as the three content items 14 identified in this exemplary scenario 50 as "A", "B", and "C"), and each classification 18 may result in an identified association with one or more categories 16, and also a classification confidence 52 (e.g., computed as a probability between 0.00, indicating no confidence, and 1.00, indicating absolute confidence). An embodiment of these techniques may compare the classification confidence 52 of each classification 18 with a classification confidence threshold 54 (e.g., a 0.50 probability) that distinguishes acceptably confident classifications 18 from unacceptably confident classifications 18. For example, the content item 14 identified as "B" may be classified with a classification confidence 52 of 0.96 that well exceeds a defined classification confidence threshold 54 of 0.50, while the content items 14 identified as "A" and "C" may be classified with unacceptably low classification confidences 52 of 0.24 and 0.03. Accordingly, an embodiment of these techniques may select these content items 14 for inclusion in a supplemental training set 34, and may provide this training set 34 to a human classifier 20 for classification 18. 
After the human classifier 20 identifies one or more categories 16 associated with each content item 14, these associations may be used in a supplemental training 36 in order to improve the proficiency of the automated classifier 32 in classifying these types of content items 14. (The supplemental training 36 … include … the content items 14 from the initial training set 34, and/or from previously generated supplemental training sets 34.) In this manner, the supplementally trained automated classifier 32 may therefore exhibit a wider range of acceptably accurate classifications 18”; The examiner interprets the classification confidence threshold 54 as a control parameter. Based on the citations above, it can be seen that the system of Paquet comprises at least two classifiers, including the automated classifier 32 and the human classifier 20, configured to produce classifications for corresponding inputs (items 14), and the probability at which classifications generated by these classifiers are selected for inclusion in the set of additional training data is based on the control parameter (classification confidence threshold 54).
As recited in paragraph 0023 of Paquet above, the classification confidence threshold 54 is used to determine which content items (those whose generated classifications fall below the classification confidence threshold 54) are selected for classification by the human classifier and used in a supplemental training. It can thus be seen that the control parameter (classification confidence threshold 54) controls a probability at which classifications generated by the classifiers (automated and human) are selected for addition to the training data set: if the classification confidence threshold 54 is set higher, the probability that a content item's classification confidence falls below the threshold is higher, resulting in more content items (with generated classifications below the classification confidence threshold 54) being provided to the human classifier and later added to the supplemental training data set];
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to modify the method for generating a DNN classifier by learning a student DNN model from a larger teacher DNN model of Li (as modified) to include two or more classifiers for classifying the input data to generate classified data, with the control parameter controlling a probability at which classifications produced by each of the two or more classifiers of the reference system are selected for inclusion in the set of additional training data for the student ML system, as taught by Paquet. Doing so would help classify content items with an acceptable classification confidence and accuracy (Paquet, 0029).
As per claim 29, Li, Rosswog and Paquet teach the method of claim 28.
Li further teaches
the observations are a learning behavior and performance of the student ML system [paragraphs 0036-0037, “evaluating component 128 evaluates the output distributions of the student and teacher DNNs, determines the difference (which may be determined as an error signal) between the outputs and also determines whether the student is continuing to improve or whether the student is no longer improving … evaluating component 128 determine whether to complete another iteration … evaluating component 128 determines whether to continue iterating based on whether the student is continuing to show improvement … evaluating component 128 apply a threshold to determine convergence of the teacher DNN and student DNN output distributions. Where the threshold is not satisfied, iteration may continue, thereby further training the student to approximate the teacher. Where the threshold is satisfied, then convergence is determined (indicating the student output distribution is sufficiently close enough to the teacher DNN' s output distribution) and the student DNN may be considered trained … evaluating component 128 evaluates the student DNN according to the methods 500 and 600”; paragraph 0005, “The student DNN may be iteratively optimized until its output converges with the output of the teacher DNN. In this way, the student DNN approaches the behavior of the teacher, so that whatever the output of the teacher, the student will approximate”].
Claim 30 is substantially similar to claim 20 and thus rejected for similar reasons as claim 20.
Claim 37 is substantially similar to claim 27 and thus rejected for similar reasons as claim 27.
Claims 21, 26, 31 and 36 are rejected under 35 U.S.C. 103 as being unpatentable over Li et al. in view of Rosswog et al. in view of Paquet et al. and further in view of Aslan et al. (US Pub. 2017/0132528).
As per claim 21, Li, Rosswog and Paquet teach the ML computer system of claim 18.
Li, Rosswog and Paquet do not teach
the student ML system performs a first set of computations; the reference system performs a second set of computations; and the first set of computations is a superset of the second set of computations.
Aslan teaches
the student ML system performs a first set of computations; the reference system performs a second set of computations; and the first set of computations is a superset of the second set of computations.
[paragraph 0055, “FIG. 5 is a schematic diagram of another example technique for joint training of multiple machine learning models. In the example of FIG. 5, an ensemble of Q teacher models 500, represented in FIG. 5 as models 500(1), 500(2), .... 500(Q) can be trained in parallel with a student model 502 … each of the Q teacher models 500 is shown as receiving a respective portion 504.1, 504.2, ... 504.Q of a large set of training data 504 … In this example, the training data 504 can be too large for any one machine learning model 500 to handle because the training data 504 can be too large (in terms of storage footprint) to store on any single computing device on which the machine learning models are executed. Accordingly, each of the teacher models 500 in the set of Q teacher models can run on a computing device with respective portion 504.1-504.Q of the training data 504 that can be maintained on the computing device. In this manner, the multiple teacher models 504 can enable a student model 502 to learn from a relatively large set of training data 504 indirectly through the passing of information between the student model 502 and each of the teacher models 500”; The above citation discloses that each teacher model processes a small portion of the training data during training, while the student model learns from the larger full set of training data through the passing of information between the student model and the teacher models. Processing more data requires performing more computations; therefore, the set of computations performed by the student model is a superset of the set of computations performed by each teacher model. Since Li teaches a machine learning system comprising a student ML system (student DNN 301) and a reference system (teacher DNN 302), while Aslan teaches the student ML system performing a first set of computations that is a superset of a second set of computations performed by other systems, the combination of Li and Aslan teaches the above claim limitation].
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to modify the method for generating a DNN classifier by learning a student DNN model from a larger teacher DNN model of Li to include the student ML system performing a first set of computations, the reference system performing a second set of computations, and the first set of computations being a superset of the second set of computations, as taught by Aslan. Doing so would help train the machine learning models such that one of the machine learning models influences the training of the other machine learning model (Aslan, 0005).
As per claim 26, Li, Rosswog and Paquet teach the ML computer system of claim 18.
Li, Rosswog and Paquet do not teach
the student ML system imitates the reference system to transfer learning from the reference system to the student ML system.
Aslan teaches
the student ML system imitates the reference system to transfer learning from the reference system to the student ML system [paragraph 0044, “the first (teacher) model 100 of FIG. 1 can comprise a large, complex ensemble of machine learning models … the second (student) model 102 can comprise a much smaller machine learning model … the second model 102 can be trained to mimic the much larger first model 100 (through learning how to approximate the function learned by the first model 100) without significant loss in accuracy of the second model's 102 output”; abstract, “During the training of the first machine learning model, information can be passed between the first machine learning model and the second machine learning model. Such passing of information (or "transfer of knowledge") between the machine learning models can be accomplished via the formulation, and optimization, of an objective function”; paragraph 0006, “During the training of the first machine learning model, information can be passed between the first machine learning model and the second machine learning model. Such passing of information (or "transfer of knowledge") between the machine learning models allows for one machine learning model to influence the other while the multiple machine learning models are trained in parallel”].
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to modify the method for generating a DNN classifier by learning a student DNN model from a larger teacher DNN model of Li to include the student ML system imitating the reference system to transfer learning from the reference system to the student ML system, as taught by Aslan. Doing so would allow one machine learning model to influence the other while the multiple machine learning models are trained (Aslan, 0051).
Claim 31 is substantially similar to claim 21 and thus rejected for similar reasons as claim 21.
Claim 36 is substantially similar to claim 26 and thus rejected for similar reasons as claim 26.
Claims 22-24, 32-34 are rejected under 35 U.S.C. 103 as being unpatentable over Li et al. in view of Rosswog et al. in view of Paquet et al. in view of Bazrafkan et al. (US Pub. 2018/0211164) and further in view of Chaudhari et al. (US Pub. 2018/0005111).
As per claim 22, Li, Rosswog and Paquet teach the ML computer system of claim 18.
Li further teaches
the student ML system comprises a first neural network, such that the student ML system comprises an input layer, an output layer, and at least a first inner layer between the input and output layers [abstract, “a student DNN”; paragraph 0030, “Like the teacher DNN model, the untrained student DNN model includes a number of hidden layers that may be equal to the number of layers of the teacher”; It can be seen that a deep neural network comprises multiple layers (input, hidden, and output layers)], wherein:
each layer comprises at least one node, such that the first inner layer comprises at least a first node [paragraph 0030, “the student DNN model size, including the number or nodes or parameters for each layer”];
Li, Rosswog and Paquet do not teach
the first node of the first inner layer outputs an activation value for each set of input values to the first node of the first inner layer;
the activation value is determined based on an activation function for the first node of the first inner layer and based on learned parameters for the first node of the first inner layer; and
following iterations of the training of the student ML system, the learned parameters for the first node of the first inner layer are updated.
Bazrafkan teaches
the first node of the first inner layer outputs an activation value [paragraph 0020, “training a target network B”; Fig. 4 shows the target network B includes the input, hidden and output layers, Fig. 4 also shows the first node of the inner layer outputs the values indicating a likelihood of the input image being male or female] for each set of input values to the first node of the first inner layer [Fig. 4, paragraph 0051, one output would represent a likelihood of the input image being male, with the other representing a likelihood of the input image being female. In this case, the targets for these outputs could be 1 and 0 with 1 for male, so causing one of the network B output neurons to fire, and with 0 for female, so causing the other neuron to fire; Fig. 1, paragraph 0029, “Network B is designed to perform gender classification and so it produces a single output (LB) indicating the likelihood of an input image containing for example, a male image”; paragraph 0034, “The loss function LB for network B can for example be a categorical cross-entropy”];
the activation value is determined based on learned parameters for the first node of the first inner layer [paragraph 0038, “In the training process, the loss function error back propagates from network B to network A. This tunes network A to generate the best augmentations for network B that can be produced by network A”; Fig. 1, paragraph 0040 discloses the augmented sample data/training data is fed to network B in the training process; paragraph 0029, “Network B is designed to perform gender classification and so it produces a single output (LB) indicating the likelihood of an input image containing for example, a male image”; where the augmented sample data is one of the learned parameters]; and
following iterations of the training of the student ML system, the learned parameters for the first node of the first inner layer are updated [paragraph 0038, “In the training process, the loss function error back propagates from network B to network A. This tunes network A to generate the best augmentations for network B that can be produced by network A”; paragraphs 0053-0054, “when training a neural network, a batch of data X(T) is given to the network and these are used to train instances of network A in parallel and to subsequently train instances of network B in parallel, the instances of network B being fed (at least partially) with augmented samples generated by the instances of network A from processing the samples of batch X(T) … the total loss … is fed back to network A for the subsequent batch … the network parameters are updated based on the loss function(s) … The parameters for network B (including the learned parameters for the first node, weight parameter for example) are updated based on the loss function for network B”];
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to modify the method for generating a DNN classifier by learning a student DNN model from a larger teacher DNN model of Li to include the student ML system comprising multiple layers, the first node of the first inner layer outputting an activation value, the activation value being determined based on learned parameters for the first node of the first inner layer, and, following iterations of the training of the student ML system, the learned parameters for the first node of the first inner layer being updated, as taught by Bazrafkan. Doing so would help train the neural network based on the updated parameters (Bazrafkan, abstract).
Li, Rosswog, Paquet and Bazrafkan do not teach
the activation value is determined based on an activation function for the first node of the first inner layer;
Chaudhari teaches
the activation value is determined based on an activation function for the first node of the first inner layer [paragraph 0014, “Each neuron node in the first hidden layer may be associated with a corresponding activation function … The activation function corresponding to each neuron node in the first hidden layer may receive as inputs the initial scale parameter from the initial input layer, the bias parameter, the set of training inputs, and a set of weights … The activation function may be executed on the second linear combination and the initial scale parameter to generate an activation result … a respective activation result may be determined for each neuron node in the first hidden layer”; Bazrafkan in Fig. 4 and paragraph 0029 discloses that the first node of the inner layer outputs values indicating a likelihood of the input image being male or female (“Network B is designed to perform gender classification and so it produces a single output (LB) indicating the likelihood of an input image containing for example, a male image”). However, Bazrafkan teaches that the node outputs an activation value but does not explicitly teach that the output is generated using an activation function. Combining Chaudhari adds to Bazrafkan the missing element, namely that the activation value is determined based on an activation function for the first node; therefore, the combination of Bazrafkan and Chaudhari reads on the claim limitation];
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to modify the method for generating a DNN classifier by learning a student DNN model from a larger teacher DNN model of Li (as modified) to include the process of determining the activation value based on an activation function for the first node of the first inner layer, as taught by Chaudhari. Doing so would help generate a set of classifier outputs of the classifier using the activation function and the scale parameter, or provide the activation results as input to a next iteration of the training (Chaudhari, 0003).
As per claim 23, Li, Rosswog, Paquet, Bazrafkan and Chaudhari teach the ML computer system of claim 22.
Paquet further teaches
the reference system comprises a second neural network such that the first and second neural networks have the same architecture [paragraph 0023, “an automated classifier 32, such as an artificial neural network (student ML system) … the automated classifier 32 is developed using a training set 34 … (wherein) The training set 34 may be generated … by utilizing a sample content set 12 prepared by another automated classifier 32”; The examiner interprets the automated classifier 32 that is developed using the training set 34 as the student ML system, and the “another automated classifier 32” that generated the training set as the classifier included in the reference system, and the reference system (comprising another automated classifier 32) has the same architecture as the student ML system (an automated classifier 32)].
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to modify the method for generating a DNN classifier by learning a student DNN model from a larger teacher DNN model of Li to include the reference system comprising a second neural network such that the first and second neural networks have the same architecture, as taught by Paquet. Doing so would help determine the categories of the content items of the training data within an acceptable range (Paquet, 0023).
As per claim 24, Li, Rosswog, Paquet, Bazrafkan and Chaudhari teach the ML computer system of claim 22.
Li further teaches
the observations are a learning behavior and performance of the student ML system [paragraphs 0036-0037, “evaluating component 128 evaluates the output distributions of the student and teacher DNNs, determines the difference (which may be determined as an error signal) between the outputs and also determines whether the student is continuing to improve or whether the student is no longer improving … evaluating component 128 determine whether to complete another iteration … evaluating component 128 determines whether to continue iterating based on whether the student is continuing to show improvement … evaluating component 128 apply a threshold to determine convergence of the teacher DNN and student DNN output distributions. Where the threshold is not satisfied, iteration may continue, thereby further training the student to approximate the teacher. Where the threshold is satisfied, then convergence is determined (indicating the student output distribution is sufficiently close enough to the teacher DNN' s output distribution) and the student DNN may be considered trained … evaluating component 128 evaluates the student DNN according to the methods 500 and 600”; paragraph 0005, “The student DNN may be iteratively optimized until its output converges with the output of the teacher DNN. In this way, the student DNN approaches the behavior of the teacher, so that whatever the output of the teacher, the student will approximate”].
Claim 32 is substantially similar to claim 22 and thus rejected for similar reasons as claim 22.
Claim 33 is substantially similar to claim 23 and thus rejected for similar reasons as claim 23.
Claim 34 is substantially similar to claim 24 and thus rejected for similar reasons as claim 24.
Claims 25 and 35 are rejected under 35 U.S.C. 103 as being unpatentable over Li et al. in view of Rosswog et al. in view of Paquet et al. in view of Bazrafkan et al. in view of Chaudhari et al. in view of Zoph et al. (US Pub. 2019/0251439) and further in view of Li et al. (US Pub. 2018/0240257).
As per claim 25, Li, Rosswog, Paquet, Bazrafkan and Chaudhari teach the ML computer system of claim 22.
Li, Rosswog, Paquet, Bazrafkan and Chaudhari do not teach
a learning coach ML system that is in communication with the student ML system, wherein:
the learning coach ML system has been trained through machine learning to determine one or more revised hyperparameter values for the student ML system;
each of the one or more revised hyperparameter values controls an aspect of the machine learning by the student ML system;
the learning coach ML system comprises an input layer and an output layer;
the learning coach ML system determines the one or more revised hyperparameter values for the student ML system based on an input to the learning coach ML system from the student ML system;
the input to the learning coach ML system from the student ML system is input to the input layer of the learning coach ML system;
the input from the student ML system comprises an internal state observation of the student ML system during training of the student ML system;
the internal state observation of the student ML system comprise values related to the learned parameters for the first node on the first inner layer of the student ML system and the activation value for the first node on the first inner layer of the student ML system; and
the learning coach ML system determines the one or more revised hyperparameter values for the student ML system based on the observations about the first node of the first inner layer of the student ML system from training on the set of additional training data generated by the reference system to improve operation of the student ML system.
Zoph teaches
a learning coach ML system that is in communication with the student ML system [abstract, “generating, using a controller neural network, a batch of output sequences, each output sequence in the batch defining a respective architecture of a child neural network that is configured to perform a particular neural network task; for each output sequence in the batch: training a respective instance of the child neural network having the architecture defined by the output sequence”; Examiner interprets the controller neural network as a learning coach ML system, and interprets a child neural network as a student ML system], wherein:
the learning coach ML system has been trained through machine learning to determine one or more revised hyperparameter values for the student ML system [paragraphs 0027-0029, “The controller neural network 110 is a neural network that has parameters, referred to in this specification as "controller parameters," and that is configured to generate output sequences in accordance with the controller parameters. Each output sequence generated by the controller neural network 110 defines a respective possible architecture for the child neural network … each output sequence includes a respective output at each of multiple time steps and each time step in the output sequence corresponds to a different hyperparameter of the architecture of the child neural network. Thus, each output sequence includes, at each time step, a respective value of the corresponding hyperparameter … determines the architecture for the child neural network by training the controller neural network 110 to adjust the values of the controller parameters”];
each of the one or more revised hyperparameter values controls an aspect of the machine learning by the student ML system [paragraphs 0027-0029, “Each output sequence generated by the controller neural network 110 defines a respective possible architecture for the child neural network … each output sequence includes a respective output at each of multiple time steps and each time step in the output sequence corresponds to a different hyperparameter of the architecture of the child neural network. Thus, each output sequence includes, at each time step, a respective value of the corresponding hyperparameter … determines the architecture for the child neural network by training the controller neural network 110 to adjust the values of the controller parameters”];
the learning coach ML system comprises an input layer and an output layer [paragraphs 0038-0039, “The controller neural network 110 is a recurrent neural network that includes one or more recurrent neural network layers, e.g., layers 220 and 230, that are configured to, for each time step, receive as input the value of the hyperparameter corresponding to the preceding time step … The controller neural network 110 also includes a respective output layer for each time step in the output sequence, e.g., output layers 242-254”];
the learning coach ML system determines the one or more revised hyperparameter values for the student ML system based on an input to the learning coach ML system from the student ML system [abstract, “for each output sequence in the batch: training a respective instance of the child neural network having the architecture defined by the output sequence; evaluating a performance of the trained instance of the child neural network on the particular neural network task to determine a performance metric for the trained instance of the child neural network on the particular neural network task; and using the performance metrics for the trained instances of the child neural network to adjust the current values of the controller parameters of the controller neural network”; paragraph 0030, “For each output sequence in the batch 112, the training engine 120 trains an instance of the child neural network that has the architecture defined by the output sequence on the training data 102 and evaluates the performance of the trained instance on the validation set 104. The controller parameter updating engine 130 then uses the results of the evaluations for the output sequences in the batch 112 to update the current values of the controller parameters to improve the expected performance of the architectures defined by the output sequences generated by the controller neural network 110 on the task”];
the input from the student ML system comprises an internal state observation of the student ML system during training of the student ML system [abstract, “for each output sequence in the batch: training a respective instance of the child neural network having the architecture defined by the output sequence; evaluating a performance of the trained instance of the child neural network on the particular neural network task to determine a performance metric for the trained instance of the child neural network on the particular neural network task; and using the performance metrics for the trained instances of the child neural network to adjust the current values of the controller parameters of the controller neural network”];
the internal state observation of the student ML system comprise values related to the learned parameters for the first node on the first inner layer of the student ML system and the activation value for the first node on the first inner layer of the student ML system [paragraphs 0069-0071, “For each output sequence in the batch, the system evaluates the performance of the corresponding trained instance of the child neural network on the particular neural network task to determine a performance metric for the trained instance on the particular neural network task (step 306). For example, the performance metric can be an accuracy of the trained instance on the validation set as measured by an appropriate accuracy measure. For example, the accuracy can be a classification error rate when the task is a classification task. As another example, the performance metric can be an average or a maximum of the accuracies of the instance the instance for each of the last two, five, or ten epochs of the training of the instance … The system uses the performance metrics for the trained instances to adjust the current values of the controller parameters … the system adjusts the current values by training the controller neural network to generate output sequences that result in child neural networks having increased performance metrics using a reinforcement learning technique. More specifically, the system trains the controller neural network to generate output sequences that maximize a received reward that is determined based on the performance metrics of the trained instances. In particular, the reward for a given output sequence is a function of the performance metric for the trained instance”]; and
the learning coach ML system determines the one or more revised hyperparameter values for the student ML system based on the observations about the first node of the first inner layer of the student ML system from training on the set of additional training data generated by the reference system to improve operation of the student ML system [abstract, “for each output sequence in the batch: training a respective instance of the child neural network having the architecture defined by the output sequence; evaluating a performance of the trained instance of the child neural network on the particular neural network task to determine a performance metric for the trained instance of the child neural network on the particular neural network task; and using the performance metrics for the trained instances of the child neural network to adjust the current values of the controller parameters of the controller neural network”; paragraphs 0028-0031, “For each output sequence in the batch 112, the training engine 120 trains an instance of the child neural network that has the architecture defined by the output sequence on the training data 102 and evaluates the performance of the trained instance on the validation set 104. The controller parameter updating engine 130 then uses the results of the evaluations for the output sequences in the batch 112 to update the current values of the controller parameters to improve the expected performance of the architectures defined by the output sequences generated by the controller neural network 110 on the task … each output sequence includes a respective output at each of multiple time steps and each time step in the output sequence corresponds to a different hyperparameter of the architecture of the child neural network. 
Thus, each output sequence includes, at each time step, a respective value of the corresponding hyperparameter … determines the architecture for the child neural network by training the controller neural network 110 to adjust the values of the controller parameters … repeatedly updating the values of the controller parameters in this manner, the system 100 can train the controller neural network 110 to generate output sequences that result in child neural networks that have increased performance on the particular task”; Since Surazhsky in paragraph 0041 teaches “the training controller 212 may conduct additional training of the convolutional neural network based at least on the performance of the convolutional neural network in processing a mixed training set … The training controller 212 may train the convolutional neural network using additional training data that have been generated by the synthetic image generator 210 and/or the training set generator 216 (reference system) … The training controller 212 may continue to train the convolutional neural network with additional training data until the performance of the convolutional neural network meets a certain threshold value”, while Zoph teaches the controller neural network (the learning coach ML system), based on the performance metrics for the trained instances of the child neural network, adjusts the current values of the controller parameters of the controller neural network to improve the expected performance of the architectures defined by the output sequences (corresponding to a different hyperparameter of the architecture of the child neural network) generated by the controller neural network 110 on the task, and repeatedly updating the values of the controller parameters to increase performance of child neural networks on the particular task, therefore, the combination of Surazhsky and Zoph teaches the claim limitation “the learning coach ML system determines the one or more revised hyperparameter 
values for the student ML system based on the observations about the first node of the first inner layer of the student ML system from training on the set of additional training data generated by the reference system to improve operation of the student ML system”].
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to modify the method for generating a DNN classifier by learning a student DNN model from a larger teacher DNN model of Li to include a learning coach ML system that determines the one or more revised hyperparameter values for the student ML system based on observations about the performance of the student ML system, as taught by Zoph. Doing so would help increase the performance metric of the student ML system using the updated hyperparameter values (Zoph, 0071).
Li, Rosswog, Paquet, Bazrafkan, Chaudhari and Zoph do not teach
the input to the learning coach ML system from the student ML system is input to the input layer of the learning coach ML system;
Li (2018/0240257) teaches
the input to the learning coach ML system [Fig. 2, the loss neural network] from the student ML system [Fig. 2, the generator neural network] is input to the input layer of the learning coach ML system [Fig. 2 shows output 212 (from the generator neural network/student ML system), which is the input to the loss neural network/learning coach ML system, being input to the input layer of the loss neural network/learning coach ML system; paragraph 0047, “the generator neural network 210 outputs an image, referred to herein as an intermediate output 212, based on the input noise vector. The intermediate output 212 then becomes an input to the loss neural network 240. In turn, the loss neural network 240 outputs style features of the intermediate output 212, referred to herein as intermediate style features 242”; Since Bazrafkan teaches in Fig. 1 the network A/learning coach ML system for training a target network B/student ML system, and in paragraph 0036 that output from the network B is used to train the network A, while Li teaches the input from the generator neural network/student ML system to the loss neural network/learning coach ML system is input to the input layer of the loss neural network/learning coach ML system, therefore, the combination of Bazrafkan and Li reads on the claim limitations];
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to modify the method for generating a DNN classifier by learning a student DNN model from a larger teacher DNN model of Li (as modified) to include the teaching of Li (2018/0240257) that the input to the learning coach ML system from the student ML system is input to the input layer of the learning coach ML system. Doing so would help iteratively update the parameters of the generator neural network to minimize the losses (Li, 0004).
As per claim 35, Li, Rosswog, Paquet, Bazrafkan and Chaudhari teach the method of claim 32.
Li, Rosswog, Paquet, Bazrafkan and Chaudhari do not teach
following the initial training of the student ML system further comprises:
receiving, by a learning coach ML system of the computer system, from the student ML system, internal state observations of the student ML system as the student ML system is being trained on the set of training data generated by the reference system, wherein:
the learning coach ML system has been trained through machine learning to determine one or more revised hyperparameter values for the student ML system, wherein each of the one or more revised hyperparameter values controls an aspect of the machine learning by the student ML system;
the learning coach ML system comprises an input layer and an output layer;
the learning coach ML system determines the one or more revised hyperparameter values for the student ML system based on input to the learning coach ML system from the student ML system;
the input to the learning coach ML system from the student ML system is input to the input layer of the learning coach ML system;
the input from the student ML system to the learning coach ML system comprises the internal state observations of the student ML system; and
the observations about the internal state of the student ML system comprise values related to the learned parameters and the activation values of the first node of the first inner layer of the student ML system; and
determining by the learning coach ML system, the one or more revised hyperparameter values for the student ML system based on the observations about the first node of the first inner layer of the student ML system from training on the set of additional training data generated by the reference system to improve operation of the student ML system.
Zoph teaches
following the initial training of the student ML system further comprises:
receiving, by a learning coach ML system of the computer system, from the student ML system, internal state observations of the student ML system as the student ML system is being trained on the set of training data generated by the reference system [abstract, “for each output sequence in the batch: training a respective instance of the child neural network having the architecture defined by the output sequence; evaluating a performance of the trained instance of the child neural network on the particular neural network task to determine a performance metric for the trained instance of the child neural network on the particular neural network task; and using the performance metrics for the trained instances of the child neural network to adjust the current values of the controller parameters of the controller neural network”; paragraphs 0028-0031, “For each output sequence in the batch 112, the training engine 120 trains an instance of the child neural network that has the architecture defined by the output sequence on the training data 102 and evaluates the performance of the trained instance on the validation set 104. 
The controller parameter updating engine 130 then uses the results of the evaluations for the output sequences in the batch 112 to update the current values of the controller parameters to improve the expected performance of the architectures defined by the output sequences generated by the controller neural network 110 on the task”; Since Surazhsky in paragraph 0041 teaches the student ML system is being trained on the training data generated by the reference system “the training controller 212 may conduct additional training of the convolutional neural network based at least on the performance of the convolutional neural network in processing a mixed training set … The training controller 212 may train the convolutional neural network using additional training data that have been generated by the synthetic image generator 210 and/or the training set generator 216 (reference system) … The training controller 212 may continue to train the convolutional neural network with additional training data until the performance of the convolutional neural network meets a certain threshold value”, while Zoph teaches the controller neural network (the learning coach ML system), based on the performance metrics for the trained instances of the child neural network, adjusts the current values of the controller parameters of the controller neural network to improve the expected performance of the architectures defined by the output sequences generated by the controller neural network 110 on the task, and repeatedly updating the values of the controller parameters to increase performance of child neural networks on the particular task, therefore, the combination of Surazhsky and Zoph teaches the claim limitation “receiving … internal state observations of the student ML system as the student ML system is being trained on the set of training data generated by the reference system”], wherein:
the learning coach ML system has been trained through machine learning to determine one or more revised hyperparameter values for the student ML system [paragraphs 0027-0029, “The controller neural network 110 is a neural network that has parameters, referred to in this specification as "controller parameters," and that is configured to generate output sequences in accordance with the controller parameters. Each output sequence generated by the controller neural network 110 defines a respective possible architecture for the child neural network … each output sequence includes a respective output at each of multiple time steps and each time step in the output sequence corresponds to a different hyperparameter of the architecture of the child neural network. Thus, each output sequence includes, at each time step, a respective value of the corresponding hyperparameter … determines the architecture for the child neural network by training the controller neural network 110 to adjust the values of the controller parameters”], wherein each of the one or more revised hyperparameter values controls an aspect of the machine learning by the student ML system [paragraphs 0027-0029, “Each output sequence generated by the controller neural network 110 defines a respective possible architecture for the child neural network … each output sequence includes a respective output at each of multiple time steps and each time step in the output sequence corresponds to a different hyperparameter of the architecture of the child neural network. Thus, each output sequence includes, at each time step, a respective value of the corresponding hyperparameter … determines the architecture for the child neural network by training the controller neural network 110 to adjust the values of the controller parameters”];
the learning coach ML system comprises an input layer and an output layer [paragraphs 0038-0039, “The controller neural network 110 is a recurrent neural network that includes one or more recurrent neural network layers, e.g., layers 220 and 230, that are configured to, for each time step, receive as input the value of the hyperparameter corresponding to the preceding time step … The controller neural network 110 also includes a respective output layer for each time step in the output sequence, e.g., output layers 242-254”];
the learning coach ML system determines the one or more revised hyperparameter values for the student ML system based on input to the learning coach ML system from the student ML system [abstract, “for each output sequence in the batch: training a respective instance of the child neural network having the architecture defined by the output sequence; evaluating a performance of the trained instance of the child neural network on the particular neural network task to determine a performance metric for the trained instance of the child neural network on the particular neural network task; and using the performance metrics for the trained instances of the child neural network to adjust the current values of the controller parameters of the controller neural network”; paragraph 0030, “For each output sequence in the batch 112, the training engine 120 trains an instance of the child neural network that has the architecture defined by the output sequence on the training data 102 and evaluates the performance of the trained instance on the validation set 104. The controller parameter updating engine 130 then uses the results of the evaluations for the output sequences in the batch 112 to update the current values of the controller parameters to improve the expected performance of the architectures defined by the output sequences generated by the controller neural network 110 on the task”];
the input from the student ML system to the learning coach ML system comprises the internal state observations of the student ML system [abstract, “for each output sequence in the batch: training a respective instance of the child neural network having the architecture defined by the output sequence; evaluating a performance of the trained instance of the child neural network on the particular neural network task to determine a performance metric for the trained instance of the child neural network on the particular neural network task; and using the performance metrics for the trained instances of the child neural network to adjust the current values of the controller parameters of the controller neural network”]; and
the observations about the internal state of the student ML system comprise values related to the learned parameters and the activation values of the first node of the first inner layer of the student ML system [paragraphs 0069-0071, “For each output sequence in the batch, the system evaluates the performance of the corresponding trained instance of the child neural network on the particular neural network task to determine a performance metric for the trained instance on the particular neural network task (step 306). For example, the performance metric can be an accuracy of the trained instance on the validation set as measured by an appropriate accuracy measure. For example, the accuracy can be a classification error rate when the task is a classification task. As another example, the performance metric can be an average or a maximum of the accuracies of the instance the instance for each of the last two, five, or ten epochs of the training of the instance … The system uses the performance metrics for the trained instances to adjust the current values of the controller parameters … the system adjusts the current values by training the controller neural network to generate output sequences that result in child neural networks having increased performance metrics using a reinforcement learning technique. More specifically, the system trains the controller neural network to generate output sequences that maximize a received reward that is determined based on the performance metrics of the trained instances. In particular, the reward for a given output sequence is a function of the performance metric for the trained instance”]; and
determining by the learning coach ML system, the one or more revised hyperparameter values for the student ML system based on the observations about the first node of the first inner layer of the student ML system from training on the set of additional training data generated by the reference system to improve operation of the student ML system [abstract, “for each output sequence in the batch: training a respective instance of the child neural network having the architecture defined by the output sequence; evaluating a performance of the trained instance of the child neural network on the particular neural network task to determine a performance metric for the trained instance of the child neural network on the particular neural network task; and using the performance metrics for the trained instances of the child neural network to adjust the current values of the controller parameters of the controller neural network”; paragraphs 0028-0031, “For each output sequence in the batch 112, the training engine 120 trains an instance of the child neural network that has the architecture defined by the output sequence on the training data 102 and evaluates the performance of the trained instance on the validation set 104. The controller parameter updating engine 130 then uses the results of the evaluations for the output sequences in the batch 112 to update the current values of the controller parameters to improve the expected performance of the architectures defined by the output sequences generated by the controller neural network 110 on the task … each output sequence includes a respective output at each of multiple time steps and each time step in the output sequence corresponds to a different hyperparameter of the architecture of the child neural network. 
Thus, each output sequence includes, at each time step, a respective value of the corresponding hyperparameter … determines the architecture for the child neural network by training the controller neural network 110 to adjust the values of the controller parameters … repeatedly updating the values of the controller parameters in this manner, the system 100 can train the controller neural network 110 to generate output sequences that result in child neural networks that have increased performance on the particular task”; Since Surazhsky in paragraph 0041 teaches “the training controller 212 may conduct additional training of the convolutional neural network based at least on the performance of the convolutional neural network in processing a mixed training set … The training controller 212 may train the convolutional neural network using additional training data that have been generated by the synthetic image generator 210 and/or the training set generator 216 (reference system) … The training controller 212 may continue to train the convolutional neural network with additional training data until the performance of the convolutional neural network meets a certain threshold value”, while Zoph teaches the controller neural network (the learning coach ML system), based on the performance metrics for the trained instances of the child neural network, adjusts the current values of the controller parameters of the controller neural network to improve the expected performance of the architectures defined by the output sequences (corresponding to a different hyperparameter of the architecture of the child neural network) generated by the controller neural network 110 on the task, and repeatedly updating the values of the controller parameters to increase performance of child neural networks on the particular task, therefore, the combination of Surazhsky and Zoph teaches the claim limitation “the learning coach ML system determines the one or more revised hyperparameter 
values for the student ML system based on the observations about the first node of the first inner layer of the student ML system from training on the set of additional training data generated by the reference system to improve operation of the student ML system”].
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to modify the method of training a machine learning model of Surazhsky to include a learning coach ML system that determines the one or more revised hyperparameter values for the student ML system based on observations about the performance of the student ML system, as taught by Zoph. Doing so would help increase the performance metric of the student ML system using the updated hyperparameter values (Zoph, 0071).
Li, Rosswog, Paquet, Bazrafkan, Chaudhari and Zoph do not teach
the input to the learning coach ML system from the student ML system is input to the input layer of the learning coach ML system;
Li (2018/0240257) teaches
the input to the learning coach ML system [Fig. 2, the loss neural network] from the student ML system [Fig. 2, the generator neural network] is input to the input layer of the learning coach ML system [Fig. 2 shows output 212 (from the generator neural network/student ML system), which is the input to the loss neural network/learning coach ML system, being input to the input layer of the loss neural network/learning coach ML system; paragraph 0047, “the generator neural network 210 outputs an image, referred to herein as an intermediate output 212, based on the input noise vector. The intermediate output 212 then becomes an input to the loss neural network 240. In turn, the loss neural network 240 outputs style features of the intermediate output 212, referred to herein as intermediate style features 242”; Since Bazrafkan teaches in Fig. 1 the network A/learning coach ML system for training a target network B/student ML system, and in paragraph 0036 that output from the network B is used to train the network A, while Li teaches the input from the generator neural network/student ML system to the loss neural network/learning coach ML system is input to the input layer of the loss neural network/learning coach ML system, therefore, the combination of Bazrafkan and Li reads on the claim limitations];
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to modify the method of training a neural network of Bazrafkan to include the teaching of Li (2018/0240257) that the input to the learning coach ML system from the student ML system is input to the input layer of the learning coach ML system. Doing so would help iteratively update the parameters of the generator neural network to minimize the losses (Li, 0004).
Prior Art
The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure.
Mims (US Patent 7,062,476) describes a method of training a student neural network such that the output of the student approximates the output of the teacher network within a predefined range.
Ghahramani et al. (US Pub. 2018/0060724) describes a method of using an indirect network to generate an expected weight distribution to train a direct network.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to TRI T NGUYEN whose telephone number is 571-272-0103. The examiner can normally be reached M-F, 8 AM-5 PM (CT).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, OMAR FERNANDEZ can be reached at 571-272-2589. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/TRI T NGUYEN/Examiner, Art Unit 2128
/OMAR F FERNANDEZ RIVAS/Supervisory Patent Examiner, Art Unit 2128