Prosecution Insights
Last updated: April 19, 2026
Application No. 17/678,072

GENERATING TRAINING CURRICULA FOR A PLURALITY OF REINFORCEMENT LEARNING CONTROL AGENTS

Non-Final OA (§101, §103)
Filed
Feb 23, 2022
Examiner
SACKALOSKY, COREY MATTHEW
Art Unit
2128
Tech Center
2100 — Computer Architecture & Software
Assignee
International Business Machines Corporation
OA Round
3 (Non-Final)
64%
Grant Probability (Moderate)
3-4
OA Rounds
4y 2m
To Grant
99%
With Interview

Examiner Intelligence

Grants 64% of resolved cases
64%
Career Allow Rate
16 granted / 25 resolved
+9.0% vs TC avg
Strong +49% interview lift
+49.4%
Interview Lift
(comparing resolved cases with an examiner interview to those without)
Typical timeline
4y 2m
Avg Prosecution
39 currently pending
Career history
64
Total Applications
across all art units

Statute-Specific Performance

§101: 42.0% (+2.0% vs TC avg)
§103: 38.0% (-2.0% vs TC avg)
§102: 12.9% (-27.1% vs TC avg)
§112: 7.1% (-32.9% vs TC avg)
Tech Center averages are estimates • Based on career data from 25 resolved cases
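How to read the figures above, as a small worked example: subtracting each stated delta from the examiner's share gives the implied Tech Center average, assuming the deltas are simple percentage-point differences (an assumption about how this tool computes them; the page does not say).

```python
# Implied Tech Center averages, assuming each "vs TC avg" delta is a plain
# percentage-point difference (assumption; figures copied from the chart above).
examiner_share = {"101": 42.0, "103": 38.0, "102": 12.9, "112": 7.1}
delta_vs_tc    = {"101": 2.0, "103": -2.0, "102": -27.1, "112": -32.9}

for statute, share in examiner_share.items():
    tc_avg = share - delta_vs_tc[statute]   # e.g. §101: 42.0 - 2.0 = 40.0
    print(f"§{statute}: examiner {share:.1f}% vs implied TC avg {tc_avg:.1f}%")
```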

Office Action

§101 §103
DETAILED ACTION

This Office Action is in response to the RCE filed on 02/10/2026. Claims 1, 4, 6, 7, and 15 are currently amended. Claims 1-20 are currently pending in this application and have been examined.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Interpretation

Independent Claims 1, 7, and 15 recite language stating: "constructing a simulated environment of reinforcement learning (RL) control agents reflecting an operating environment of real control agents in a computer system". Applicant has not defined what a "real control agent" is in their specification, and as such the Examiner is interpreting this limitation to mean simulating a real-world environment wherein the "real" control agents are akin to human operators in said environment.

Response to Arguments

In reference to Applicant's arguments on pages 8-10 regarding rejections made under 35 U.S.C. 101:

The Office Action rejects claims 1-20 as being allegedly "directed to an abstract idea without significantly more" (Office Action, page 7). The rejections of the claims are respectfully traversed. Applicant respectfully disagrees with this analysis. Under the USPTO's 2019 Guidance, for a claim to be directed to non-statutory subject matter, it must recite the abstract idea (Prong One), and the subject matter must not be integrated into a practical application (Prong Two). Applicant respectfully submits, for reasons which will now be discussed, using claim 1 as an example, that the present claims do not satisfy the requirements of Prong One or Prong Two, and that therefore the claims are not directed to an abstract idea.

Applicant submits that the claims, as amended, do not fall within the "performance of the limitation in the mind (including an observation, evaluation, judgement, or opinion) or with the aid of pencil and paper" as stated in the Office Action on page 8. It is respectfully submitted that the claims have been amended to be more clearly directed to patent-eligible subject matter, and that claims 1-20 are allowable under the standards of 35 U.S.C. 101. Specifically, in part, amended claim 1 states "constructing a simulated environment of reinforcement learning (RL) control agents reflecting an operating environment of real control agents in a computer system" and "constructing a subset of simulated environments for incremental training of the RL control agents utilizing only the set of training curricula until the control criterion is satisfied." "Constructing a simulated environment" is an element that is incapable of being reproduced within a human mind. Therefore, Applicant submits that even if claim 1 were directed to an abstract idea, the alleged abstract idea does not fall within any of the subject matter groupings of abstract ideas enumerated in the 2019 Guidance (i.e., "Mathematical concepts", "Certain methods of organizing human activity", or "Mental Processes"). For at least this reason, claim 1 does not recite an abstract idea, failing Prong One of the USPTO's required analysis. Accordingly, for at least the reasons described above, Applicant respectfully submits that the limitations of the pending claims, when analyzed as a whole, are patent-eligible under 35 U.S.C. § 101. Therefore, Applicant respectfully requests that the rejections of claims 1-20 under 35 U.S.C. § 101 be withdrawn.
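As a reading aid only: the workflow the applicant describes above (training curricula generated from an operating parameter and per-agent training policies, then incremental training in simulated environments until a control criterion is satisfied) can be pictured concretely. The sketch below is a minimal hypothetical illustration under invented names and data structures; it is not taken from the application, the Office Action, or the cited references.

```python
# Hypothetical sketch of the disputed workflow (invented names/structures, not
# from the application or the cited art): curricula are generated from (i) an
# operating parameter and (ii) per-agent training policies, then each agent is
# trained incrementally in simulated environments until a control criterion
# (a minimum score) is satisfied.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Curriculum:
    agent_id: int
    stages: List[float]          # ordered task difficulties, easy to hard

def generate_curricula(operating_param: float,
                       policies: Dict[int, float]) -> List[Curriculum]:
    """Scale each agent's difficulty ramp by the system operating parameter
    and that agent's training-policy aggressiveness."""
    return [
        Curriculum(agent_id, [operating_param * aggressiveness * (k + 1) / 5
                              for k in range(5)])
        for agent_id, aggressiveness in policies.items()
    ]

def train_until_criterion(curricula: List[Curriculum],
                          make_env: Callable[[float], object],
                          run_training_step: Callable[[object, int], float],
                          criterion: float) -> Dict[int, float]:
    """Construct one simulated environment per curriculum stage and train
    each agent incrementally, stopping once the control criterion is met."""
    scores: Dict[int, float] = {}
    for curriculum in curricula:                 # curricula handed to agents
        for difficulty in curriculum.stages:
            env = make_env(difficulty)           # subset of simulated envs
            scores[curriculum.agent_id] = run_training_step(env, curriculum.agent_id)
            if scores[curriculum.agent_id] >= criterion:
                break                            # control criterion satisfied
    return scores
```

A caller would bind `make_env` and `run_training_step` to its own simulator and RL trainer; both are stand-ins here.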
Examiner's response: Applicant's arguments have been fully considered but are found to be not persuasive. Applicant argues that the abstract idea of "constructing a simulated environment" cannot be performed in the human mind or with the aid of pencil and paper. Examiner agrees; however, that limitation was not flagged as being an abstract idea but rather was flagged as indicating a particular field of use or technological environment in which the judicial exception is performed. The limitation that was flagged as being an abstract idea was the limitation stating, in part: "generating a set of training curricula". Creating a set of training constraints based on predefined parameters is an abstract idea that can be performed in the human mind since it encompasses viewing data and, based on an operating parameter of the system and a training policy, creating a training constraint. The rejections made under 35 U.S.C. 101 are maintained.

In reference to Applicant's arguments on pages 10-14 regarding rejections made under 35 U.S.C. 103:

The Examiner bears the burden of establishing a prima facie case of obviousness based on prior art when rejecting claims under 35 U.S.C. § 103. In re Fritch, 972 F.2d 1260, 23 U.S.P.Q.2d 1780 (Fed. Cir. 1992). The prior art reference (or references when combined) must teach or suggest all the claim limitations. In re Royka, 490 F.2d 981, 180 USPQ 580 (CCPA 1974). Applicant respectfully asserts that claim 1 is not obvious over Czarnecki in view of Zhou. The asserted combination does not teach or suggest the features of claim 1, as amended. Claim 1 is not unpatentable over Czarnecki and Zhou, alone or in combination, because the asserted combination does not teach or suggest the features of "constructing a simulated environment of reinforcement learning (RL) control agents reflecting an operating environment of real control agents in a computer system." Thus, Czarnecki teaches "simulated environment and the agent is implemented as one or more computers interacting with the simulated environment". However, Czarnecki does not teach or suggest "constructing a simulated environment and the agent is implemented as one or more computers interacting with the simulated environment" and "constructing a subset of simulated environments for incremental training of the RL control agents utilizing only the set of training curricula until the control criterion is satisfied" as is required by claim 1. Dependent claims 2-7, 9-14, and 16-20 ultimately depend from independent claims 1, 8, and 15; therefore, the dependent claims 2-7, 9-14, and 16-20 are allowable at least by virtue of their dependency on allowable independent claims.

Examiner's response: Applicant's arguments have been fully considered but are moot in light of the amendments made to the claims. Applicant argues that the prior art references do not teach the claims as amended. Examiner agrees.

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows: Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-20 are rejected under 35 U.S.C. 101 because they are directed to an abstract idea without significantly more.

Step 1 analysis: Independent Claim 1 recites, in part, a computer system, therefore falling into the statutory category of machine.
Independent Claim 7 recites, in part, a computer implemented method, therefore falling into the statutory category of process. Independent Claim 15 recites, in part, a computer program product, therefore falling into the statutory category of machine.

Regarding Claim 1:

Step 2A: Prong 1 analysis: Claim 1 recites in part: "generating a set of training curricula based, at least in part, on (i) at least one operating parameter of the computer system and (ii) at least one training policy for the first RL control agent and the RL control agents that are not the first RL control agent". As drafted and under its broadest reasonable interpretation, this limitation covers performance of the limitation in the mind (including an observation, evaluation, judgement, or opinion) or with the aid of pencil and paper. For example, this limitation encompasses generating a set of data based on predefined parameters. Accordingly, at Step 2A: Prong 1, the claim is directed to an abstract idea.

Step 2A: Prong 2 analysis: The judicial exception is not integrated into a practical application. In particular, the claim recites the additional elements of:

"a processor set". This additional element is recited at a high level of generality such that it amounts to no more than mere instructions to apply the exception using a generic computer component (processor) (See MPEP 2106.05(f)).

"one or more computer readable storage media". This additional element is recited at a high level of generality such that it amounts to no more than mere instructions to apply the exception using a generic computer component (storage) (See MPEP 2106.05(f)).

"program instructions stored on the one or more computer readable storage media to cause the processor set to perform operations". This additional element is recited at a high level of generality such that it amounts to no more than mere instructions to apply the exception using a generic computer component (instructions) (See MPEP 2106.05(f)).

"constructing a simulated environment of reinforcement learning (RL) control agents reflecting an operating environment of real control agents in a computer system". This limitation merely indicates a field of use or technological environment in which the judicial exception is performed (computer simulations) and thus fails to add an inventive concept to the claims. See MPEP 2106.05(h).

"obtaining system data describing at least one operating parameter of a first real control agent of the real control agents based, at least in part, on a first RL control agent of the RL control agents failing to satisfy a control criterion for the computer system in the simulated environment". This additional element is recited at a high level of generality and amounts to extra-solution activity of gathering data, i.e., pre-solution activity of gathering data for use in the claimed process.

"in the simulated environment". This limitation merely indicates a field of use or technological environment in which the judicial exception is performed (computer simulations) and thus fails to add an inventive concept to the claims. See MPEP 2106.05(h).

"communicate the set of training curricula to the RL control agents". This additional element is recited at a high level of generality and amounts to extra-solution activity of gathering data, i.e., pre-solution activity of gathering data for use in the claimed process.
"constructing a subset of simulated environments for incremental training of the RL control agents utilizing only the set of training curricula until the control criterion is satisfied". This limitation merely indicates a field of use or technological environment in which the judicial exception is performed (computer simulations) and thus fails to add an inventive concept to the claims. See MPEP 2106.05(h).

Accordingly, at Step 2A: Prong 2, the additional elements individually or in combination do not integrate the judicial exception into a practical application.

Step 2B analysis: In accordance with Step 2B, the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above, the additional element(s) of "a processor set", "one or more computer readable storage media", and "program instructions stored on the one or more computer readable storage media to cause the processor set to perform operations" is/are recited at a high level of generality such that it/they amount(s) to no more than mere instructions to apply the exception using generic computer components (See MPEP 2106.05(f)). The additional element(s) of "constructing a simulated environment of reinforcement learning (RL) control agents reflecting an operating environment of real control agents in a computer system" and "constructing a subset of simulated environments for incremental training of the RL control agents utilizing only the set of training curricula until the control criterion is satisfied" is/are directed to particular field(s) of use (computer simulations) (MPEP 2106.05(h)) and therefore do not provide significantly more than the abstract idea, and thus the claim is subject-matter ineligible. The additional element(s) of "obtaining system data describing at least one operating parameter of a first real control agent of the real control agents based, at least in part, on a first RL control agent of the RL control agents failing to satisfy a control criterion for the computer system in the simulated environment" and "communicate the set of training curricula to the RL control agents" is/are recited at a high level of generality and amount(s) to extra-solution activity of receiving data, i.e., pre-solution activity of gathering data for use in the claimed process. The courts have found limitations directed to obtaining information electronically, recited at a high level of generality, to be well-understood, routine, and conventional (see MPEP 2106.05(d)(II), "receiving or transmitting data over a network", "electronic record keeping," and "storing and retrieving information in memory"). Accordingly, at Step 2B, the additional elements individually or in combination do not amount to significantly more than the judicial exception.

Regarding Claim 2:

Step 2A: Prong 1 analysis: Claim 2 recites in part:

"translating system data into a first set of Markov Decision Processes (MDPs)". As drafted and under its broadest reasonable interpretation, this limitation covers a mathematical calculation.

"translating the at least one training policy into a second set of MDPs". As drafted and under its broadest reasonable interpretation, this limitation covers a mathematical calculation.

"determining the set of training curricula based, at least in part, on the first set of MDPs and the second set of MDPs".
As drafted and under its broadest reasonable interpretation, this limitation covers performance of the limitation in the mind (including an observation, evaluation, judgement, or opinion) or with the aid of pencil and paper. For example, this limitation encompasses creating a set of training data based on sets of decision processes. Accordingly, at Step 2A: Prong 1, the claim is directed to an abstract idea.

Step 2A: Prong 2 analysis: The claim does not recite any additional elements that integrate the judicial exception into a practical application.

Step 2B analysis: In accordance with Step 2B, the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.

Regarding Claim 3:

Step 2A: Prong 1 analysis: Claim 3 recites in part:

"combining MDPs of the first set of MDPs and MDPs of the second set of MDPs into vectors". As drafted and under its broadest reasonable interpretation, this limitation covers performance of the limitation in the mind (including an observation, evaluation, judgement, or opinion) or with the aid of pencil and paper. For example, this limitation encompasses creating a set of vectors based on data.

"generating a graph of MDPs based, at least in part, on the vectors". As drafted and under its broadest reasonable interpretation, this limitation covers performance of the limitation in the mind (including an observation, evaluation, judgement, or opinion) or with the aid of pencil and paper. For example, this limitation encompasses creating a graph based on a set of vectors.

"linking MDPs based, at least in part, on the generated graph". As drafted and under its broadest reasonable interpretation, this limitation covers performance of the limitation in the mind (including an observation, evaluation, judgement, or opinion) or with the aid of pencil and paper. For example, this limitation encompasses creating a link between vectors on a graph.

Accordingly, at Step 2A: Prong 1, the claim is directed to an abstract idea.

Step 2A: Prong 2 analysis: The claim does not recite any additional elements that integrate the judicial exception into a practical application.

Step 2B analysis: In accordance with Step 2B, the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.

Regarding Claim 4:

Step 2A: Prong 1 analysis: Claim 4 recites in part:

"for a first control agent of the plurality of reinforcement learning control agents, factoring one or more MDPs of the first set of MDPs and the second set of MDPs into an adjusted and rearranged sequence of MDPs". As drafted and under its broadest reasonable interpretation, this limitation covers performance of the limitation in the mind (including an observation, evaluation, judgement, or opinion) or with the aid of pencil and paper. For example, this limitation encompasses rearranging existing MDPs.

"generating, based, at least in part, on the adjusted and rearranged sequence of MDPs, a training curriculum for the first control agent". As drafted and under its broadest reasonable interpretation, this limitation covers performance of the limitation in the mind (including an observation, evaluation, judgement, or opinion) or with the aid of pencil and paper. For example, this limitation encompasses generating a set of data based on rearranged MDPs.

Accordingly, at Step 2A: Prong 1, the claim is directed to an abstract idea.
Step 2A: Prong 2 analysis: The claim does not recite any additional elements that integrate the judicial exception into a practical application.

Step 2B analysis: In accordance with Step 2B, the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.

Regarding Claim 5:

Step 2A: Prong 2 analysis: The judicial exception is not integrated into a practical application. In particular, the claim recites the additional elements of: "a reward signal representing agent performance; a predicted performance; a safety requirement; and an uncertainty threshold". This limitation merely indicates a field of use or technological environment in which the judicial exception is performed (reinforcement learning) and thus fails to add an inventive concept to the claims. See MPEP 2106.05(h). Accordingly, at Step 2A: Prong 2, the additional elements individually or in combination do not integrate the judicial exception into a practical application.

Step 2B analysis: In accordance with Step 2B, the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The additional element(s) of "a reward signal representing agent performance; a predicted performance; a safety requirement; and an uncertainty threshold" is/are directed to particular field(s) of use (reinforcement learning) (MPEP 2106.05(h)) and therefore do not provide significantly more than the abstract idea, and thus the claim is subject-matter ineligible. Accordingly, at Step 2B, the additional elements individually or in combination do not amount to significantly more than the judicial exception.

Regarding Claim 6:

Step 2A: Prong 2 analysis: The judicial exception is not integrated into a practical application. In particular, the claim recites the additional elements of: "wherein the operations further comprise: training the RL control agents according to the training curricula". This additional element is recited at a high level of generality such that the claim recites only the idea of a solution or outcome, i.e., the claim fails to recite details of how a solution to a problem is accomplished. Accordingly, at Step 2A: Prong 2, the additional elements individually or in combination do not integrate the judicial exception into a practical application.

Step 2B analysis: In accordance with Step 2B, the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above, the additional element(s) of "wherein the operations further comprise: training the plurality of reinforcement learning control agents according to the training curricula" is/are recited at a high level of generality such that the claim recites only the idea of a solution or outcome, i.e., the claim fails to recite details of how a solution to a problem is accomplished (See MPEP 2106.05(f)). Accordingly, at Step 2B, the additional elements individually or in combination do not amount to significantly more than the judicial exception.

Regarding Claim 7: Due to claim language similar to that of Claim 1, Claim 7 is rejected for the same reasons as presented above in the rejection of Claim 1.

Regarding Claim 8: Due to claim language similar to that of Claim 2, Claim 8 is rejected for the same reasons as presented above in the rejection of Claim 2.
Regarding Claim 9: Due to claim language similar to that of Claim 3, Claim 9 is rejected for the same reasons as presented above in the rejection of Claim 3.

Regarding Claim 10: Due to claim language similar to that of Claim 4, Claim 10 is rejected for the same reasons as presented above in the rejection of Claim 4.

Regarding Claim 11: Due to claim language similar to that of Claim 5, Claim 11 is rejected for the same reasons as presented above in the rejection of Claim 5.

Regarding Claim 12:

Step 2A: Prong 2 analysis: The judicial exception is not integrated into a practical application. In particular, the claim recites the additional elements of: "wherein the generating of the set of training curricula utilizes at least one of the plurality of reinforcement learning control agents". This additional element is recited at a high level of generality such that the claim recites only the idea of a solution or outcome, i.e., the claim fails to recite details of how a solution to a problem is accomplished. Accordingly, at Step 2A: Prong 2, the additional elements individually or in combination do not integrate the judicial exception into a practical application.

Step 2B analysis: In accordance with Step 2B, the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above, the additional element(s) of "wherein the generating of the set of training curricula utilizes at least one of the plurality of reinforcement learning control agents" is/are recited at a high level of generality such that the claim recites only the idea of a solution or outcome, i.e., the claim fails to recite details of how a solution to a problem is accomplished (See MPEP 2106.05(f)). Accordingly, at Step 2B, the additional elements individually or in combination do not amount to significantly more than the judicial exception.

Regarding Claim 13:

Step 2A: Prong 2 analysis: The judicial exception is not integrated into a practical application. In particular, the claim recites the additional elements of: "wherein the generating of the set of training curricula utilizes a teacher agent adapted to receive system data from the plurality of reinforcement learning control agents". This additional element is recited at a high level of generality such that the claim recites only the idea of a solution or outcome, i.e., the claim fails to recite details of how a solution to a problem is accomplished. Accordingly, at Step 2A: Prong 2, the additional elements individually or in combination do not integrate the judicial exception into a practical application.

Step 2B analysis: In accordance with Step 2B, the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above, the additional element(s) of "wherein the generating of the set of training curricula utilizes a teacher agent adapted to receive system data from the plurality of reinforcement learning control agents" is/are recited at a high level of generality such that the claim recites only the idea of a solution or outcome, i.e., the claim fails to recite details of how a solution to a problem is accomplished (See MPEP 2106.05(f)). Accordingly, at Step 2B, the additional elements individually or in combination do not amount to significantly more than the judicial exception.

Regarding Claim 14: Due to claim language similar to that of Claim 6, Claim 14 is rejected for the same reasons as presented above in the rejection of Claim 6.
Regarding Claim 15: Due to claim language similar to that of Claims 1 and 7, Claim 15 is rejected for the same reasons as presented above in the rejection of Claims 1 and 7, with the exception of the limitation(s) covered below.

Step 2A: Prong 2 analysis: The judicial exception is not integrated into a practical application. In particular, the claim recites the additional elements of: "A computer program product comprising one or more computer readable storage media and program instructions collectively stored on the one or more computer readable storage media to perform operations". This additional element is recited at a high level of generality such that it amounts to no more than mere instructions to apply the exception using a generic computer component (processors) (See MPEP 2106.05(f)). Accordingly, at Step 2A: Prong 2, the additional elements individually or in combination do not integrate the judicial exception into a practical application.

Step 2B analysis: In accordance with Step 2B, the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above, the additional element(s) of "A computer program product comprising one or more computer readable storage media and program instructions collectively stored on the one or more computer readable storage media to perform operations" is/are recited at a high level of generality such that it/they amount(s) to no more than mere instructions to apply the exception using generic computer components (See MPEP 2106.05(f)). Accordingly, at Step 2B, the additional elements individually or in combination do not amount to significantly more than the judicial exception.

Regarding Claim 16: Due to claim language similar to that of Claims 2 and 8, Claim 16 is rejected for the same reasons as presented above in the rejection of Claims 2 and 8.

Regarding Claim 17: Due to claim language similar to that of Claims 3 and 9, Claim 17 is rejected for the same reasons as presented above in the rejection of Claims 3 and 9.

Regarding Claim 18: Due to claim language similar to that of Claims 4 and 10, Claim 18 is rejected for the same reasons as presented above in the rejection of Claims 4 and 10.

Regarding Claim 19: Due to claim language similar to that of Claims 5 and 11, Claim 19 is rejected for the same reasons as presented above in the rejection of Claims 5 and 11.

Regarding Claim 20: Due to claim language similar to that of Claims 6 and 14, Claim 20 is rejected for the same reasons as presented above in the rejection of Claims 6 and 14.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status. The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.

Claim(s) 1, 6, 7, 12, 14, 15, and 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Czarnecki et al (US 11113605 B2, hereinafter Czarnecki) in view of Zhou et al (US 20230330846 A1, hereinafter Zhou).
Regarding Claim 1: Czarnecki teaches A computer system comprising: a processor set (Czarnecki [Col 13 lines 32-36]: "The term “data processing apparatus” refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers"); one or more computer readable storage media (Czarnecki [Col 13 lines 16-22]: "Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non transitory storage medium for execution by, or to control the operation of, data processing apparatus"); program instructions stored on the one or more computer readable storage media to cause the processor set to perform operations (Czarnecki [Col 13 lines 16-22]: "Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non transitory storage medium for execution by, or to control the operation of, data processing apparatus") constructing a simulated environment of reinforcement learning (RL) control agents reflecting an operating environment of real control agents in a computer system (Czarnecki [Col 5 lines 4-12]: “In some implementations the environment is a simulated environment and the agent is implemented as one or more computers interacting with the simulated environment. The simulated environment may be a motion simulation environment, e.g., a driving simulation or a flight simulation, and the agent may be a simulated vehicle navigating through the motion simulation. 
In these implementations, the actions may be control inputs to control the simulated user or simulated vehicle”); obtaining system data describing at least one operating parameter of a first real control agent of the real control agents based, at least in part, on a first RL control agent of the RL control agents failing to satisfy a control criterion for the computer system in the simulated environment (Czarnecki [Col 12 lines 59-64]: "The system can continue to repeat the process 400 until criteria for updating the weights in the mixing data are satisfied, e.g., a certain amount of time has elapsed, a certain amount of training iterations have been performed, or until the final policy network achieves an acceptable level of accuracy on the reinforcement learning task"); generating a set of training curricula based, at least in part, on (i) at least one operating parameter of the computer system and (i) at least one training policy for the first RL control agent and the RL control agents that are not the first RL control agent Czarnecki [Col 2 lines 51-58]: "generating the combined action selection policy may comprise processing the training network input using each of the candidate agent policy neural networks to generate a respective action selection policy (output) for each candidate agent policy neural network, and combining the action selection policies in accordance with the weights as of the training iteration to generate the combined action selection policy."); communicating the set of training curricula to the RL control agents (Czarnecki [Col 2 lines 51-58]: "generating the combined action selection policy may comprise processing the training network input using each of the candidate agent policy neural networks to generate a respective action selection policy (output) for each candidate agent policy neural network, and combining the action selection policies in accordance with the weights as of the training iteration to generate the combined action selection policy."; (EN): processing the training network input using each of the agent policy NNs is analogous to communicating the training curricula to the reinforcement agents) Czarnecki does not distinctly disclose constructing a subset of simulated environments for incremental training of the RL control agents utilizing only the set of training curricula until the control criterion is satisfied. However, Zhou teaches constructing a subset of simulated environments for incremental training of the RL control agents utilizing only the set of training curricula until the control criterion is satisfied (Zhou [0050]: “Thus, given an evolution of the state of the environment in a plurality of time steps (“an episode”), the respective domains can be used to produce respective sequences of observations, where each observation in the sequence for a given domain corresponds to a respective one of the time steps and is generated from the state of the environment at the corresponding time step in accordance with the domain. 
Here the term “evolution” is used to include both an incremental evolution with an increment for each time step, and a continuous evolution which is observed at intervals”; [0090]: "More specifically, the system controls the agent while the policy neural network is conditioned on policy inputs that each include an embedding of the goal demonstration observation only until a training observation is received for which the similarity between the embedding of the training observation and the embedding of the goal demonstration observation satisfies a first criterion"). Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to combine the methods, systems, and apparatus for reinforcement learning using agent curricula of Czarnecki with the systems and methods that train a policy neural network that is used to control a robot of Zhou. The systems and methods presented in Zhou are beneficial for Czarnecki in that they allow for the creation of virtual or simulated environments for training to be performed in before letting the robot interact with the real world (Zhou [0045]: "In some other cases, during this training, the training engine 160 controls a simulated version of the real robot 102 (or multiple different simulated versions of the real robot 102) in a computer simulation of the real-world environment 104.") Regarding Claim 6: Czarnecki teaches The computer system of claim 1, wherein the one or more computer processors are further configured to train the RL control agents according to the training curricula (Czarnecki [Col 3 lines 13-19]: "By using a curriculum over agents in training as described in this specification (i.e., by adjusting the weights as described in this specification), a complex agent can learn, i.e., a complex agent policy selection neural network can be trained, to perform a reinforcement learning task using fewer computational resources and less training time than conventional approaches"). Regarding Claim 7: Due to claim language similar to that of Claim 1, Claim 7 is rejected for the same reasons as presented above in the rejection of Claim 1. Regarding Claim 12: Czarnecki teaches The computer-implemented method of claim 7, wherein the generating of the set of training curricula utilizes at least one of the plurality of reinforcement learning control agents (Czarnecki [Col 3 lines 13-19]: "By using a curriculum over agents in training as described in this specification (i.e., by adjusting the weights as described in this specification), a complex agent can learn, i.e., a complex agent policy selection neural network can be trained, to perform a reinforcement learning task using fewer computational resources and less training time than conventional approaches"). Regarding Claim 14: Due to claim language similar to that of Claim 6, Claim 14 is rejected for the same reasons as presented above in the rejection of Claim 6. Regarding Claim 15: Due to claim language similar to that of Claims 1 and 7, Claim 15 is rejected for the same reasons as presented above in the rejection of Claims 1 and 7, with the exception of the limitation covered below. 
Czarnecki teaches A computer program product comprising one or more computer readable storage media and program instructions collectively stored on the one or more computer readable storage media to perform operations (Czarnecki [Col 13 lines 16-22]: "Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non transitory storage medium for execution by, or to control the operation of, data processing apparatus") Regarding Claim 20: Due to claim language similar to that of Claims 6 and 14, Claim 20 is rejected for the same reasons as presented above in the rejection of Claims 6 and 14. Claim Rejections - 35 USC § 103 Claim(s) 2, 8, and 16 is/are rejected under 35 U.S.C. 103 as being unpatentable over Czarnecki and Zhou as applied to claims 1, 7, and 20 above, and further in view of Bouton et al (US 20210271988 A1, hereinafter Bouton). Regarding Claim 2: Czarnecki + Zhou does not distinctly disclose The computer system of claim 1, wherein generating the set of training curricula comprises: translating system data into a first set of Markov Decision Processes (MDPs); translating the at least one training policy into a second set of MDPs determining the set of training curricula based, at least in part, on the first set of MDPs and the second set of MDPs. However, Bouton teaches The computer system of claim 1, wherein the one or more computer processors are further configured to generate the set of training curricula by: translating system data into a first set of Markov Decision Processes (MDPs) (Bouton [0039]: "According to one aspect, Markov decision processes and reinforcement learning may be implemented to address the decision making problem, along with level-k behavior modeling, which inspired the design of the curriculum learning strategy"; [0041]: "Sequential decision making processes may be modeled as Markov Decision Processes (MDPs). MDPs may be defined by the tuple (S, A, T, R, γ) where S is a state space, A is an action space, T is a transition model, R is a reward function, and γ is a discount factor. An agent may choose an action a E A in a given state s and receives a reward r=R(s, a). The environment may then transition into a state s′ according to the distribution Pr(s′|s, a)=T (s, a, s′)."); translating the at least one training policy into a second set of MDPs (Bouton [0039]: "According to one aspect, Markov decision processes and reinforcement learning may be implemented to address the decision making problem, along with level-k behavior modeling, which inspired the design of the curriculum learning strategy"; [0041]: "Sequential decision making processes may be modeled as Markov Decision Processes (MDPs). MDPs may be defined by the tuple (S, A, T, R, γ) where S is a state space, A is an action space, T is a transition model, R is a reward function, and γ is a discount factor. An agent may choose an action a E A in a given state s and receives a reward r=R(s, a). 
The environment may then transition into a state s′ according to the distribution Pr(s′|s, a)=T (s, a, s′)."); determining the set of training curricula based, at least in part, on the first set of MDPs and the second set of MDPs (Bouton [0039]: "According to one aspect, Markov decision processes and reinforcement learning may be implemented to address the decision making problem, along with level-k behavior modeling, which inspired the design of the curriculum learning strategy"). Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to combine the methods, systems, and apparatus for reinforcement learning using agent curricula of Czarnecki + Zhou with the system for reinforcement learning with iterative reasoning of Bouton. The systems and methods presented in Bouton are beneficial for Czarnecki + Zhou in that they allow the use of the iterative procedure of Bouton in order to enable other agents to learn robust training policies (Bouton [0038]: "By contrast, the iterative procedure provided by the architecture of FIG. 1 to change the behavior of other agents in the environment enables the learning of robust policies") Regarding Claim 8: Due to claim language similar to that of Claim 2, Claim 8 is rejected for the same reasons as presented above in the rejection of Claim 2. Regarding Claim 16: Due to claim language similar to that of Claims 2 and 8, Claim 16 is rejected for the same reasons as presented above in the rejection of Claims 2 and 8. Claim Rejections - 35 USC § 103 Claim(s) 3, 5, 9, 11, and 19 is/are rejected under 35 U.S.C. 103 as being unpatentable over Czarnecki and Zhou as applied to claims 1, 7, and 15 above, and further in view of Levchuk et al (US 20110016067 A1, hereinafter Levchuk) and Cavness et al (US 20200040272 A1, hereinafter Cavness). Regarding Claim 3: Czarnecki + Zhou does not distinctly disclose The computer system of claim 2, wherein determining the set of training curricula based, at least in part, on the first set of MDPs and the second set of MDPs comprises: combining MDPs of the first set of MDPs and MDPs of the second set of MDPs into vectors; generating a graph of MDPs based, at least in part, on the vectors; linking MDPs based, at least in part, on the generated graph. However, Levchuk teaches The computer system of claim 2, wherein the one or more computer processors are further configured to determine the set of training curricula based, at least in part, on the first set of MDPs and the second set of MDPs by: combining MDPs of the first set of MDPs and MDPs of the second set of MDPs into vectors (Levchuk [0087]: "This algorithm constructs sets of vectors for each action individually and then focuses on every observation one at a time. The algorithm finds all different combinations of future strategies, while not using the region calculation."); generating a graph of MDPs based, at least in part, on the vectors (Levchuk [0064]: "In one embodiment, an example of which is shown in FIG. 
5, POMDP solution is represented as a deterministic transition graph, in which the nodes, 501, 502, 503, 504 and 505, correspond to the beliefs about the true state of the expertise and are associated with the training action."); linking MDPs based, at least in part, on the generated graph (Levchuk [0104]: "Using the POMDP policy graph, a trainer picks a starting node (initial assessment of the team knowledge), executes the instructional scenario associated with the current node, receives the performance measures (observation) on the team, selects an instructional scenario (transitions to the next node) based on the observation, and then repeats. The training process is repeated until the node is reached without any outgoing transition links or until a threshold is met."). Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to combine the methods, systems, and apparatus for reinforcement learning using agent curricula of Czarnecki + Zhou with the Markov Decision Process graph of Levchuk. The systems and methods presented in Bouton are beneficial for Czarnecki + Zhou in that they allow for the ability to link the various states of the MDP in the graph with the various agents to visualize the process (Levchuk [0038]: " In embodiments, the POMDP captures the dynamic nature of team and individual skills via the Markov decision process graph. Within the graph, a single finite discrete variable indexes the current team expertise state, and external actions control expertise changes."). Regarding Claim 5: Czarnecki + Zhou does not distinctly disclose The computer system of claim 1, wherein the control criterion is selected from the group consisting of: a reward signal representing agent performance; a predicted performance; an uncertainty threshold. However, Levchuk teaches The computer system of claim 1, wherein the control criterion is selected from the group consisting of: a reward signal representing agent performance (Levchuk [0025]: "the reward function further comprises at least one number where each number represents a benefit of subject attaining the expertise state given the training treatment"); a predicted performance (Levchuk [0049]: "This policy can also determine the observations that may reflect the state of the subject and it can also predict the effect of the actions on the subject."); an uncertainty threshold (Levchuk [0099]: "The model can be made to stop iterating when a threshold is met such as stopping when no improvement in the objective function of expected reward is obtained."). Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to combine the methods, systems, and apparatus for reinforcement learning using agent curricula of Czarnecki + Zhou with the Markov Decision Process graph of Levchuk. The systems and methods presented in Bouton are beneficial for Czarnecki + Zhou in that they allow for the ability to link the various states of the MDP in the graph with the various agents to visualize the process (Levchuk [0038]: " In embodiments, the POMDP captures the dynamic nature of team and individual skills via the Markov decision process graph. Within the graph, a single finite discrete variable indexes the current team expertise state, and external actions control expertise changes."). 
Czarnecki + Zhou + Levchuk does not distinctly disclose a safety requirement; However, Cavness teaches a safety requirement (Cavness [0038]: "As shown, the flare mitigation system further 100 comprises a MC system 180, which is generally adapted to maintain processing conditions within acceptable operational constraints throughout the system. Such constraints may be determined by economic, practical, and/or safety requirements.") Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to combine the methods, systems, and apparatus for reinforcement learning using agent curricula of Czarnecki + Zhou + Levchuk with the designated safety precautions of Cavness. The systems and methods presented in Bouton are beneficial for Czarnecki + Zhou + Levchuk in that they allow for the ability to use any number of safety precautions present in Cavness and apply those safety requirements to wastewater management systems (Cavness [0066]: "It will be appreciated that the separator module 210 may further comprise any number of safety valves adapted to direct flow to a safe and contained area upon over pressurization of the vessel."). Regarding Claim 9: Due to claim language similar to that of Claim 3, Claim 9 is rejected for the same reasons as presented above in the rejection of Claim 3. Regarding Claim 11: Due to claim language similar to that of Claim 5, Claim 11 is rejected for the same reasons as presented above in the rejection of Claim 5. Regarding Claim 19: Due to claim language similar to that of Claims 5 and 11, Claim 19 is rejected for the same reasons as presented above in the rejection of Claims 5 and 11. Conclusion The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. US 12154029 B2 – A method of training an action selection neural network for controlling an agent interacting with an environment to perform different tasks US 20220358388 A1 – Methods and systems for generating an environment include training transformer models from tabular data and relationship information about the training data US 20200285995 A1 – a learning system for multi-agent applications Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. Any inquiry concerning this communication or earlier communications from the examiner should be directed to COREY M SACKALOSKY whose telephone number is (703)756-1590. The examiner can normally be reached M-F 7:30am-3:30pm EST. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. 
To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Omar Fernandez Rivas can be reached at (571) 272-2589. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /COREY M SACKALOSKY/Examiner, Art Unit 2128 /OMAR F FERNANDEZ RIVAS/Supervisory Patent Examiner, Art Unit 2128
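A note on the MDP-related dependent claims (2-3 and their method and computer-program-product counterparts) addressed in the rejection above: they recite translating system data and training policies into sets of MDPs, combining MDPs into vectors, and generating and linking a graph of MDPs. The sketch below is a hypothetical illustration of how such steps might fit together; the descriptor fields and the nearest-neighbor linking rule are invented here and are not drawn from the application or the cited references.

```python
# Hypothetical sketch of the MDP-centric steps recited in claims 2-3 (invented
# structures, not the application's implementation): represent system data and
# training policies as MDP descriptors, embine each into a feature vector, and
# link similar MDPs in a graph that can order a training curriculum.
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, List, Tuple

@dataclass(frozen=True)
class MDPSpec:
    name: str
    features: Tuple[float, ...]   # e.g., state-space size, reward scale, horizon

def to_vector(spec: MDPSpec) -> List[float]:
    """Combine an MDP descriptor into a plain feature vector."""
    return list(spec.features)

def build_mdp_graph(specs: List[MDPSpec], max_dist: float) -> Dict[str, List[str]]:
    """Generate a graph of MDPs and link those whose vectors are close;
    the Euclidean-distance rule is an invented stand-in for a linking rule."""
    graph: Dict[str, List[str]] = {s.name: [] for s in specs}
    for a, b in combinations(specs, 2):
        dist = sum((x - y) ** 2 for x, y in zip(to_vector(a), to_vector(b))) ** 0.5
        if dist <= max_dist:
            graph[a.name].append(b.name)
            graph[b.name].append(a.name)
    return graph

# Claim 2's two "translations", mocked up as one descriptor derived from system
# data and two derived from training policies, combined into one graph.
system_mdps = [MDPSpec("observed-load", (10.0, 1.0, 50.0))]
policy_mdps = [MDPSpec("easy-goal", (8.0, 1.0, 40.0)),
               MDPSpec("hard-goal", (30.0, 2.0, 200.0))]
curriculum_graph = build_mdp_graph(system_mdps + policy_mdps, max_dist=15.0)
# -> {"observed-load": ["easy-goal"], "easy-goal": ["observed-load"], "hard-goal": []}
```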

Prosecution Timeline

Feb 23, 2022
Application Filed
Apr 15, 2025
Non-Final Rejection — §101, §103
Jun 30, 2025
Interview Requested
Jul 29, 2025
Response Filed
Oct 28, 2025
Final Rejection — §101, §103
Dec 09, 2025
Interview Requested
Jan 07, 2026
Response after Non-Final Action
Feb 10, 2026
Request for Continued Examination
Feb 23, 2026
Response after Non-Final Action
Feb 26, 2026
Non-Final Rejection — §101, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12596932
METHOD AND SYSTEM FOR DEPLOYMENT OF PREDICTION MODELS USING SKETCHES GENERATED THROUGH DISTRIBUTED DATA DISTILLATION
2y 5m to grant Granted Apr 07, 2026
Patent 12591759
PARALLEL AND DISTRIBUTED PROCESSING OF PROPOSITIONAL LOGICAL NEURAL NETWORKS
2y 5m to grant Granted Mar 31, 2026
Patent 12572441
FULLY UNSUPERVISED PIPELINE FOR CLUSTERING ANOMALIES DETECTED IN COMPUTERIZED SYSTEMS
2y 5m to grant Granted Mar 10, 2026
Patent 12518197
INCREMENTAL LEARNING WITHOUT FORGETTING FOR CLASSIFICATION AND DETECTION MODELS
2y 5m to grant Granted Jan 06, 2026
Patent 12487763
METHOD AND APPARATUS WITH MEMORY MANAGEMENT AND NEURAL NETWORK OPERATION
2y 5m to grant Granted Dec 02, 2025
Study what changed to get past this examiner. Based on 5 most recent grants.

AI Strategy Recommendation

Get an AI-powered prosecution strategy using examiner precedents, rejection analysis, and claim mapping.
Powered by AI — typically takes 5-10 seconds

Prosecution Projections

3-4
Expected OA Rounds
64%
Grant Probability
99%
With Interview (+49.4%)
4y 2m
Median Time to Grant
High
PTA Risk
Based on 25 resolved cases by this examiner. Grant probability derived from career allow rate.
