DETAILED ACTION
Claims 1-3, 7-9, 11-14, and 17-21 are presented for examination.
This Office action is in response to the applicant's submission filed on 22-SEPTEMBER-2025.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 11-JUNE-2025 has been entered.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 14-APRIL-2022 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Response to Amendment
The amendment filed 22-SEPTEMBER-2025 in response to the previous Office action mailed 11-AUGUST-2025 has been entered. Claims 1-3, 7-9, 11-14, and 17-21 remain pending in the application.
With regard to the rejections under 35 U.S.C. 103 in the non-final Office action, the amendments to the claims necessitated a new consideration of the art. After this consideration, the examiner respectfully disagrees with the applicant's arguments that the art referenced in the previous Office action does not teach the amended claim limitations. A new rejection under 35 U.S.C. 103 over the prior art is set forth in the Claim Rejections section below.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-3, 7-9, 11-14, and 17-21 are rejected under 35 U.S.C. 103 as being unpatentable over El-Tantawy et al. (Pub. No. WO 2013086629 A1, filed December 10, 2012, hereinafter El-Tantawy) in view of Palanisamy et al. (Pub. No. US 20200033868 A1, filed July 27, 2018, hereinafter Palanisamy).
Regarding claim 1:
Claim 1 recites:
A computer-implemented system for learning an optimized interacting set of operational policies for implementation by a plurality of lower level agents, each lower level agent being capable of learning an operational policy of the interacting set of operational policies, the system comprising a first framework sub-system and a second framework sub-system, wherein the first framework sub-system comprises a higher level agent and the second framework sub-system comprises the plurality of lower level agents and a simulated environment, wherein the higher level agent is configured to: for the simulated environment, modify both reward functions and transition functions of a stochastic game undertaken by the plurality of lower level agents in the simulated environment of the second framework sub-system; and provide the modified reward and transition functions to the plurality of lower level agents, each of the lower level agents observing a current state of the simulated environment, selecting an action according to its current policy, and receiving an individual reward determined by the respective reward function, wherein the actions selected by the plurality of lower level agents influence a next state in accordance with the transition functions; receive, upon termination of the stochastic game, first feedback from the plurality of lower level agents, the first feedback comprising an outcome of the stochastic game that depends on the learned operational policies; and iteratively update the reward functions and the transition functions based on second feedback from the plurality of lower level agents before generating a subsequent simulated environment, wherein a sequence of simulated environments and updates continuing until a predetermined level of convergence indicative of the optimized interacting set of operational policies is reached, wherein the second framework sub-system is configured to: assign an initial operational policy to each of the plurality of lower level agents of the second framework sub-system, and update the initial operational policies for the plurality of lower level agents based on the second feedback.
Regarding the limitation a computer-implemented system for learning an optimized interacting set of operational policies for implementation by a plurality of lower level agents, each lower level agent being capable of learning an operational policy of the interacting set of operational policies
El-Tantawy teaches a reinforcement learning system for multiple agents in a stochastic game, with the goal being to optimize traffic signal agents (Paragraph 44) such that an equilibrium of the agents' policies is reached (Paragraph 50). As such, the optimized traffic signal agents' policies would be an optimized interacting set of operational policies for implementation by multiple agents, where each of the traffic signal agents is capable of learning an operational policy of the interacting set of operational policies.
Regarding the use of lower level agents and a hierarchical agent structure, Palanisamy teaches this aspect of the limitation further below.
Regarding the limitation for the simulated environment, modify both reward functions and transition functions of a stochastic game undertaken by the plurality of lower level agents in the simulated environment of the second framework sub-system:
El-Tantawy teaches that the agents receive rewards that are dependent on the current state and the joint action selected by the agents (Paragraph 48). As such, the reward function is modified within the stochastic game undertaken by a plurality of agents. As this occurs within the stochastic game, it would be in a simulated environment.
Furthermore, El-Tantawy teaches a state-action pair with a Q-factor that represents an expected long-run cumulative reward, which is updated upon a state transition (Paragraph 45). This is analogous to modifying a transition function, as it describes the relationship between the actions chosen by the agents and state transitions.
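By way of illustration only, and not as a characterization of the reference's actual implementation, the kind of Q-factor update described in Paragraph 45 of El-Tantawy, in which the expected long-run cumulative reward of a state-action pair is updated upon a state transition, may be sketched as standard tabular Q-learning (all identifiers below are hypothetical):

```python
from collections import defaultdict

# Tabular Q-factors: Q[(state, action)] estimates the expected
# long-run cumulative reward for that state-action pair.
Q = defaultdict(float)

ALPHA = 0.1   # learning rate
GAMMA = 0.95  # discount factor on long-run rewards

def update_q_factor(state, action, reward, next_state, next_actions):
    """Update the Q-factor of (state, action) upon a state transition."""
    best_next = max((Q[(next_state, a)] for a in next_actions), default=0.0)
    # Temporal-difference step toward the observed reward plus the
    # discounted value of the best available action in the next state.
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
```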
El-Tantawy does not teach the second framework sub-system or lower level agents, which are taught further below by Palanisamy.
El-Tantawy discloses provide the modified reward and transition functions to the plurality of lower level agents, each of the lower level agents observing a current state of the simulated environment, selecting an action according to its current policy, and receiving an individual reward determined by the respective reward function:
As discussed above, El-Tantawy teaches that the agents receive rewards that are dependent on the current state and the joint action selected by the agents (Paragraph 48), and teaches a state-action pair with a Q-factor that represents an expected long-run cumulative reward which is updated upon a state transition (Paragraph 45). These correspond to the modified reward and transition functions, wherein these functions are used by the agents to select actions based on the current state according to the current policy, and wherein knowledge of the current state demonstrates observation of the current state of the simulated environment (Paragraph 48).
Furthermore, each agent receives an individual reward through an individual reward function (Paragraph 53).
El-Tantawy discloses wherein the actions selected by the plurality of lower level agents influence a next state in accordance with the transition functions;
El-Tantawy has previously taught a transition function as part of its system. Furthermore, El-Tantawy teaches that the new state of the game is dependent on the action selected by the agents (Paragraph 48). While this action is a joint action, each agent contributes its own decision to the joint action, which therefore represents the totality of the agents' actions.
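For clarity of the mapping, a single round of a stochastic game of the kind the examiner reads in Paragraph 48 of El-Tantawy, in which each agent observes the state, acts per its current policy, and receives an individual reward while the joint action drives the transition, may be sketched as follows (a minimal illustration with hypothetical identifiers, not the reference's code):

```python
def play_round(state, agents, policies, transition_fn, reward_fns):
    """One round: each agent observes the current state, selects an
    action according to its current policy, and receives an individual
    reward; the joint action determines the next state."""
    # Each agent's own selection contributes to the joint action.
    joint_action = tuple(policies[agent](state) for agent in agents)
    # The next state depends on the current state and the joint action,
    # in accordance with the (possibly stochastic) transition function.
    next_state = transition_fn(state, joint_action)
    # Each agent's individual reward is determined by its own reward
    # function of the current state and the joint action.
    rewards = {agent: reward_fns[agent](state, joint_action) for agent in agents}
    return next_state, rewards
```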
El-Tantawy discloses receive, upon termination of the stochastic game, first feedback from the plurality of lower level agents, the first feedback comprising an outcome of the stochastic game that depends on the learned operational policies
El-Tantawy teaches that each round an agent broadcasts feedback to its neighbors, wherein each agent performs the action (Paragraph 48). Therefore, first feedback from the plurality of lower level agents is received, wherein this happens at the end of a round, which would be the end of a stochastic game (Paragraph 48). Furthermore, this feedback includes information on policies, wherein the agent selects policies from the feedback based on particular criteria (Paragraph 66). Therefore, the first feedback comprises an outcome of the stochastic game that depends on the learned operational policies.
Regarding the limitation iteratively update the reward functions and the transition functions based on second feedback from the plurality of lower level agents, before generating a subsequent simulated environment:
El-Tantawy teaches that the agents receive rewards that are dependent on the current state and the joint action selected by the agents, wherein the next state is generated based on the joint action and the current state (Paragraph 48). The joint action selected by the agents affecting the reward would be analogous to feedback from the second framework sub-system updating the reward function, as the actions of the agents that make up the second framework sub-system influence the reward. Furthermore, the generation of the next state would be the generation of a subsequent simulated environment.
Furthermore, El-Tantawy teaches a state-action pair with a Q-factor that represents an expected long-run cumulative reward, which is updated upon a state transition (Paragraph 45). This is analogous to a transition function, as it describes the relationship between the actions chosen by the agents and state transitions.
Furthermore, El-Tantawy teaches that its process may be repeated for a new state and continued for a finite or infinite number of iterations (Paragraph 48). This constitutes iterative updating.
Within El-Tantawy, a higher level agent is not used for this limitation, and the action is instead performed by the agents themselves. However, as Palanisamy, discussed below, teaches a hierarchical agent structure, using the iterative updating structure of El-Tantawy in combination with Palanisamy would have been obvious, as it would have provided the advantage of addressing further use cases while improving reliability across multiple environments, as discussed in Paragraph 4 of Palanisamy.
El-Tantawy discloses wherein a sequence of simulated environments and updates continuing until a predetermined level of convergence indicative of the optimized interacting set of operational policies is reached:
El-Tantawy teaches that an agent may iteratively receive a feedback reward for its generated control policy for the environment as a whole, and not merely for the agent's individual intersection (Paragraphs 26-27). This would be a sequence of simulated environments, in the form of the intersections, and updates, in the form of the feedback rewards, continuing until a predetermined level of convergence indicative of the optimized interacting set of operational policies is reached.
However, El-Tantawy does not teach the system comprising a first framework sub-system and a second framework sub-system, wherein the first framework sub-system comprises a higher level agent and the second framework sub-system comprises the plurality of lower level agents and a simulated environment, wherein the higher level agent is configured to:
Palanisamy teaches a driving policy module that generates parameters describing the one or more policies (Paragraph 14) and contains driving policy learning modules for generating and improving policies based on collective experiences in the experience memory (Paragraph 68). The generation and improvement of policies would include the modification of the reward functions previously taught by El-Tantawy. Therefore, this would be analogous to a first framework sub-system comprising a higher level agent configured to modify both the reward functions and the transition functions of a stochastic game undertaken by the plurality of lower level agents in the simulated environment. Furthermore, the experience memory would be equivalent to feedback from the plurality of lower level agents, as both constitute knowledge of the lower level agents.
Furthermore, Palanisamy teaches a driver agent module that contains one or more driver agents (Paragraph 14). This would be analogous to the second framework sub-system comprising a plurality of lower level agents, since the "one or more" driver agents of Palanisamy encompass a plurality of agents.
Furthermore, El-Tantawy does not teach wherein the second framework sub-system is configured to: assign an initial operational policy to each of the plurality of lower level agents of the second framework sub-system, and update the initial operational policies for the plurality of lower level agents based on the second feedback.
Palanisamy teaches that, when the system is initialized, driver agents and driving policy learning modules are instantiated, with the driver agents processing policy parameters to execute one or more policies (Paragraph 89). This would be analogous to the lower level agents being assigned an initial operational policy.
Palanisamy teaches that the lower level driver agents gather driving experiences, which are pooled together to form the experience memory, or the shared feedback (Paragraph 85). Furthermore, this experience memory stores the driving experiences that are used to improve the policies of the lower level driver agents (Paragraph 87) that they are initialized with (Paragraph 89). Therefore, Palanisamy teaches using feedback to update the initial operational policies for a plurality of lower level agents.
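As an illustration of the pooled-experience arrangement relied upon in Paragraphs 85-89 of Palanisamy, the following minimal sketch shows lower level agents contributing experiences to a shared memory from which their initially assigned policies are then improved (hypothetical identifiers, not drawn from the reference):

```python
experience_memory = []  # shared pool of per-agent experiences

def pool_experiences(agents):
    """Gather each lower level agent's experiences (the shared feedback)."""
    for agent in agents:
        experience_memory.extend(agent.collect_experiences())

def update_policies(agents, learn_step):
    """Improve the initially assigned policies from the pooled memory."""
    for agent in agents:
        for experience in experience_memory:
            learn_step(agent, experience)
```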
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to implement a system that utilized the teachings of El-Tantawy and the teachings of Palanisamy. This would have provided the advantage of being able to address further use cases while improving reliability throughout multiple environments, as discussed in Paragraph 4 of Palanisamy.
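To summarize the examiner's reading of the combined teachings against the claimed arrangement, the two-level framework of claim 1 may be sketched as the following outer loop. This is an illustration of the claim language only, not of either reference's implementation; every identifier is hypothetical:

```python
def learn_policies(higher_agent, lower_agents, make_env, converged):
    """Hypothetical outer loop of the claimed two-level framework."""
    env = make_env()
    while True:
        # The higher level agent modifies both the reward functions and
        # the transition functions of the stochastic game.
        reward_fns, transition_fn = higher_agent.modify_functions(env)
        # The lower level agents play the game to termination; the first
        # feedback comprises the outcome of the game.
        outcome = env.run_game(lower_agents, reward_fns, transition_fn)
        # Second feedback drives the iterative update of the functions
        # before a subsequent simulated environment is generated.
        higher_agent.update_functions(outcome.second_feedback)
        if converged(outcome):
            return outcome.policies
        env = make_env()  # subsequent simulated environment
```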
Regarding claim 2, which depends upon claim 1:
Claim 2 recites:
The computer-implemented system as claimed in claim 1, wherein the first framework sub-system is configured to update the reward functions and the transition functions based on the modification of the reward functions and the transition functions.
El-Tantawy in view of Palanisamy discloses the system of claim 1 upon which claim 2 depends. Furthermore, regarding the limitation of claim 2:
El-Tantawy teaches that the agents receive rewards that are dependent on the current state and the joint action selected by the agents (Paragraph 48). Furthermore, El-Tantawy teaches a state-action pair with a Q-factor that represents an expected long-run cumulative reward, which is updated upon a state transition (Paragraph 45). This is analogous to a transition function, as it describes the relationship between the actions chosen by the agents and state transitions. As the rewards are modified by the joint action and then received by the agents, they would be updated based on the modification of the reward functions and the transition functions.
Regarding claim 3, which depends upon claim 1:
Claim 3 recites:
The computer-implemented system as claimed in claim 1, wherein the higher level agent is a higher level reinforcement learning agent and the second framework sub-system is implemented as a multi-agent system, wherein the behavior of each individual lower level agent of the plurality of lower level agents in a multi-agent system, is driven by multi-agent reinforcement learning.
El-Tantawy in view of Palanisamy discloses the system of claim 1 upon which claim 3 depends. However, El-Tantawy does not teach wherein the higher level agent is a higher level reinforcement learning agent and the second framework sub-system is implemented as a multi-agent system:
Palanisamy teaches a driving policy module that generates parameters describing the one or more policies (Paragraph 14) and contains driving policy learning modules for generating and improving policies (Paragraph 68). This would be analogous to a higher level reinforcement learning agent.
Furthermore, Palanisamy teaches a driver agent module that contains one or more driver agents (Paragraph 14). This would be analogous to the second framework sub-system being implemented as a multi-agent system.
Furthermore, El-Tantawy does not teach wherein the behavior of each individual lower level agent of the plurality of lower level agents in a multi-agent system, is driven by multi-agent reinforcement learning:
Palanisamy teaches that the driver agents each have an associated deep reinforcement learning algorithm through their associated driving policy learner modules, which generate actions for controlling the vehicle (Paragraph 16). Therefore, the behavior of each individual lower level agent of the plurality of lower level agents in a multi-agent system is driven by multi-agent reinforcement learning.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to implement a system that utilized the teachings of El-Tantawy and the teachings of Palanisamy. This would have provided the advantage of allowing improved operational policies to be generated in a MARL framework, as per paragraph 13 of the present application's specification.
Regarding claim 7, which depends upon claim 1:
Claim 7 recites:
The computer-implemented system as claimed in claim 1, wherein the second framework sub-system is a multi-agent system, wherein the multi-agent system is configured to reach an equilibrium.
El-Tantawy in view of Palanisamy discloses the system of claim 1 upon which claim 7 depends. Furthermore, regarding the limitation of claim 7:
El-Tantawy teaches that each agent’s objective is to find a joint policy, or in other words an equilibrium (Paragraph 50). This joint policy would be the multi-agent system’s equilibrium.
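For clarity of the record, the examiner understands the equilibrium of Paragraph 50 of El-Tantawy in the conventional sense: a joint policy $\pi^* = (\pi_1^*, \ldots, \pi_n^*)$ is an equilibrium when no single agent can improve its expected value by unilaterally deviating, i.e.,

```latex
V_i\bigl(\pi_i^{*}, \pi_{-i}^{*}\bigr) \;\geq\; V_i\bigl(\pi_i, \pi_{-i}^{*}\bigr)
\quad \text{for every agent } i \text{ and every alternative policy } \pi_i,
```

where $\pi_{-i}^{*}$ denotes the equilibrium policies of all agents other than agent $i$.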
Regarding claim 8, which depends upon claim 1:
Claim 8 recites:
The computer-implemented system as claimed in claim 1, wherein the first framework sub-system is configured to modify the reward functions and the transition functions using gradient-based methods.
El-Tantawy in view of Palanisamy discloses the system of claim 1 upon which claim 8 depends. However, El-Tantawy does not teach the limitation of claim 8:
Palanisamy teaches that its reinforcement learning algorithm is gradient-based (Paragraph 10). Furthermore, El-Tantawy teaches a state-action pair with a Q-factor that represents an expected long-run cumulative reward, which is updated upon a state transition (Paragraph 45); this is analogous to a transition function, as it describes the relationship between the actions chosen by the agents and state transitions.
As reinforcement learning is the mechanism by which the reward functions are modified, the combination teaches wherein the first framework sub-system is configured to modify the reward functions and the transition functions using gradient-based methods.
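For clarity, the examiner understands gradient-based methods in the conventional sense of parameter updates along the gradient of an objective: for parameters $\theta$ of a reward or transition function and an objective $J$,

```latex
\theta_{k+1} \;=\; \theta_k \;+\; \alpha \, \nabla_{\theta} J(\theta_k), \qquad \alpha > 0,
```

where $\alpha$ is a step size.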
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to implement a system that utilized the teachings of El-Tantawy and the teachings of Palanisamy. This would have provided the advantage of greater data efficiency and shorter training times, as per paragraph 18 of the present application’s specification.
Regarding claim 9, which depends upon claim 1:
Claim 9 recites:
The computer-implemented system as claimed in claim 1, wherein the first framework sub-system has at least one objective external to objective(s) of the plurality of lower level agents of the second framework sub-system.
El-Tantawy in view of Palanisamy discloses the system of claim 1 upon which claim 9 depends. However, El-Tantawy does not teach the limitation of claim 9:
Palanisamy teaches that the driving policy generation module attempts to learn a globally optimal policy that can drive well under all scenarios (Paragraph 91). This would be an objective external to those of the plurality of lower level agents, as each individual lower level agent is limited to its own environment.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to implement a system that utilized the teachings of El-Tantawy and the teachings of Palanisamy. This would have provided the advantage of enabling the system to induce a broad range of desired outcomes, as per paragraph 49 of the present application’s specification.
Regarding claim 11, which depends upon claim 1:
Claim 11 recites:
The computer-implemented system as claimed in claim 1, wherein the first framework sub-system is further configured to assess whether the updates to the reward functions and the transition functions have produced a set of optimal policies.
El-Tantawy in view of Palanisamy discloses the system of claim 1 upon which claim 11 depends. However, El-Tantawy does not teach the limitation of claim 11:
Palanisamy teaches the learning of a globally optimal policy that can drive well under all different scenarios (Paragraph 91). The creation of this globally optimal policy would involve assessing whether the updates to the reward functions and the transition functions have produced a set of optimal policies, as it uses the knowledge of the agents to construct a "true" optimal policy.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to implement a system that utilized the teachings of El-Tantawy and the teachings of Palanisamy. This would have provided the advantage of the learning process leading to optimal policies that can be used in the real world, as per paragraph 21 of the present application's specification.
Regarding claim 12, which depends upon claim 1:
Claim 12 recites:
The computer-implemented system as claimed in claim 1, wherein the first framework sub-system is configured to generate a sequence of unseen environments.
El-Tantawy in view of Palanisamy discloses the system of claim 1 upon which claim 12 depends. Furthermore, regarding the limitation of claim 12:
El-Tantawy teaches that the traffic environments that are created do not require a model of the environment (Paragraph 48). An environment without an associated model would be an unseen environment.
Regarding claim 13, which depends upon claim 1:
Claim 13 recites:
The computer-implemented system as claimed in claim 1, wherein the stochastic game is a Markov game.
El-Tantawy in view of Palanisamy discloses the system of claim 1 upon which claim 13 depends. Furthermore, regarding the limitation of claim 13:
El-Tantawy teaches the use of a Markov game (Paragraph 48).
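For clarity of the record, a Markov (stochastic) game is conventionally specified as a tuple

```latex
\mathcal{G} \;=\; \bigl(\mathcal{N},\; \mathcal{S},\; \{\mathcal{A}_i\}_{i\in\mathcal{N}},\; P,\; \{r_i\}_{i\in\mathcal{N}}\bigr),
```

where $\mathcal{N}$ is the set of agents, $\mathcal{S}$ the state space, $\mathcal{A}_i$ agent $i$'s action set, $P(s' \mid s, a_1, \ldots, a_n)$ the transition function over joint actions, and $r_i$ agent $i$'s individual reward function; the examiner reads Paragraph 48 of El-Tantawy in this sense.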
Regarding claim 14, which depends upon claim 1:
Claim 14 recites:
The computer-implemented system as claimed in claim 1, wherein the plurality of lower level agents of the second framework sub-system are at least partially autonomous vehicles and the policies are driving policies.
El-Tantawy in view of Palanisamy discloses the system of claim 1 upon which claim 14 depends. However, El-Tantawy does not teach the limitation of claim 14:
Palanisamy teaches a driving policy learner module and an agent that controls a vehicle (Paragraph 6). The driving policy learner module would be for a policy that is a driving policy, while an agent that controls a vehicle would be an agent that is an at least partially autonomous vehicle.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to implement a system that utilized the teachings of El-Tantawy and the teachings of Palanisamy. This would have provided the advantage of applying the invention to traffic simulations that may allow for traffic light patterns to be changed in a manner that would eliminate or reduce the need for tolls, as per paragraph 24 in the present application’s specification.
Regarding claim 17, which depends upon claim 15:
Claim 17 recites:
The computer-implemented system as claimed in claim 15, wherein the second framework sub-system is configured to perform an iterative machine learning process comprising repeatedly updating the operational policies until a predetermined level of convergence is reached.
El-Tantawy in view of Palanisamy discloses the system of claim 15 upon which claim 17 depends. Furthermore, regarding the limitation of claim 17:
El-Tantawy teaches that an agent iteratively receives feedback for its actions and adjusts its control policy until it converges to an optimal control policy (Paragraph 27).
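By way of illustration only, the iterative adjust-until-convergence process the examiner reads in Paragraph 27 of El-Tantawy may be sketched as follows (hypothetical identifiers, not drawn from the reference):

```python
def learn_until_convergence(agent, env, tol=1e-3, max_iters=10_000):
    """Repeatedly act, receive feedback, and adjust the control policy
    until successive updates fall below a predetermined tolerance."""
    for _ in range(max_iters):
        feedback = env.step(agent.act())
        change = agent.adjust_policy(feedback)  # magnitude of the update
        if change < tol:
            break  # predetermined level of convergence reached
    return agent.policy
```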
Regarding claim 18, which depends upon claim 1:
Claim 18 recites:
The computer-implemented system as claimed in claim 1, wherein the second framework sub-system is configured to generate the feedback based on performance of the plurality of lower level agents in the simulated environment.
El-Tantawy in view of Palanisamy discloses the system of claim 1 upon which claim 18 depends. El-Tantawy in view of Palanisamy has previously taught lower level agents. Furthermore, regarding the limitation of claim 18:
El-Tantawy teaches that agents may learn from previous agents' experiences (Paragraph 30). The previous agents' experiences would be analogous to feedback generated based on the performance of the plurality of agents in the simulated environment.
Claim 19 recites a method that parallels the system of claim 1. Therefore, the analysis discussed above with respect to claim 1 also applies to claim 19. Accordingly, claim 19 is rejected based on substantially the same rationale as set forth above with respect to claim 1.
Claim 20 recites a non-transitory computer readable storage medium that explicitly parallels the method of claim 19. Therefore, the analysis discussed above with respect to claim 19 also applies to claim 20. Accordingly, claim 20 is rejected based on substantially the same rationale as set forth above with respect to claim 19.
Regarding claim 21, which depends upon claim 7:
Claim 21 recites:
The computer-implemented system as claimed in claim 7, wherein the equilibrium is a Nash equilibrium enabling the second framework sub-system to reach a stable state during training of the plurality of lower level agents.
El-Tantawy in view of Palanisamy discloses the system of claim 7 upon which claim 21 depends. El-Tantawy in view of Palanisamy has previously taught lower level agents. Furthermore, regarding the limitation of claim 21:
El-Tantawy teaches a plurality of agents finding a joint policy using a Nash equilibrium (Paragraph 50). The joint policy would be analogous to a stable state for the plurality of agents.
Response to Arguments
Applicant's arguments filed 22-SEPTEMBER-2025 have been fully considered, but they are not fully persuasive.
Regarding the applicant’s remarks on the non-final office action’s 103 rejection of the claims, the applicant argues that El-Tantawy in view of Palanisamy does not teach the amended limitations of these claims. As such, the applicant argues that all claims dependent on the above would additionally not be obvious under 103. However, the examiner believes that El-Tantawy in view of Palanisamy does teach both the amended and original limitations and respectfully requests applicant’s consideration of the following:
Regarding amended claims 1 and 19:
Regarding the limitation for the simulated environment, modify both reward functions and transition functions of a stochastic game undertaken by the plurality of lower level agents in the simulated environment of the second framework sub-system, the examiner maintains that El-Tantawy teaches this limitation for the reasons set forth in the rejection of claim 1 above (see the discussion of Paragraphs 45 and 48 of El-Tantawy), with Palanisamy relied upon for the second framework sub-system and the lower level agents.
The applicant further argues that El-Tantawy in view of Palanisamy does not disclose provide the modified reward and transition functions to the plurality of lower level agents, each of the lower level agents observing a current state of the simulated environment, selecting an action according to its current policy, and receiving an individual reward determined by the respective reward function. However, the examiner maintains that this limitation, and the related limitations argued by the applicant, are taught for the reasons set forth in the rejection of claim 1 above:
In brief, El-Tantawy teaches the modified reward and transition functions (Paragraphs 45 and 48), individual rewards received through individual reward functions (Paragraph 53), actions that influence the next state in accordance with the transition functions (Paragraph 48), first feedback comprising an outcome of the stochastic game that depends on the learned operational policies (Paragraphs 48 and 66), and iterative updating of the reward and transition functions over a finite or infinite number of iterations (Paragraphs 45 and 48), while Palanisamy supplies the hierarchical agent structure with the higher level agent and the plurality of lower level agents (Paragraph 4 of Palanisamy). The full mapping is set forth in the rejection of claim 1 above and is not repeated here.
The applicant also argues that the prior art does not teach updating the environments until convergence is reached for an optimized set of operational policies. However, the examiner maintains that El-Tantawy discloses the claimed wherein a sequence of simulated environments and updates continuing until a predetermined level of convergence indicative of the optimized interacting set of operational policies is reached: El-Tantawy teaches that an agent may iteratively receive a feedback reward for its generated control policy for the environment as a whole, and not merely for the agent's individual intersection (Paragraphs 26-27), which reads on a sequence of simulated environments, in the form of the intersections, and updates, in the form of the feedback rewards, continuing until the claimed level of convergence is reached.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ALEXANDRIA JOSEPHINE MILLER whose telephone number is (703)756-5684. The examiner can normally be reached Monday-Thursday: 7:30 - 5:00 pm, every other Friday 7:30 - 4:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Mariela Reyes can be reached on (571) 270-1006. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/A.J.M./Examiner, Art Unit 2142
/Mariela Reyes/Supervisory Patent Examiner, Art Unit 2142