Prosecution Insights
Last updated: April 19, 2026
Application No. 17/566,953

METHOD AND APPARATUS FOR MULTIPLE REINFORCEMENT LEARNING AGENTS IN A SHARED ENVIRONMENT

Current status: Non-Final Office Action (§103), OA Round 3
Filed: Dec 31, 2021
Examiner: CAMPOS, ALFREDO
Art Unit: 2129
Tech Center: 2100 — Computer Architecture & Software
Assignee: The Governing Council of the University of Toronto

Predictions
Grant probability: 83% (favorable); 99% with an examiner interview
Expected OA rounds: 3-4
Expected time to grant: 3y 9m

Examiner Intelligence

Career allow rate: 83% (5 granted / 6 resolved; +28.3% vs Tech Center average, above average)
Interview lift: +33.3% for resolved cases with an interview
Typical timeline: 3y 9m average prosecution; 26 applications currently pending
Career history: 32 total applications across all art units

Statute-Specific Performance

§101: 33.3% (-6.7% vs TC avg)
§103: 42.8% (+2.8% vs TC avg)
§102: 3.9% (-36.1% vs TC avg)
§112: 20.0% (-20.0% vs TC avg)

Tech Center averages are estimates. Based on career data from 6 resolved cases.

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments

Applicant's arguments filed 11/11/2025 have been fully considered but they are not persuasive. Regarding the §103 arguments at pages 8-10 of Applicant's remarks, Applicant argues: "Without conceding to the correctness of the Examiner's rejection but solely in order to expedite prosecution of the instant application, Applicant has amended claim 1 to further define 'wherein fairness among the multiple RL agents operating in the shared computing environment is achieved through the updated global policy distributed to the multiple RL agents without inter-agent communication.' Applicant has also amended claim 12 to substantially align with the above amendments to claim 1. Support for these amendments can be found throughout the application as originally filed. Applicant asserts that neither Ma nor Kim has disclosed, suggested, or hinted at 'fairness among the multiple RL agents operating in the shared computing environment is achieved through the updated global policy distributed to the multiple RL agents without inter-agent communication,' as defined in amended claims 1 and 12."

Applicant argues that claim 1 and analogous claim 12 have been amended to overcome the art. However, the amended limitations have not previously been examined, rendering the argument moot.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C.
103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1, 5, 7, 12, 16, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Ziyuan Ma, Yudong Luo, and Hang Ma, Distributed Heuristic Multi-Agent Path Finding with Communication, 2021 IEEE International Conference on Robotics and Automation (ICRA), IEEE Press, 8699-8705 ("Ma") in view of Kim et al., Learning to Schedule Communication in Multi-agent Reinforcement Learning (2019) ("Kim").

Regarding claim 1 and analogous claim 12, Ma teaches a method for training multiple reinforcement learning (RL) agents deployed in a shared computing environment (Ma Page 8699, I. Introduction, Para. 1: Multi-Agent Path Finding [1] is a path arrangement problem for a team of agents [multiple reinforcement learning agents]. Each agent is required to move from its given start location to its given goal location while avoiding collisions with other agents. MAPF arises in many real world applications of multi-agent systems, such as warehouse and office robots [2], [3], aircraft-towing vehicles [4], and video games. Page 8701, V. Learning Environment, A. Environment Setup, Para. 1, Lines 1-7: We build a discrete grid world environment for MAPF, where each agent only has partial observability [shared environment]. Many real world robot applications can be naturally transformed into a partially observable grid world by each agent being equipped with a radar sensor to localize itself and detect surroundings. Formally, the entire space is a m x m binary matrix, where 0 represents a free location and 1 is an obstacle.),

the shared computing environment including one or more execution devices including one or more processors, one or more memories and [one or more communications interfaces] (Ma, C. Multi-Agent Distributed Prioritized Experience Replay, Para. 3, Lines 1-3: Fig. 3 illustrates our system flow. In the experiments, we set up sixteen independent actors running on CPUs to generate data, and a single learner on GPU to train. Each actor has a copy of the environment with the current Q-network and keeps generating new transitions from multiple agents and initializing priorities for them [the shared computing environment including one or more execution devices including one or more processors]. The transitions from all actors are fed into a shared prioritized replay buffer [one or more memories].) (Examiner Note: The processors are considered to be execution devices and are themselves processors),

the method comprising: executing, by the one or more execution devices, during an episode including one or more steps and associated with the shared computing environment, each of the multiple RL agents operating on the one or more execution devices and behaving based at least in part on a global policy stored in the one or more memories throughout the episode (Ma Page 8703, VI. Distributed Heuristic Learning with Communication, C. Multi-Agent Distributed Prioritized Experience Replay, Para. 1, Lines 10-16: In MAPF, agents have their individual goals instead of a common goal, thus IQL is more suitable for this problem compared with centralized MARL. The appealing merit of IQL is that it avoids the scalability problem in centralized training, which requires learning a Q-function for joint actions over all agents. The joint action space grows exponentially as the number of agents increases. On the other hand, IQL is naturally appropriate for learning a decentralized policy in partially observable settings, because each agent makes decisions only based on its own observation. As each agent in the MAPF environment plays the same role as others, to simplify the training process, instead of learning multiple policies for multiple agents, we train a single model from a single agent's perspective while treating others as part of its environment. The final trained policy [global policy] can be applied to each agent [multiple RL agents] for decentralized execution [behaving based at least in part on a global policy] (i.e., each of the RL agents acts based on the global policy). VII. Experiments, Para. 3: For other parameter settings during training, the obstacle density of the environment is sampled from a triangular distribution between 0 and 0.5 with a peak at 0.33 (same as PRIMAL). The FOV size is 9x9 (10x10 in PRIMAL, we make it odd) [shared environment]. The maximum episode length is 256 (i.e., each episode has multiple steps). We train the network with a batch size of 192 and a sequence length of 20 (limited by memory). We use a dynamic learning rate beginning at 10^-4 and decreasing by fifty percent at the 100k step and the 300k step. The maximum training step is 500k.);

creating, by a processor of the one or more execution devices associated with each of the multiple RL agents during the episode, experience tuples, each experience tuple created at an end of each step (Ma Page 8701, D. Reward Design: Motivated by the common reward design that agents are punished every step [at the end of each step] for not staying on goal to facilitate goal reaching, we design our reward function as shown in Table I. Different from PRIMAL and MAPPER, where agents are penalized more for staying still, we treat every movement and staying (if not on the goal) as the same because in complex cases, one agent should stop and let another agent pass first in order to avoid collision [each of the multiple RL agents during the episode, experience tuples, each experience tuple created at an end of each step]. Page 8703, C. Multi-Agent Distributed Prioritized Experience Replay, Para. 3, Lines 1-3: Fig. 3 illustrates our system flow. In the experiments, we set up sixteen independent actors running on CPUs to generate data, and a single learner on GPU to train [by a processor of the one or more execution devices associated].);

storing, by the one or more execution devices associated with each of the multiple RL agents, the experience tuples in a shared experience replay buffer [via the one or more communications interfaces], the shared experience replay buffer being implemented in a block of memory that stores the experience tuples and shared by the multiple RL agents [via the one or more communications interfaces] throughout the episode and a next episode (Ma Page 8703, C. Multi-Agent Distributed Prioritized Experience Replay, Para. 3: Fig. 3 illustrates our system flow. In the experiments, we set up sixteen independent actors running on CPUs to generate data, and a single learner on GPU to train. Each actor has a copy of the environment with the current Q-network and keeps generating new transitions from multiple agents and initializing priorities for them. The transitions from all actors are fed into a shared prioritized replay buffer [storing, by the processor of the one or more execution devices associated with each of the multiple RL agents, the experience tuples in a shared experience replay buffer]. Then the learner samples the most useful experiences from the buffer and updates the network and the priorities of the experience. Note that although the model is trained for a single agent, the transitions of all the agents need to be stored for communication purposes, and the priorities are initialized and updated from that agent's perspective. As priorities are shared, the good experiences explored by any actor can improve the learner [the shared experience replay buffer being implemented in a block of memory that stores the experience tuples and shared by the multiple RL agents] [throughout the episode and a next episode].);

wherein after the episode the method further includes: updating, by (Ma Page 8703, C. Multi-Agent Distributed Prioritized Experience Replay, Para. 3, Lines 1-14: Fig. 3 illustrates our system flow. In the experiments, we set up sixteen independent actors running on CPUs to generate data, and a single learner on GPU to train [by a processor of the one or more execution devices of one of the multiple RL agents]. The transitions from all actors are fed into a shared prioritized replay buffer. Then the learner samples the most useful experiences [experience tuples] from the buffer and updates the network and the priorities of the experience. Note that although the model is trained for a single agent, the transitions of all the agents need to be stored for communication purposes, and the priorities are initialized and updated from that agent's perspective. As priorities are shared, the good experiences explored by any actor can improve the learner [the global policy based on sampled experience tuples drawn from the shared experience replay buffer].); distributing, [by the one or more communications interfaces of one of the multiple RL agents], the updated global policy to the multiple RL agents, wherein each of the multiple RL agents behaves based at least in part on the updated global policy in the next episode ([parameter-update figures from Ma omitted]; i.e., the updated parameters are the updated policy).
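For readers less familiar with the replay mechanism at the heart of this rejection, Ma's actor/learner flow (many actors feeding one shared prioritized buffer; a single learner sampling by priority and refreshing priorities) can be sketched in a few lines. This is an illustrative sketch only; the class and method names are hypothetical, not Ma's implementation:

```python
import random
from collections import namedtuple

# An experience tuple as commonly defined in replay-based RL:
# (state, action, reward, next_state).
Transition = namedtuple("Transition", "state action reward next_state")

class SharedPrioritizedReplayBuffer:
    """One buffer shared by all actors; the learner samples by priority."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []        # stored transitions
        self.priorities = []  # one priority per transition

    def add(self, transition, priority):
        # Actors push transitions with an initial priority.
        if len(self.data) >= self.capacity:
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append(priority)

    def sample(self, batch_size):
        # The learner draws transitions with probability proportional to priority.
        indices = random.choices(range(len(self.data)),
                                 weights=self.priorities, k=batch_size)
        return indices, [self.data[i] for i in indices]

    def update_priorities(self, indices, new_priorities):
        # After a learning step, the learner refreshes the sampled priorities.
        for i, p in zip(indices, new_priorities):
            self.priorities[i] = p

# Sixteen actors feed one shared buffer; a single learner samples from it.
buffer = SharedPrioritizedReplayBuffer(capacity=10_000)
for actor_id in range(16):
    buffer.add(Transition(state=0, action=1, reward=-0.1, next_state=2),
               priority=1.0)
indices, batch = buffer.sample(batch_size=4)
buffer.update_priorities(indices, [0.5] * len(indices))
```

A production implementation would use a sum-tree for O(log n) prioritized sampling; `random.choices` is enough here to show the data flow between actors and learner.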
Ma does not explicitly teach [the shared computing environment including one or more execution devices including one or more processors, one or more memories and] one or more communications interfaces; [storing, by the processor of the one or more execution devices associated with each of the multiple RL agents, the experience tuples in a shared experience replay buffer] via the one or more communications interfaces, [the shared experience replay buffer being implemented in a block of memory that stores the experience tuples and shared by the multiple RL agents] via the one or more communications interfaces [throughout the episode and a next episode]; [distributing,] by the one or more communications interfaces of one of the multiple RL agents, [the updated global policy to the multiple RL agents], wherein each of the multiple RL agents behaves based at least in part on [the updated global policy] in the next episode; wherein fairness among the multiple RL agents operating in the shared computing environment is achieved through the updated global policy distributed to the multiple RL agents without inter-agent communication.

However, Kim teaches [the shared computing environment including one or more execution devices including one or more processors, one or more memories and] one or more communications interfaces (Kim Page 2, Contributions, Lines 7-12: While prior work on MARL to date considers only the limited bandwidth constraint, we additionally address the shared medium contention issue in what we believe is the first work of its kind: which nodes are granted access to the shared medium. Intuitively, nodes with more important observations should be chosen, for which we adopt a simple yet powerful mechanism called weight-based scheduler (WSA), designed to reconcile simplicity in training with integrity of reflecting real-world MAC protocols in use (e.g., 802.11 Wi-Fi) [one or more communications interfaces]);

[storing, by the processor of the one or more execution devices associated with each of the multiple RL agents, the experience tuples in a shared experience replay buffer] via the one or more communications interfaces, [the shared experience replay buffer being implemented in a block of memory that stores the experience tuples and shared by the multiple RL agents] via the one or more communications interfaces [throughout the episode and a next episode] (Kim: Since distributed execution is one of our major operational constraints in SchedNet or other CTDE-based MARL algorithms, Top(k) and Softmax(k) should be realizable via a weight-based mechanism in a distributed manner. In fact, this has been an active research topic to date in wireless networking, where many algorithms exist (Tassiulas & Ephremides, 1992; Yi et al., 2008; Jiang & Walrand, 2010). Due to space limitation, we present how to obtain distributed versions of those two rules based on weights in our supplementary material. To summarize, using so-called CSMA (Carrier Sense Multiple Access) (Kurose, 2005), which is a fully distributed MAC scheduler and forms a basis of Wi-Fi, given agents' weight values, it is possible to implement Top(k) and Softmax(k). Our goal is to train agents so that every time each agent takes an action, only Ksched agents can broadcast their messages with limited size Lband with the goal of receiving the highest cumulative reward via cooperation. Each agent should determine a policy described by its scheduling weights, encoded communication messages, and actions. Page 12, Algorithm 1 [algorithm listing image omitted]) (i.e., the method trains agents in a multi-agent reinforcement learning method that stores the agents' experiences via communication devices);

[distributing,] by the one or more communications interfaces of one of the multiple RL agents, [the updated global policy to the multiple RL agents, wherein each of the multiple RL agents behave based at least in part on the updated global policy in the next episode], wherein fairness among the multiple RL agents operating in the shared computing environment is achieved through the updated global policy distributed to the multiple RL agents without inter-agent communication (Kim Page 5, 1 Introduction, Para. 4, Contributions: In this paper, we propose a new deep multi-agent reinforcement learning architecture called SchedNet, with the rationale of centralized training and distributed execution in order to achieve a common goal better via decentralized cooperation. During distributed execution, agents are allowed to communicate over wireless channels where messages are broadcast to all agents in each agent's communication range. This broadcasting feature of wireless communication necessitates a Medium Access Control (MAC) protocol to arbitrate contending communicators in a shared medium. CSMA (Carrier Sense Multiple Access) in Wi-Fi is one such MAC protocol [communication interface]. Page 12 [image omitted]. Page 17, D.1 Carrier Sense Multiple Access (CSMA): Depending on how to choose the backoff and holding times, there can be many variants of CSMA that work for various purposes such as fairness and throughput. Two examples of these, Top(k) and Softmax(k), are introduced in the following sections. D.2 A version of distributed Top(k): In this subsection, we introduce a simple distributed scheduling algorithm, called Distributed Top(k), which can work with SchedNet-Top(k). It is based on CSMA, where each sender determines backoff and holding times as follows. In SchedNet, each agent generates the scheduling weight w based on its own observation. The agent sets its backoff time as 1 - w, where w is its schedule weight, and it waits for the backoff time before it tries to broadcast its message. Once it successfully broadcasts the message, it immediately releases the channel. Thus, the agent with the highest w can grab the channel in a decentralized manner without any message passing. By repeating this k times, we can realize decentralized Top(k) scheduling [wherein fairness among the multiple RL agents operating in the shared computing environment is achieved through the updated global policy distributed to the multiple RL agents without inter-agent communication]) (Examiner Note: The policy is updated by updating the critic, which updates the actor, and this is done without inter-agent communication).

Ma and Kim are both considered to be analogous to the claimed invention because they are in the same field of invention of multi-agent reinforcement learning. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Ma to incorporate the teachings of Kim and disclose communicating between agents in a fair method. Doing so would ensure that agents are able to make sequential decisions that accelerate coordination in a distributed manner and behave as a group (Kim Abstract: Many real-world reinforcement learning tasks require multiple agents to make sequential decisions under the agents' interaction, where well-coordinated actions among the agents are crucial to achieve the target goal better at these tasks. One way to accelerate the coordination effect is to enable multiple agents to communicate with each other in a distributed manner and behave as a group. In this paper, we study a practical scenario when (i) the communication bandwidth is limited and (ii) the agents share the communication medium so that only a restricted number of agents are able to simultaneously use the medium, as in the state-of-the-art wireless networking standards. This calls for a certain form of communication scheduling. In that regard, we propose a multi-agent deep reinforcement learning framework, called SchedNet, in which agents learn how to schedule themselves, how to encode the messages, and how to select actions based on received messages. SchedNet is capable of deciding which agents should be entitled to broadcasting their (encoded) messages, by learning the importance of each agent's partially observed information. We evaluate SchedNet against multiple baselines under two different applications, namely, cooperative communication and navigation, and predator-prey. Our experiments show a non-negligible performance gap between SchedNet and other mechanisms such as the ones without communication and with vanilla scheduling methods, e.g., round robin, ranging from 32% to 43%.).

Regarding claim 5 and analogous claim 16, Ma in view of Kim disclose all the elements of the method of claim 1 and analogous claim 12. Ma further discloses wherein the shared experience buffer is a prioritized experience replay buffer (Ma Page 8703, VI. Distributed Heuristic Learning with Communication, C. Multi-Agent Distributed Prioritized Experience Replay, Para. 3, Lines 6-7: The transitions from all actors are fed into a shared prioritized replay buffer [a prioritized experience replay buffer]).

Regarding claim 7 and analogous claim 18, Ma in view of Kim disclose all the elements of the method of claim 5 and analogous claim 16.
Ma further discloses wherein the multiple RL agents are trained using a multi-staged process, the multi-staged process progressing in difficulty from stage to stage, each stage adding a challenging characteristic associated with the shared computing environment (Ma Page 8703, VII. Experiments, Para. 1: Learning directly from a large size environment with lots of agents [the multiple RL agents] is hard. Instead, we use a curriculum learning method by gradually introducing more difficult tasks [multi-staged process progressing in difficulty] to agents [20]. Starting from an easy task with only one agent in a 10x10 environment, we establish two new challenging tasks for agents by increasing the agent amount by one or increasing the environment size by five [new challenging characteristic associated with the shared environment], if the success rate of the current task exceeds 0.9. As the training scale grows, the final task with twelve agents in a 40x40 environment is reached.).

Claim(s) 3 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Ma in view of Kim and further in view of Budden et al. (US 2020/0265305 A1) ("Budden").

Regarding claim 3 and analogous claim 14, Ma in view of Kim disclose all the elements of the method of claim 1 and analogous claim 12. Ma and Kim are combined in the same rationale as set forth above with respect to claim 1 and analogous claim 12. Ma does not explicitly disclose wherein each [experience tuple] includes a state of [the shared computing environment] at the beginning of each step, an action taken during each step, a state of the shared computing environment at the end of each step and a reward obtained at the end of each step.

However, Budden discloses wherein each [experience tuple] includes a state of [the shared computing environment] at the beginning of each step, an action taken during each step, a state of the shared computing environment at the end of each step and a reward obtained at the end of each step (Budden Para. 0046: The actor computing unit 112 is configured to receive an observation characterizing a current state of the environment instance 132 and to select an action to be performed by the agent 102 using the action selection neural network replica 118 and in accordance with current values of the network parameters. The actor computing unit 112 then obtains transition data characterizing the environment instance subsequent to the agent 102 performing the selected action. For example, the transition data includes the next observation characterizing the next state of the environment instance 132, i.e., the state that the environment instance 132 transitioned into as a result of the agent performing the selected action. The transition data also includes a reward, which is a numeric value that is received from the environment as a result of the agent 102 performing the selected action. The actor computing unit 112 then generates an experience tuple from the observation, the selected action, and the transition data. An experience tuple, which can be denoted as (S_t, A_t, r_{t+i+1}, S_{t+1}), includes the current observation S_t [a state of the shared environment], the selected action A_t [an action taken during each step], the reward r_{t+i+1} [a reward obtained at the end of each step], and the next observation S_{t+1} [a state of the shared environment at the end of each step] that characterizes the next state of the environment after the selected action was performed.).

Ma and Budden are considered to be analogous to the claimed invention because they are in the same field of invention of multi-agent reinforcement learning methods. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Ma to incorporate the teachings of Budden and disclose an experience tuple with the state of the shared environment, the action taken, the reward, and the end state. Doing so would allow the distributed architecture to work in different systems without limiting performance (Budden Para. 0009, Lines 8-24: In addition, by sharing experiences instead of sharing gradients as in some conventional schemes, the system can batch all communications with the centralized replay memory (for example, the system can select batches comprising multiple experience tuples from the replay memory, and use the batch to update the neural network), thereby increasing the efficiency and throughput of the system. With this approach, it is possible for actors and learners to run in different data-centers without limiting performance.).

Claim(s) 4 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Ma in view of Kim and further in view of Sartoretti et al., PRIMAL: Pathfinding via Reinforcement and Imitation Multi-Agent Learning (2019) ("Sartoretti").

Regarding claim 4 and analogous claim 15, Ma in view of Kim disclose all the elements of the method of claim 1 and analogous claim 12. Ma and Kim are combined in the same rationale as set forth above with respect to claim 1 and analogous claim 12. Ma further teaches wherein the global policy is updated in a form of gradient descent [such that an adjustment is made to one or more parameters of the global policy based at least in part on gradients of a loss function with respect to each of the one or more parameters] (Ma Page 8703, VI. Distributed Heuristic Learning with Communication, C. Multi-Agent Distributed Prioritized Experience Replay, Para. 4: The final loss function is a multi-step TD error [loss equation image omitted] [the global policy is updated in a form of gradient descent] with R_t = r_t + γ·r_{t+1} + … + γ^n·Q(s_{t+n}, a_{t+n}; θ⁻), where R_t is the total return of the agent we care about, s_t and a_t are the state and action of that agent, and θ⁻ denotes the parameters of the target network, a periodical copy of the online parameters θ.).
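The multi-step return in Ma's TD target reduces to a short computation. The function below is an illustrative sketch (the name and example numbers are made up, not Ma's code); the TD loss would then penalize the gap between this target and the online estimate Q(s_t, a_t; θ):

```python
def n_step_return(rewards, gamma, bootstrap_q):
    """Multi-step TD target:
    R_t = r_t + gamma*r_{t+1} + ... + gamma^n * Q(s_{t+n}, a_{t+n}; theta^-),
    where bootstrap_q stands in for the target-network value Q(...; theta^-)."""
    target = 0.0
    for k, r in enumerate(rewards):              # r_t ... r_{t+n-1}
        target += (gamma ** k) * r
    target += (gamma ** len(rewards)) * bootstrap_q  # gamma^n * Q(...)
    return target

# Example: a 3-step return with gamma = 0.9 and a bootstrap value of 1.0.
R = n_step_return([0.5, 0.0, -0.1], gamma=0.9, bootstrap_q=1.0)
```

With an empty reward list the function degenerates to the pure bootstrap value, i.e. the 0-step case.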
Ma however does not explicitly teach [wherein the global policy is updated in a form of gradient descent] such that an adjustment is made to one or more parameters of the global policy based at least in part on gradients of a loss function with respect to each of the one or more parameters] However Sartoretti [wherein the global policy is updated in a form of gradient descent] such that an adjustment is made to one or more parameters of the global policy based at least in part on gradients of a loss function with respect to each of the one or more parameters (To update the policy, we use an approximation of the advantage function by bootstrapping using the value function: PNG media_image6.png 27 436 media_image6.png Greyscale (where k is bounded by the batch size T). We also add an entropy term H( π (o)) to the policy loss, which has been shown to encourage exploration and discourage premature convergence [34] by penalizing a policy that always chooses the same actions. The policy loss reads PNG media_image7.png 62 473 media_image7.png Greyscale with a small entropy weight σ H ( σ H = 0.01 in practice). We rely on two additional loss functions which help to guide and stabilize training. First, the blocking prediction output is updated by minimizing Lblocking, the log likelihood of predicting incorrectly. Second, we define the loss function Lvalid to minimize the log likelihood of selecting an invalid move [11] [such that an adjustment is made to one or more parameters of the global policy based at least in part on gradients of a loss function with respect to each of the one or more parameters], as mentioned in Section III-B.). Ma and Sartoretti are both considered to be analogous to the claimed invention because they are in the same field invention of using multi-agent reinforcement learning. 
Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Ma to incorporate the teachings of Sartoretti and disclose adjusting one or more parameters of the global policy based on the loss function. Doing so to ensure that agents are able to make moves that benefit not just themselves but others. (Sartoretti I. Introduction Para. 2 Line 6-15, In this framework, agents learn to take into account the consequences of their position on other agents, in order to favor movements that will benefit the whole team and not only themselves. That is, by simultaneously learning to plan efficient single-agent paths (mostly via RL), and to imitate a centralized expert (IL), agents ultimately learn a decentralized policy where they still exhibit implicit coordination during online path planning without the need for explicit communication among agents.). Claim(s) 6 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Ma in view of Kim and further in view of Schaul et al, Prioritized Experience Replay, (2016) (“Schaul”). Regarding claim 6 analogous 17, Ma in view of Kim disclose all the elements of the method of claim 5 and analogous 16. Ma and Kim are combined in the same rational as set forth above with respect to claim 1 and analogous claim 12. Ma further discloses wherein the sampled experience tuples are drawn from the shared experience replay buffer [based on unexpectedness] of each experience tuple stored in the shared experience buffer (Ma Page 8703 VI. Distributed Heuristic Learning with Communication C. Multi-Agent Distributed Prioritized Experience Para. 3 Line 6-13, The transitions from all actors are fed into a shared prioritized replay buffer [each experience tuple stored in the shared experience buffer]. Then learner samples the most useful experiences [experience tuples] from the buffer [shared experience replay buffer]). 
Then learner samples the most useful experiences from the buffer and updates the network and priorities of the experience. Note that although the model is trained for a single agent, the transitions of all the agents need to be stored for communication purpose, and the priorities are initialized and updated from that agent’s perspective). Ma does not explicitly disclose [wherein the sampled experience tuples are drawn from the shared experience replay buffer] based on unexpectedness [of each experience tuple stored in the shared experience buffer] based on unexpectedness of each [experience tuple stored in the shared experience buffer], the unexpectedness of each experience tuple determined based on the global policy prior to updating. However Schaul discloses [wherein the sampled experience tuples are drawn from the shared experience replay buffer] based on unexpectedness of each [experience tuple stored in the shared experience buffer], the unexpectedness of each experience tuple determined based on the global policy prior to updating (Schaul et al. Page 4 Para 1 to 2, page 5 Algorithm 1, However, greedy TD-error prioritization has several issues. First, to avoid expensive sweeps over the entire replay memory, TD errors are only updated for the transitions (i.e. experience tuple) that are replayed. One consequence is that transitions that have a low TD error on first visit may not be replayed for a long time (which means effectively never with a sliding window replay memory). Further, it is sensitive to noise spikes (e.g. when rewards are stochastic), which can be exacerbated by bootstrapping, where approximation errors appear as another source of noise. Finally, greedy prioritization focuses on a small subset of the experience: errors shrink slowly, especially when using function approximation, meaning that the initially high error transitions get replayed frequently. This lack of diversity that makes the system prone to over-fitting. 
To overcome these issues, we introduce a stochastic sampling method [unexpectedness] that interpolates between pure greedy prioritization and uniform random sampling. We ensure that the probability of being sampled is monotonic in a transition’s priority, while guaranteeing a non-zero probability even for the lowest-priority transition. Concretely, we define the probability of sampling transition i as P(i) = p_i^α / Σ_k p_k^α, where p_i > 0 is the priority of transition i. The exponent α determines how much prioritization is used, with α = 0 corresponding to the uniform case. [Schaul Page 5, Algorithm 1 pseudocode] [the unexpectedness of each experience tuple determined based on the global policy prior to updating]). Ma and Schaul are considered to be analogous to the claimed invention because they are in the same field of invention of reinforcement learning methods. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Ma to incorporate the teachings of Schaul and disclose drawing experience tuples based on a stochastic process. Doing so increases performance and avoids expensive sweeps over the entire replay memory. (Schaul Page 3 3.3 Stochastic Prioritization, However, greedy TD-error prioritization has several issues. First, to avoid expensive sweeps over the entire replay memory, TD errors are only updated for the transitions that are replayed.) Claim(s) 8 are rejected under 35 U.S.C. 103 as being unpatentable over Ma in view of Kim and further in view of Matiisen et al., Teacher-Student Curriculum Learning (2017) (“Matiisen”). Regarding claim 8, Ma in view of Kim disclose all the elements of the method of claim 7 and analogous claim 18. Ma and Kim are combined under the same rationale as set forth above with respect to claim 1 and analogous claim 12.
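As an illustrative aside, the stochastic prioritization quoted from Schaul above can be sketched in a few lines of Python. This is a hypothetical sketch of the sampling rule P(i) = p_i^α / Σ_k p_k^α, not code from any reference of record; the function name and priority values are invented for illustration.

```python
import random

def sample_transition(priorities, alpha=0.6):
    """Draw a transition index with probability P(i) = p_i^alpha / sum_k p_k^alpha.

    alpha = 0 reduces to uniform sampling; larger alpha approaches greedy
    TD-error prioritization while still giving every positive-priority
    transition a non-zero chance of being replayed.
    """
    weights = [p ** alpha for p in priorities]
    total = sum(weights)
    r = random.uniform(0.0, total)
    cumulative = 0.0
    for i, w in enumerate(weights):
        cumulative += w
        if r <= cumulative:
            return i
    return len(weights) - 1  # guard against floating-point round-off
```

With alpha = 0 every transition is equally likely; as alpha grows, high-priority transitions dominate, matching the interpolation between uniform and greedy sampling described in the quoted passage.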
Ma does not explicitly teach wherein [the shared experience replay buffer] is divided into multiple segments, each segment associated with each stage of [the multi-staged process]. However Matiisen discloses wherein [the shared experience replay buffer] is divided into multiple segments, each segment associated with each stage of [the multi-staged process] (Matiisen Page 3-5 3 Algorithms Para 2, Figure 2 is a demonstration of the ideal training progress in a curriculum learning setting (i.e. multi-stage process): 1. At first, the Teacher has no knowledge so it samples from all tasks uniformly. 2. When the Student starts making progress on task 1, the Teacher allocates more probability mass to this task. 3. When the Student masters task 1, its learning curve flattens and the Teacher samples the task less often. At this point Student also starts making progress on task 2, so the Teacher samples more from task 2. 4. This continues until the Student masters all tasks. As all task learning curves flatten in the end, the Teacher returns to uniform sampling of the tasks. 3.4 Sampling algorithm The previous algorithms require tuning of hyperparameters to balance exploration. To get rid of exploration hyperparameters, we take inspiration from Thompson sampling. The Sampling algorithm keeps a buffer of last K rewards for each task [divided into multiple segments]. To choose the next task, a recent reward is sampled from each task’s K-last-rewards buffer [each segment associated with each stage of the multi-staged process]. Then whichever task yielded the highest sampled reward is chosen. This makes exploration a natural part of the algorithm: tasks that have recently had high rewards are sampled more often.). Ma and Matiisen are considered to be analogous to the claimed invention because they are in the same field of invention of reinforcement learning methods.
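For illustration only, the “Sampling algorithm” quoted from Matiisen above — keep a buffer of the last K rewards per task, draw one recent reward from each buffer, and pick the task with the highest draw — might be sketched as follows. The task names and reward values are invented, not taken from the record.

```python
import random
from collections import deque

def choose_task(reward_buffers):
    """Thompson-style task selection: sample one recent reward from each
    task's K-last-rewards buffer and return the task with the highest draw.
    Tasks with recently high rewards are naturally sampled more often."""
    draws = {task: random.choice(list(buf)) for task, buf in reward_buffers.items()}
    return max(draws, key=draws.get)

# one bounded buffer (deque with maxlen = K) per curriculum task
buffers = {
    "task1": deque([0.9, 0.8, 0.9], maxlen=3),  # learning curve still rising
    "task2": deque([0.1, 0.2, 0.3], maxlen=3),  # little recent progress
}
```

Here every recent reward for task1 exceeds every recent reward for task2, so choose_task always returns "task1"; as task2's recent rewards climb, it starts winning draws, which is the built-in exploration the quoted passage describes.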
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Ma to incorporate the teachings of Matiisen and disclose dividing the buffer into multiple sections that contain previous-stage experiences. Doing so allows the agents to learn from simpler tasks when performing harder tasks. (Matiisen 1 Introduction Para 1 line 7-10, where tasks are ordered by increasing difficulty and training only proceeds to harder tasks once easier ones are mastered. Curriculum learning helps when after mastering a simpler task the policy for a harder task is discoverable through random exploration). Claim(s) 9 are rejected under 35 U.S.C. 103 as being unpatentable over Ma in view of Kim and further in view of Li et al., "TCP-NeuRoc: Neural Adaptive TCP Congestion Control With Online Changepoint Detection," in IEEE Journal on Selected Areas in Communications, vol. 39, no. 8, pp. 2461-2475, Aug. 2021 (“Li”). Regarding claim 9, Ma in view of Kim disclose all the elements of the method of claim 7 and analogous claim 17. Ma and Kim are combined under the same rationale as set forth above with respect to claim 1 and analogous claim 12. Ma does not explicitly teach wherein the multi-staged process [includes a bootstrapping stage, an advancing stage, a fairness training stage]. However Li teaches wherein [the multi-staged process] includes a bootstrapping stage, an advancing stage, a fairness training stage. (Li Page 2468 C. Cold-Started Deployment Strategy The existing learning-based congestion control methods need to collect large amount data from the interaction between agent and network environment to train the model, which may cost several hours [16], [32] to several days [2], [38], hindering the model from being really deployed. In our implementation, we propose a deployment strategy to achieve cold-start (i.e. the multi-staged process).
1) Initialization [a bootstrapping stage]: the neural networks A and Q are initialized by random parameters, and we set the default action for congestion window adjustment by cwnd = at ∗ Thrputmax ∗ RTTmin with at randomly drawn from {0.75, 1, 1.25}, which is similar to the window adjustment policy of BBR [3]. 2) Exploration: in the beginning [an advancing stage], whenever an agent is triggered, with probability ρ it takes a default action (to mimic the behavior of BBR), and with probability 1 − ρ it takes the action return by the unoptimized policy network A (for exploration). We set ρ = 0.95 in our system. The transition experiences are stored in the replay buffer. 3) Training [a fairness training stage]: while the replay buffer is keeping filled by the observations, the agent is trained at runtime by a background daemon process using algorithm 2. 4) Execution: after training to convergence, we set the probability ρ = 0, so that the agent takes all actions generated by the policy network. The model parameters are periodically updated during execution.) (i.e. each segment associated with each stage of the multi-staged process). Ma and Li are considered to be analogous to the claimed invention because they are in the same field of invention of reinforcement learning methods. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Ma to incorporate the teachings of Li and disclose a multi-stage process that progressively allows agent training starting from an initial stage. Doing so allows the agents to collect the large amounts of data needed to train the model, which would normally take days or longer (Li, C.
Cold-Started Deployment Strategy, The existing learning-based congestion control methods need to collect large amount data from the interaction between agent and network environment to train the model, which may cost several hours [16], [32] to several days [2], [38], hindering the model from being really deployed. In our implementation, we propose a deployment strategy to achieve cold-start.). Claim(s) 10, 19, 20 are rejected under 35 U.S.C. 103 as being unpatentable over Ma in view of Kim and Matiisen, further in view of Li. Regarding claim 10, Ma as modified by Kim and Matiisen disclose all the elements of the method of claim 8. Ma, as modified by Kim and Matiisen, are combinable for the same rationale as set forth above with respect to claim 8. Ma as modified by Kim and Matiisen fail to disclose wherein the multi-staged process further includes an online stage. However Li discloses wherein the multi-staged process further includes an online stage (Li Page 2468-2469, C. Cold-Started Deployment Strategy The existing learning-based congestion control methods need to collect large amount data from the interaction between agent and network environment to train the model, which may cost several hours [16], [32] to several days [2], [38], hindering the model from being really deployed. In our implementation, we propose a deployment strategy to achieve cold-start. 4) Execution [an online stage]: after training to convergence, we set the probability ρ = 0, so that the agent takes all actions generated by the policy network. The model parameters are periodically updated during execution.). With such training strategy, the DRL agent can be deployed in a cold-start way, which performs like BBR at the beginning, and then gradually move to the well-trained policy network. V. Performance Evaluation A. Experiment Setup Para. 3, Network scenarios: We test the CC algorithms based on three network scenarios.
Scenario 1 (dynamic network): represents the situation with varying bandwidth due to dynamic network traffic patterns. The link latency is fixed to 20ms, the loss rate is set to 0.1%, and the available bandwidth switches randomly between 40Mbps and 80Mbps every 5 seconds (i.e. online)). A person of ordinary skill in the art would reasonably find the teaching of Li to be helpful in solving the problem of training deep reinforcement learning networks in Ma. In view of the teachings of Li it would have been obvious for a person of ordinary skill in the art to apply the teachings of Li to Ma before the effective filing date of the claimed invention to disclose a multi-stage process that progressively allows agents to learn in an online environment. Doing so allows the agents to collect the large amounts of data needed to train the model, which would normally take days or longer (Li Page 2468, C. Cold-Started Deployment Strategy, The existing learning-based congestion control methods need to collect large amount data from the interaction between agent and network environment to train the model, which may cost several hours [16], [32] to several days [2], [38], hindering the model from being really deployed. In our implementation, we propose a deployment strategy to achieve cold-start.). Regarding claim 19, Ma and Kim disclose all the elements of the method of claim 18. Ma and Kim are combined under the same rationale as set forth above with respect to claim 7 and analogous claim 18. Ma and Sartoretti do not explicitly teach wherein [the shared experience replay buffer] is divided into multiple segments, each segment associated with each stage of [the multi-staged process], and wherein the multi-staged process includes a bootstrapping stage, an advancing stage, and a fairness training stage.
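As a non-authoritative sketch of the four-stage cold-start deployment quoted from Li above (initialization, exploration with probability ρ, training, execution), the ρ-mixed action selection could look like the following. The stage names and ρ values mirror the quoted passage; the function and action placeholders are invented for illustration.

```python
import random

# rho per stage, per the quoted Li passage: the exploration stage mixes in
# the default (BBR-like) action with probability 0.95; execution sets rho = 0
STAGE_RHO = {"exploration": 0.95, "execution": 0.0}

def select_action(policy_action, default_action, stage):
    """With probability rho take the default action (mimicking BBR);
    otherwise take the action returned by the policy network."""
    rho = STAGE_RHO[stage]
    return default_action if random.random() < rho else policy_action
```

In the execution stage rho = 0, so the agent always follows the policy network, matching step 4 of the quoted deployment strategy.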
However Matiisen discloses wherein [the shared experience replay buffer] is divided into multiple segments, each segment associated with each stage of [the multi-staged process] (Matiisen Page 3-5 3 Algorithms Para 2, Figure 2 is a demonstration of the ideal training progress in a curriculum learning setting (i.e. the multi-staged process): 1. At first, the Teacher has no knowledge so it samples from all tasks uniformly. 2. When the Student starts making progress on task 1, the Teacher allocates more probability mass to this task. 3. When the Student masters task 1, its learning curve flattens and the Teacher samples the task less often. At this point Student also starts making progress on task 2, so the Teacher samples more from task 2. 4. This continues until the Student masters all tasks. As all task learning curves flatten in the end, the Teacher returns to uniform sampling of the tasks. 3.4 Sampling algorithm The previous algorithms require tuning of hyperparameters to balance exploration. To get rid of exploration hyperparameters, we take inspiration from Thompson sampling. The Sampling algorithm keeps a buffer (i.e. the shared experience replay buffer) of last K rewards for each task [divided into multiple segments]. To choose the next task, a recent reward is sampled from each task’s K-last-rewards buffer [each segment associated with each stage of the multi-staged process]. Then whichever task yielded the highest sampled reward is chosen. This makes exploration a natural part of the algorithm: tasks that have recently had high rewards are sampled more often.). Matiisen does not disclose wherein the multi-staged process includes a bootstrapping stage, an advancing stage, and a fairness training stage. However Li discloses wherein [the multi-staged process] includes a bootstrapping stage, an advancing stage, and a fairness training stage (Li Page 2468 C.
Cold-Started Deployment Strategy The existing learning-based congestion control methods need to collect large amount data from the interaction between agent and network environment to train the model, which may cost several hours [16], [32] to several days [2], [38], hindering the model from being really deployed. In our implementation, we propose a deployment strategy to achieve cold-start (i.e. the multi-staged process). 1) Initialization [a bootstrapping stage]: the neural networks A and Q are initialized by random parameters, and we set the default action for congestion window adjustment by cwnd = at ∗ Thrputmax ∗ RTTmin with at randomly drawn from {0.75, 1, 1.25}, which is similar to the window adjustment policy of BBR [3]. 2) Exploration: in the beginning [an advancing stage], whenever an agent is triggered, with probability ρ it takes a default action (to mimic the behavior of BBR), and with probability 1 − ρ it takes the action return by the unoptimized policy network A (for exploration). We set ρ = 0.95 in our system. The transition experiences are stored in the replay buffer. 3) Training [a fairness training stage]: while the replay buffer is keeping filled by the observations, the agent is trained at runtime by a background daemon process using algorithm 2. 4) Execution: after training to convergence, we set the probability ρ = 0, so that the agent takes all actions generated by the policy network. The model parameters are periodically updated during execution.). The motivation utilized in the combination of claim 10 equally applies to claim 19. Regarding claim 20, Ma in view of Kim and Matiisen and further in view of Li disclose all the elements of the method of claim 19. Ma as modified by Kim and Matiisen do not explicitly teach wherein [the multi-staged process] further includes an online stage. However Li further discloses wherein [the multi-staged process] further includes an online stage (Li Page 2468 C.
Cold-Started Deployment Strategy The existing learning-based congestion control methods need to collect large amount data from the interaction between agent and network environment to train the model, which may cost several hours [16], [32] to several days [2], [38], hindering the model from being really deployed. In our implementation, we propose a deployment strategy to achieve cold-start. 4) Execution [an online stage]: after training to convergence, we set the probability ρ = 0, so that the agent takes all actions generated by the policy network. The model parameters are periodically updated during execution.). With such training strategy, the DRL agent can be deployed in a cold-start way, which performs like BBR at the beginning, and then gradually move to the well-trained policy network. Network scenarios: We test the CC algorithms based on three network scenarios. Scenario 1 (dynamic network): represents the situation with varying bandwidth due to dynamic network traffic patterns. The link latency is fixed to 20ms, the loss rate is set to 0.1%, and the available bandwidth switches randomly between 40Mbps and 80Mbps every 5 seconds (i.e. online).). The motivation utilized in the combination of claim 10 equally applies to claim 20. Claim(s) 11 are rejected under 35 U.S.C. 103 as being unpatentable over Ma in view of Kim and further in view of Sartoretti and Schaul. Regarding claim 11, Ma in view of Kim disclose all the elements of the method of claim 1 and analogous claim 12. Ma, Kim, and Sartoretti are combined under the same rationale as set forth above with respect to claim 4 and analogous claim 15. Ma, Kim, and Schaul are combined under the same rationale as set forth above with respect to claim 6 and analogous claim 17. Ma does not explicitly teach wherein the multiple RL agents are deployed online and [the updated global policy] is distributed to the multiple RL agents at arbitrary intervals.
However Sartoretti discloses wherein the multiple RL agents are deployed online (Sartoretti Page 6 V. Results Para. 1 In this section, we present the results of an extensive set of simulations comparing PRIMAL against state-of-the-art MAPF planners in gridworlds. These tests are performed in environments with varying obstacle densities, grid sizes, and team sizes. Finally, we present experimental results for a scenario featuring both physical and simulated robots (i.e. multiple RL agents) planning paths online [are deployed online] in an indoor factory mockup). However Schaul discloses and [the updated global policy] is distributed to the multiple RL agents at arbitrary intervals (Schaul Page 1 1 INTRODUCTION Online reinforcement learning (RL) agents incrementally update their parameters (of the policy (i.e. the updated global policy), value function or model) while they observe a stream of experience. Page 4-5 Implementation: To efficiently sample from distribution (1), the complexity cannot depend on N. For the rank-based variant, we can approximate the cumulative density function with a piecewise linear function with k segments of equal probability. The segment boundaries can be precomputed (they change only when N or α change). At runtime, we sample a segment, and then sample uniformly among the transitions within it. This works particularly well in conjunction with a minibatch-based learning algorithm: choose k to be the size of the minibatch, and sample exactly one transition from each segment – this is a form of stratified sampling that has the added advantage of balancing out the minibatch (there will always be exactly one transition with high magnitude δ , one with medium magnitude, etc). The proportional variant is different, also admits an efficient implementation based on a ‘sum-tree’ data structure (where every node is the sum of its children, with the priorities as the leaf nodes), which can be efficiently updated and sampled from. 
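Purely as an illustration of the stratified, segment-based sampling that Schaul describes above, the following sketch partitions a priority-sorted replay memory into k segments and draws one transition from each. Note this simplifies Schaul's rank-based variant, which uses segments of equal probability mass rather than equal size; the names here are invented for illustration.

```python
import random

def stratified_minibatch(sorted_transitions, k):
    """Draw a minibatch of k transitions from a replay memory sorted by
    priority (highest first): split the memory into k equal-size segments
    and sample one transition uniformly from each, so every minibatch
    mixes high-, medium-, and low-priority transitions."""
    n = len(sorted_transitions)
    batch = []
    for seg in range(k):
        lo = seg * n // k
        hi = (seg + 1) * n // k
        batch.append(sorted_transitions[random.randrange(lo, hi)])
    return batch
```

Choosing k equal to the minibatch size gives exactly one draw per segment, the balancing property the quoted passage attributes to stratified sampling.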
See Appendix B.2.1 for additional details. [Schaul Page 5, Algorithm 1] (i.e. line 16, distributed to the multiple RL agents at arbitrary intervals). The motivation utilized in the combination of claim 6 equally applies to claim 11. Pertinent Prior Art The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Wu et al. (US20210334441A1) teaches a Multi-Agent Reinforcement Learning method operating in a decentralized manner without any communication. Conclusion Any inquiry concerning this communication or earlier communications from the examiner should be directed to ALFREDO CAMPOS whose telephone number is (571)272-4504. The examiner can normally be reached 7:00 - 4:00 pm M - F. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael J. Huntley, can be reached at (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/ALFREDO CAMPOS/ Examiner, Art Unit 2129 /MICHAEL J HUNTLEY/Supervisory Patent Examiner, Art Unit 2129

Prosecution Timeline

Dec 31, 2021
Application Filed
May 12, 2025
Non-Final Rejection — §103
Aug 13, 2025
Response Filed
Sep 03, 2025
Final Rejection — §103
Nov 11, 2025
Response after Non-Final Action
Dec 03, 2025
Request for Continued Examination
Dec 10, 2025
Response after Non-Final Action
Jan 06, 2026
Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12561407
ONE-PASS APPROACH TO AUTOMATED TIMESERIES FORECASTING
2y 5m to grant Granted Feb 24, 2026
Patent 12561559
Neural Network Training Method and Apparatus, Electronic Device, Medium and Program Product
2y 5m to grant Granted Feb 24, 2026
Patent 12554973
HIERARCHICAL DATA LABELING FOR MACHINE LEARNING USING SEMI-SUPERVISED MULTI-LEVEL LABELING FRAMEWORK
2y 5m to grant Granted Feb 17, 2026
Patent 12536260
SYSTEM, APPARATUS, AND METHOD FOR AUTOMATICALLY GENERATING NEGATIVE KEYSTROKE EXAMPLES AND TRAINING USER IDENTIFICATION MODELS BASED ON KEYSTROKE DYNAMICS
2y 5m to grant Granted Jan 27, 2026
Based on 4 most recent grants.

Prosecution Projections

3-4
Expected OA Rounds
83%
Grant Probability
99%
With Interview (+33.3%)
3y 9m
Median Time to Grant
High
PTA Risk
Based on 6 resolved cases by this examiner. Grant probability derived from career allow rate.
