Prosecution Insights
Last updated: April 19, 2026
Application No. 17/566,953

METHOD AND APPARATUS FOR MULTIPLE REINFORCEMENT LEARNING AGENTS IN A SHARED ENVIRONMENT

Current status: Non-Final Office Action (§103), OA Round 3
Filed: Dec 31, 2021
Examiner: CAMPOS, ALFREDO
Art Unit: 2129
Tech Center: 2100 — Computer Architecture & Software
Assignee: The Governing Council of the University of Toronto

Predictions
Grant probability: 83% (favorable); 99% with an examiner interview
Expected OA rounds: 3-4
Expected time to grant: 3y 9m

Examiner Intelligence

Career allow rate: 83% (5 granted / 6 resolved; +28.3% vs Tech Center average, above average)
Interview lift: +33.3% for resolved cases with an interview
Typical timeline: 3y 9m average prosecution; 26 applications currently pending
Career history: 32 total applications across all art units

Statute-Specific Performance

§101: 33.3% (-6.7% vs TC avg)
§103: 42.8% (+2.8% vs TC avg)
§102: 3.9% (-36.1% vs TC avg)
§112: 20.0% (-20.0% vs TC avg)

Tech Center averages are estimates. Based on career data from 6 resolved cases.

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments

Applicant's arguments filed 11/11/2025 have been fully considered but they are not persuasive. Regarding the §103 arguments at pages 8-10 of Applicant's remarks, Applicant argues: "Without conceding to the correctness of the Examiner's rejection but solely in order to expedite prosecution of the instant application, Applicant has amended claim 1 to further define 'wherein fairness among the multiple RL agents operating in the shared computing environment is achieved through the updated global policy distributed to the multiple RL agents without inter-agent communication.' Applicant has also amended claim 12 to substantially align with the above amendments to claim 1. Support for these amendments can be found throughout the application as originally filed. Applicant asserts that neither Ma nor Kim has disclosed, suggested, or hinted at 'fairness among the multiple RL agents operating in the shared computing environment is achieved through the updated global policy distributed to the multiple RL agents without inter-agent communication,' as defined in amended claims 1 and 12."

Applicant argues that claim 1 and analogous claim 12 have been amended to overcome the art. However, the amended limitations have not previously been examined, rendering the argument moot.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C.
103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1, 5, 7, 12, 16, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Ziyuan Ma, Yudong Luo, and Hang Ma, Distributed Heuristic Multi-Agent Path Finding with Communication, 2021 IEEE International Conference on Robotics and Automation (ICRA), IEEE Press, 8699-8705 ("Ma") in view of Kim et al., Learning to Schedule Communication in Multi-agent Reinforcement Learning (2019) ("Kim").

Regarding claim 1 and analogous claim 12, Ma teaches a method for training multiple reinforcement learning (RL) agents deployed in a shared computing environment (Ma Page 8699, I. Introduction, Para. 1: Multi-Agent Path Finding [1] is a path arrangement problem for a team of agents [multiple reinforcement learning agents]. Each agent is required to move from its given start location to its given goal location while avoiding collisions with other agents. MAPF arises in many real world applications of multi-agent systems, such as warehouse and office robots [2], [3], aircraft-towing vehicles [4], and video games. Page 8701, V. Learning Environment, A. Environment Setup, Para. 1, Lines 1-7: We build a discrete grid world environment for MAPF, where each agent only has partial observability [shared environment]. Many real world robot applications can be naturally transformed into a partially observable grid world by each agent being equipped with a radar sensor to localize itself and detect surroundings. Formally, the entire space is a m x m binary matrix, where 0 represents a free location and 1 is an obstacle.),

the shared computing environment including one or more execution devices including one or more processors, one or more memories and [one or more communications interfaces] (Ma, C. Multi-Agent Distributed Prioritized Experience Replay, Para. 3, Lines 1-3: Fig. 3 illustrates our system flow. In the experiments, we set up sixteen independent actors running on CPUs to generate data, and a single learner on GPU to train. Each actor has a copy of the environment with the current Q-network and keeps generating new transitions from multiple agents and initializing priorities for them [the shared computing environment including one or more execution devices including one or more processors]. The transitions from all actors are fed into a shared prioritized replay buffer [one or more memories].) (Examiner Note: The processors are considered to be execution devices and are themselves processors),

the method comprising: executing, by the one or more execution devices, during an episode including one or more steps and associated with the shared computing environment, each of the multiple RL agents operating on the one or more execution devices and behaving based at least in part on a global policy stored in the one or more memories throughout the episode (Ma Page 8703, VI. Distributed Heuristic Learning with Communication, C. Multi-Agent Distributed Prioritized Experience Replay, Para. 1, Lines 10-16: In MAPF, agents have their individual goals instead of a common goal, thus IQL is more suitable for this problem compared with centralized MARL. The appealing merit of IQL is that it avoids the scalability problem in centralized training, which requires learning a Q-function for joint actions over all agents. The joint action space grows exponentially as the number of agents increases. On the other hand, IQL is naturally appropriate for learning a decentralized policy in partially observable settings, because each agent makes decisions only based on its own observation. As each agent in the MAPF environment plays the same role as others, to simplify the training process, instead of learning multiple policies for multiple agents, we train a single model from a single agent's perspective while treating others as part of its environment. The final trained policy [global policy] can be applied to each agent [multiple RL agents] for decentralized execution [behaving based at least in part on a global policy] (i.e., each of the RL agents acts based on the global policy). VII. Experiments, Para. 3: For other parameter settings during training, the obstacle density of the environment is sampled from a triangular distribution between 0 and 0.5 with a peak at 0.33 (same as PRIMAL). The FOV size is 9x9 (10x10 in PRIMAL, we make it odd) [shared environment]. The maximum episode length is 256 (i.e., each episode has multiple steps). We train the network with a batch size of 192 and a sequence length of 20 (limited by memory). We use a dynamic learning rate beginning at 10^-4 and decreasing by fifty percent at the 100k step and the 300k step. The maximum training step is 500k.);

creating, by a processor of the one or more execution devices associated with each of the multiple RL agents during the episode, experience tuples, each experience tuple created at an end of each step (Ma Page 8701, D. Reward Design: Motivated by the common reward design that agents are punished every step [at the end of each step] for not staying on goal to facilitate goal reaching, we design our reward function as shown in Table I. Different from PRIMAL and MAPPER, where agents are penalized more for staying still, we treat every movement and staying (if not on the goal) as the same because in complex cases, one agent should stop and let another agent pass first in order to avoid collision [each of the multiple RL agents during the episode, experience tuples, each experience tuple created at an end of each step]. Page 8703, C. Multi-Agent Distributed Prioritized Experience Replay, Para. 3, Lines 1-3: Fig. 3 illustrates our system flow. In the experiments, we set up sixteen independent actors running on CPUs to generate data, and a single learner on GPU to train [by a processor of the one or more execution devices associated].);

storing, by the one or more execution devices associated with each of the multiple RL agents, the experience tuples in a shared experience replay buffer [via the one or more communications interfaces], the shared experience replay buffer being implemented in a block of memory that stores the experience tuples and shared by the multiple RL agents [via the one or more communications interfaces] throughout the episode and a next episode (Ma Page 8703, C. Multi-Agent Distributed Prioritized Experience Replay, Para. 3: Fig. 3 illustrates our system flow. In the experiments, we set up sixteen independent actors running on CPUs to generate data, and a single learner on GPU to train. Each actor has a copy of the environment with the current Q-network and keeps generating new transitions from multiple agents and initializing priorities for them. The transitions from all actors are fed into a shared prioritized replay buffer [storing, by the processor of the one or more execution devices associated with each of the multiple RL agents, the experience tuples in a shared experience replay buffer]. Then the learner samples the most useful experiences from the buffer and updates the network and the priorities of the experience. Note that although the model is trained for a single agent, the transitions of all the agents need to be stored for communication purposes, and the priorities are initialized and updated from that agent's perspective. As priorities are shared, the good experiences explored by any actor can improve the learner [the shared experience replay buffer being implemented in a block of memory that stores the experience tuples and shared by the multiple RL agents] [throughout the episode and a next episode].);

wherein after the episode the method further includes: updating, by (Ma Page 8703, C. Multi-Agent Distributed Prioritized Experience Replay, Para. 3, Lines 1-14: Fig. 3 illustrates our system flow. In the experiments, we set up sixteen independent actors running on CPUs to generate data, and a single learner on GPU to train [by a processor of the one or more execution devices of one of the multiple RL agents]. The transitions from all actors are fed into a shared prioritized replay buffer. Then the learner samples the most useful experiences [experience tuples] from the buffer and updates the network and the priorities of the experience. Note that although the model is trained for a single agent, the transitions of all the agents need to be stored for communication purposes, and the priorities are initialized and updated from that agent's perspective. As priorities are shared, the good experiences explored by any actor can improve the learner [the global policy based on sampled experience tuples drawn from the shared experience replay buffer].); distributing, [by the one or more communications interfaces of one of the multiple RL agents], the updated global policy to the multiple RL agents, wherein each of the multiple RL agents behaves based at least in part on the updated global policy in the next episode ([parameter-update figures from Ma omitted]; i.e., the updated parameters are the updated policy).
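For readers less familiar with the replay mechanism at the heart of this rejection, Ma's actor/learner flow (many actors feeding one shared prioritized buffer; a single learner sampling by priority and refreshing priorities) can be sketched in a few lines. This is an illustrative sketch only; the class and method names are hypothetical, not Ma's implementation:

```python
import random
from collections import namedtuple

# An experience tuple as commonly defined in replay-based RL:
# (state, action, reward, next_state).
Transition = namedtuple("Transition", "state action reward next_state")

class SharedPrioritizedReplayBuffer:
    """One buffer shared by all actors; the learner samples by priority."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []        # stored transitions
        self.priorities = []  # one priority per transition

    def add(self, transition, priority):
        # Actors push transitions with an initial priority.
        if len(self.data) >= self.capacity:
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append(priority)

    def sample(self, batch_size):
        # The learner draws transitions with probability proportional to priority.
        indices = random.choices(range(len(self.data)),
                                 weights=self.priorities, k=batch_size)
        return indices, [self.data[i] for i in indices]

    def update_priorities(self, indices, new_priorities):
        # After a learning step, the learner refreshes the sampled priorities.
        for i, p in zip(indices, new_priorities):
            self.priorities[i] = p

# Sixteen actors feed one shared buffer; a single learner samples from it.
buffer = SharedPrioritizedReplayBuffer(capacity=10_000)
for actor_id in range(16):
    buffer.add(Transition(state=0, action=1, reward=-0.1, next_state=2),
               priority=1.0)
indices, batch = buffer.sample(batch_size=4)
buffer.update_priorities(indices, [0.5] * len(indices))
```

A production implementation would use a sum-tree for O(log n) prioritized sampling; `random.choices` is enough here to show the data flow between actors and learner.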
Ma does not explicitly teach [the shared computing environment including one or more execution devices including one or more processors, one or more memories and] one or more communications interfaces; [storing, by the processor of the one or more execution devices associated with each of the multiple RL agents, the experience tuples in a shared experience replay buffer] via the one or more communications interfaces, [the shared experience replay buffer being implemented in a block of memory that stores the experience tuples and shared by the multiple RL agents] via the one or more communications interfaces [throughout the episode and a next episode]; [distributing,] by the one or more communications interfaces of one of the multiple RL agents, [the updated global policy to the multiple RL agents], wherein each of the multiple RL agents behaves based at least in part on [the updated global policy] in the next episode; wherein fairness among the multiple RL agents operating in the shared computing environment is achieved through the updated global policy distributed to the multiple RL agents without inter-agent communication.

However, Kim teaches [the shared computing environment including one or more execution devices including one or more processors, one or more memories and] one or more communications interfaces (Kim Page 2, Contributions, Lines 7-12: While prior work on MARL to date considers only the limited bandwidth constraint, we additionally address the shared medium contention issue in what we believe is the first work of its kind: which nodes are granted access to the shared medium. Intuitively, nodes with more important observations should be chosen, for which we adopt a simple yet powerful mechanism called weight-based scheduler (WSA), designed to reconcile simplicity in training with integrity of reflecting real-world MAC protocols in use (e.g., 802.11 Wi-Fi) [one or more communications interfaces]);

[storing, by the processor of the one or more execution devices associated with each of the multiple RL agents, the experience tuples in a shared experience replay buffer] via the one or more communications interfaces, [the shared experience replay buffer being implemented in a block of memory that stores the experience tuples and shared by the multiple RL agents] via the one or more communications interfaces [throughout the episode and a next episode] (Kim: Since distributed execution is one of our major operational constraints in SchedNet or other CTDE-based MARL algorithms, Top(k) and Softmax(k) should be realizable via a weight-based mechanism in a distributed manner. In fact, this has been an active research topic to date in wireless networking, where many algorithms exist (Tassiulas & Ephremides, 1992; Yi et al., 2008; Jiang & Walrand, 2010). Due to space limitation, we present how to obtain distributed versions of those two rules based on weights in our supplementary material. To summarize, using so-called CSMA (Carrier Sense Multiple Access) (Kurose, 2005), which is a fully distributed MAC scheduler and forms a basis of Wi-Fi, given agents' weight values, it is possible to implement Top(k) and Softmax(k). Our goal is to train agents so that every time each agent takes an action, only Ksched agents can broadcast their messages with limited size Lband with the goal of receiving the highest cumulative reward via cooperation. Each agent should determine a policy described by its scheduling weights, encoded communication messages, and actions. Page 12, Algorithm 1 [algorithm listing image omitted]) (i.e., the method trains agents in a multi-agent reinforcement learning method that stores the agents' experiences via communication devices);

[distributing,] by the one or more communications interfaces of one of the multiple RL agents, [the updated global policy to the multiple RL agents, wherein each of the multiple RL agents behave based at least in part on the updated global policy in the next episode], wherein fairness among the multiple RL agents operating in the shared computing environment is achieved through the updated global policy distributed to the multiple RL agents without inter-agent communication (Kim Page 5, 1 Introduction, Para. 4, Contributions: In this paper, we propose a new deep multi-agent reinforcement learning architecture called SchedNet, with the rationale of centralized training and distributed execution in order to achieve a common goal better via decentralized cooperation. During distributed execution, agents are allowed to communicate over wireless channels where messages are broadcast to all agents in each agent's communication range. This broadcasting feature of wireless communication necessitates a Medium Access Control (MAC) protocol to arbitrate contending communicators in a shared medium. CSMA (Carrier Sense Multiple Access) in Wi-Fi is one such MAC protocol [communication interface]. Page 12 [image omitted]. Page 17, D.1 Carrier Sense Multiple Access (CSMA): Depending on how to choose the backoff and holding times, there can be many variants of CSMA that work for various purposes such as fairness and throughput. Two examples of these, Top(k) and Softmax(k), are introduced in the following sections. D.2 A version of distributed Top(k): In this subsection, we introduce a simple distributed scheduling algorithm, called Distributed Top(k), which can work with SchedNet-Top(k). It is based on CSMA, where each sender determines backoff and holding times as follows. In SchedNet, each agent generates the scheduling weight w based on its own observation. The agent sets its backoff time as 1 - w, where w is its schedule weight, and it waits for the backoff time before it tries to broadcast its message. Once it successfully broadcasts the message, it immediately releases the channel. Thus, the agent with the highest w can grab the channel in a decentralized manner without any message passing. By repeating this k times, we can realize decentralized Top(k) scheduling [wherein fairness among the multiple RL agents operating in the shared computing environment is achieved through the updated global policy distributed to the multiple RL agents without inter-agent communication]) (Examiner Note: The policy is updated by updating the critic, which updates the actor, and this is done without inter-agent communication).

Ma and Kim are both considered to be analogous to the claimed invention because they are in the same field of invention of multi-agent reinforcement learning. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Ma to incorporate the teachings of Kim and disclose communicating between agents in a fair method. Doing so would ensure that agents are able to make sequential decisions that accelerate coordination in a distributed manner and behave as a group (Kim Abstract: Many real-world reinforcement learning tasks require multiple agents to make sequential decisions under the agents' interaction, where well-coordinated actions among the agents are crucial to achieve the target goal better at these tasks. One way to accelerate the coordination effect is to enable multiple agents to communicate with each other in a distributed manner and behave as a group. In this paper, we study a practical scenario when (i) the communication bandwidth is limited and (ii) the agents share the communication medium so that only a restricted number of agents are able to simultaneously use the medium, as in the state-of-the-art wireless networking standards. This calls for a certain form of communication scheduling. In that regard, we propose a multi-agent deep reinforcement learning framework, called SchedNet, in which agents learn how to schedule themselves, how to encode the messages, and how to select actions based on received messages. SchedNet is capable of deciding which agents should be entitled to broadcasting their (encoded) messages, by learning the importance of each agent's partially observed information. We evaluate SchedNet against multiple baselines under two different applications, namely, cooperative communication and navigation, and predator-prey. Our experiments show a non-negligible performance gap between SchedNet and other mechanisms such as the ones without communication and with vanilla scheduling methods, e.g., round robin, ranging from 32% to 43%.).

Regarding claim 5 and analogous claim 16, Ma in view of Kim disclose all the elements of the method of claim 1 and analogous claim 12. Ma further discloses wherein the shared experience buffer is a prioritized experience replay buffer (Ma Page 8703, VI. Distributed Heuristic Learning with Communication, C. Multi-Agent Distributed Prioritized Experience Replay, Para. 3, Lines 6-7: The transitions from all actors are fed into a shared prioritized replay buffer [a prioritized experience replay buffer]).

Regarding claim 7 and analogous claim 18, Ma in view of Kim disclose all the elements of the method of claim 5 and analogous claim 16.
Ma further discloses wherein the multiple RL agents are trained using a multi-staged process, the multi-staged process progressing in difficulty from stage to stage, each stage adding a challenging characteristic associated with the shared computing environment (Ma Page 8703, VII. Experiments, Para. 1: Learning directly from a large size environment with lots of agents [the multiple RL agents] is hard. Instead, we use a curriculum learning method by gradually introducing more difficult tasks [multi-staged process progressing in difficulty] to agents [20]. Starting from an easy task with only one agent in a 10x10 environment, we establish two new challenging tasks for agents by increasing the agent amount by one or increasing the environment size by five [new challenging characteristic associated with the shared environment], if the success rate of the current task exceeds 0.9. As the training scale grows, the final task with twelve agents in a 40x40 environment is reached.).

Claim(s) 3 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Ma in view of Kim and further in view of Budden et al. (US 2020/0265305 A1) ("Budden").

Regarding claim 3 and analogous claim 14, Ma in view of Kim disclose all the elements of the method of claim 1 and analogous claim 12. Ma and Kim are combined in the same rationale as set forth above with respect to claim 1 and analogous claim 12. Ma does not explicitly disclose wherein each [experience tuple] includes a state of [the shared computing environment] at the beginning of each step, an action taken during each step, a state of the shared computing environment at the end of each step and a reward obtained at the end of each step.

However, Budden discloses wherein each [experience tuple] includes a state of [the shared computing environment] at the beginning of each step, an action taken during each step, a state of the shared computing environment at the end of each step and a reward obtained at the end of each step (Budden Para. 0046: The actor computing unit 112 is configured to receive an observation characterizing a current state of the environment instance 132 and to select an action to be performed by the agent 102 using the action selection neural network replica 118 and in accordance with current values of the network parameters. The actor computing unit 112 then obtains transition data characterizing the environment instance subsequent to the agent 102 performing the selected action. For example, the transition data includes the next observation characterizing the next state of the environment instance 132, i.e., the state that the environment instance 132 transitioned into as a result of the agent performing the selected action. The transition data also includes a reward, which is a numeric value that is received from the environment as a result of the agent 102 performing the selected action. The actor computing unit 112 then generates an experience tuple from the observation, the selected action, and the transition data. An experience tuple, which can be denoted as (S_t, A_t, r_{t+i+1}, S_{t+1}), includes the current observation S_t [a state of the shared environment], the selected action A_t [an action taken during each step], the reward r_{t+i+1} [a reward obtained at the end of each step], and the next observation S_{t+1} [a state of the shared environment at the end of each step] that characterizes the next state of the environment after the selected action was performed.).

Ma and Budden are considered to be analogous to the claimed invention because they are in the same field of invention of multi-agent reinforcement learning methods. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Ma to incorporate the teachings of Budden and disclose an experience tuple with the state of the shared environment, the action taken, the reward, and the end state. Doing so would allow the distributed architecture to work in different systems without limiting performance (Budden Para. 0009, Lines 8-24: In addition, by sharing experiences instead of sharing gradients as in some conventional schemes, the system can batch all communications with the centralized replay memory (for example, the system can select batches comprising multiple experience tuples from the replay memory, and use the batch to update the neural network), thereby increasing the efficiency and throughput of the system. With this approach, it is possible for actors and learners to run in different data-centers without limiting performance.).

Claim(s) 4 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Ma in view of Kim and further in view of Sartoretti et al., PRIMAL: Pathfinding via Reinforcement and Imitation Multi-Agent Learning (2019) ("Sartoretti").

Regarding claim 4 and analogous claim 15, Ma in view of Kim disclose all the elements of the method of claim 1 and analogous claim 12. Ma and Kim are combined in the same rationale as set forth above with respect to claim 1 and analogous claim 12. Ma further teaches wherein the global policy is updated in a form of gradient descent [such that an adjustment is made to one or more parameters of the global policy based at least in part on gradients of a loss function with respect to each of the one or more parameters] (Ma Page 8703, VI. Distributed Heuristic Learning with Communication, C. Multi-Agent Distributed Prioritized Experience Replay, Para. 4: The final loss function is a multi-step TD error [loss equation image omitted] [the global policy is updated in a form of gradient descent] with R_t = r_t + γ·r_{t+1} + … + γ^n·Q(s_{t+n}, a_{t+n}; θ⁻), where R_t is the total return of the agent we care about, s_t and a_t are the state and action of that agent, and θ⁻ denotes the parameters of the target network, a periodical copy of the online parameters θ.).
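The multi-step return in Ma's TD target reduces to a short computation. The function below is an illustrative sketch (the name and example numbers are made up, not Ma's code); the TD loss would then penalize the gap between this target and the online estimate Q(s_t, a_t; θ):

```python
def n_step_return(rewards, gamma, bootstrap_q):
    """Multi-step TD target:
    R_t = r_t + gamma*r_{t+1} + ... + gamma^n * Q(s_{t+n}, a_{t+n}; theta^-),
    where bootstrap_q stands in for the target-network value Q(...; theta^-)."""
    target = 0.0
    for k, r in enumerate(rewards):              # r_t ... r_{t+n-1}
        target += (gamma ** k) * r
    target += (gamma ** len(rewards)) * bootstrap_q  # gamma^n * Q(...)
    return target

# Example: a 3-step return with gamma = 0.9 and a bootstrap value of 1.0.
R = n_step_return([0.5, 0.0, -0.1], gamma=0.9, bootstrap_q=1.0)
```

With an empty reward list the function degenerates to the pure bootstrap value, i.e. the 0-step case.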
Ma however does not explicitly teach [wherein the global policy is updated in a form of gradient descent] such that an adjustment is made to one or more parameters of the global policy based at least in part on gradients of a loss function with respect to each of the one or more parameters] However Sartoretti [wherein the global policy is updated in a form of gradient descent] such that an adjustment is made to one or more parameters of the global policy based at least in part on gradients of a loss function with respect to each of the one or more parameters (To update the policy, we use an approximation of the advantage function by bootstrapping using the value function: PNG media_image6.png 27 436 media_image6.png Greyscale (where k is bounded by the batch size T). We also add an entropy term H( π (o)) to the policy loss, which has been shown to encourage exploration and discourage premature convergence [34] by penalizing a policy that always chooses the same actions. The policy loss reads PNG media_image7.png 62 473 media_image7.png Greyscale with a small entropy weight σ H ( σ H = 0.01 in practice). We rely on two additional loss functions which help to guide and stabilize training. First, the blocking prediction output is updated by minimizing Lblocking, the log likelihood of predicting incorrectly. Second, we define the loss function Lvalid to minimize the log likelihood of selecting an invalid move [11] [such that an adjustment is made to one or more parameters of the global policy based at least in part on gradients of a loss function with respect to each of the one or more parameters], as mentioned in Section III-B.). Ma and Sartoretti are both considered to be analogous to the claimed invention because they are in the same field invention of using multi-agent reinforcement learning. 
Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Ma to incorporate the teachings of Sartoretti and disclose adjusting one or more parameters of the global policy based on the loss function. Doing so to ensure that agents are able to make moves that benefit not just themselves but others. (Sartoretti I. Introduction Para. 2 Line 6-15, In this framework, agents learn to take into account the consequences of their position on other agents, in order to favor movements that will benefit the whole team and not only themselves. That is, by simultaneously learning to plan efficient single-agent paths (mostly via RL), and to imitate a centralized expert (IL), agents ultimately learn a decentralized policy where they still exhibit implicit coordination during online path planning without the need for explicit communication among agents.). Claim(s) 6 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Ma in view of Kim and further in view of Schaul et al, Prioritized Experience Replay, (2016) (“Schaul”). Regarding claim 6 analogous 17, Ma in view of Kim disclose all the elements of the method of claim 5 and analogous 16. Ma and Kim are combined in the same rational as set forth above with respect to claim 1 and analogous claim 12. Ma further discloses wherein the sampled experience tuples are drawn from the shared experience replay buffer [based on unexpectedness] of each experience tuple stored in the shared experience buffer (Ma Page 8703 VI. Distributed Heuristic Learning with Communication C. Multi-Agent Distributed Prioritized Experience Para. 3 Line 6-13, The transitions from all actors are fed into a shared prioritized replay buffer [each experience tuple stored in the shared experience buffer]. Then learner samples the most useful experiences [experience tuples] from the buffer [shared experience replay buffer]). 
Then learner samples the most useful experiences from the buffer and updates the network and priorities of the experience. Note that although the model is trained for a single agent, the transitions of all the agents need to be stored for communication purpose, and the priorities are initialized and updated from that agent’s perspective). Ma does not explicitly disclose [wherein the sampled experience tuples are drawn from the shared experience replay buffer] based on unexpectedness [of each experience tuple stored in the shared experience buffer] based on unexpectedness of each [experience tuple stored in the shared experience buffer], the unexpectedness of each experience tuple determined based on the global policy prior to updating. However Schaul discloses [wherein the sampled experience tuples are drawn from the shared experience replay buffer] based on unexpectedness of each [experience tuple stored in the shared experience buffer], the unexpectedness of each experience tuple determined based on the global policy prior to updating (Schaul et al. Page 4 Para 1 to 2, page 5 Algorithm 1, However, greedy TD-error prioritization has several issues. First, to avoid expensive sweeps over the entire replay memory, TD errors are only updated for the transitions (i.e. experience tuple) that are replayed. One consequence is that transitions that have a low TD error on first visit may not be replayed for a long time (which means effectively never with a sliding window replay memory). Further, it is sensitive to noise spikes (e.g. when rewards are stochastic), which can be exacerbated by bootstrapping, where approximation errors appear as another source of noise. Finally, greedy prioritization focuses on a small subset of the experience: errors shrink slowly, especially when using function approximation, meaning that the initially high error transitions get replayed frequently. This lack of diversity that makes the system prone to over-fitting. 
To overcome these issues, we introduce a stochastic sampling method [unexpectedness] that interpolates between pure greedy prioritization and uniform random sampling. We ensure that the probability of being sampled is monotonic in a transition’s priority, while guaranteeing a non-zero probability even for the lowest-priority transition. Concretely, we define the probability of sampling transition i as P(i) = p_i^α / Σ_k p_k^α, where p_i > 0 is the priority of transition i. The exponent α determines how much prioritization is used, with α = 0 corresponding to the uniform case. [Schaul Page 5, Algorithm 1 pseudocode] [the unexpectedness of each experience tuple determined based on the global policy prior to updating]). Ma and Schaul are considered to be analogous to the claimed invention because they are in the same field of invention of reinforcement learning methods. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Ma to incorporate the teachings of Schaul and disclose drawing experience tuples based on a stochastic process. Doing so increases performance and avoids expensive sweeps over the entire replay memory. (Schaul Page 3 3.3 Stochastic Prioritization, However, greedy TD-error prioritization has several issues. First, to avoid expensive sweeps over the entire replay memory, TD errors are only updated for the transitions that are replayed.) Claim(s) 8 are rejected under 35 U.S.C. 103 as being unpatentable over Ma in view of Kim and further in view of Matiisen et al., Teacher-Student Curriculum Learning (2017) (“Matiisen”). Regarding claim 8, Ma in view of Kim disclose all the elements of the method of claim 7 and analogous claim 18. Ma and Kim are combined under the same rationale as set forth above with respect to claim 1 and analogous claim 12.
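As an illustrative aside, the stochastic prioritization quoted from Schaul above can be sketched in a few lines of Python. This is a hypothetical sketch of the sampling rule P(i) = p_i^α / Σ_k p_k^α, not code from any reference of record; the function name and priority values are invented for illustration.

```python
import random

def sample_transition(priorities, alpha=0.6):
    """Draw a transition index with probability P(i) = p_i^alpha / sum_k p_k^alpha.

    alpha = 0 reduces to uniform sampling; larger alpha approaches greedy
    TD-error prioritization while still giving every positive-priority
    transition a non-zero chance of being replayed.
    """
    weights = [p ** alpha for p in priorities]
    total = sum(weights)
    r = random.uniform(0.0, total)
    cumulative = 0.0
    for i, w in enumerate(weights):
        cumulative += w
        if r <= cumulative:
            return i
    return len(weights) - 1  # guard against floating-point round-off
```

With alpha = 0 every transition is equally likely; as alpha grows, high-priority transitions dominate, matching the interpolation between uniform and greedy sampling described in the quoted passage.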
Ma does not explicitly teach wherein [the shared experience replay buffer] is divided into multiple segments, each segment associated with each stage of [the multi-staged process]. However Matiisen discloses wherein [the shared experience replay buffer] is divided into multiple segments, each segment associated with each stage of [the multi-staged process] (Matiisen Page 3-5 3 Algorithms Para 2, Figure 2 is a demonstration of the ideal training progress in a curriculum learning setting (i.e. multi-stage process): 1. At first, the Teacher has no knowledge so it samples from all tasks uniformly. 2. When the Student starts making progress on task 1, the Teacher allocates more probability mass to this task. 3. When the Student masters task 1, its learning curve flattens and the Teacher samples the task less often. At this point Student also starts making progress on task 2, so the Teacher samples more from task 2. 4. This continues until the Student masters all tasks. As all task learning curves flatten in the end, the Teacher returns to uniform sampling of the tasks. 3.4 Sampling algorithm The previous algorithms require tuning of hyperparameters to balance exploration. To get rid of exploration hyperparameters, we take inspiration from Thompson sampling. The Sampling algorithm keeps a buffer of last K rewards for each task [divided into multiple segments]. To choose the next task, a recent reward is sampled from each task’s K-last-rewards buffer [each segment associated with each stage of the multi-staged process]. Then whichever task yielded the highest sampled reward is chosen. This makes exploration a natural part of the algorithm: tasks that have recently had high rewards are sampled more often.). Ma and Matiisen are considered to be analogous to the claimed invention because they are in the same field of invention of reinforcement learning methods.
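For illustration only, the “Sampling algorithm” quoted from Matiisen above — keep a buffer of the last K rewards per task, draw one recent reward from each buffer, and pick the task with the highest draw — might be sketched as follows. The task names and reward values are invented, not taken from the record.

```python
import random
from collections import deque

def choose_task(reward_buffers):
    """Thompson-style task selection: sample one recent reward from each
    task's K-last-rewards buffer and return the task with the highest draw.
    Tasks with recently high rewards are naturally sampled more often."""
    draws = {task: random.choice(list(buf)) for task, buf in reward_buffers.items()}
    return max(draws, key=draws.get)

# one bounded buffer (deque with maxlen = K) per curriculum task
buffers = {
    "task1": deque([0.9, 0.8, 0.9], maxlen=3),  # learning curve still rising
    "task2": deque([0.1, 0.2, 0.3], maxlen=3),  # little recent progress
}
```

Here every recent reward for task1 exceeds every recent reward for task2, so choose_task always returns "task1"; as task2's recent rewards climb, it starts winning draws, which is the built-in exploration the quoted passage describes.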
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Ma to incorporate the teachings of Matiisen and disclose dividing the buffer into multiple sections that contain previous-stage experiences. Doing so allows the agents to learn from simpler tasks when performing harder tasks. (Matiisen 1 Introduction Para 1 line 7-10, where tasks are ordered by increasing difficulty and training only proceeds to harder tasks once easier ones are mastered. Curriculum learning helps when after mastering a simpler task the policy for a harder task is discoverable through random exploration). Claim(s) 9 are rejected under 35 U.S.C. 103 as being unpatentable over Ma in view of Kim and further in view of Li et al., "TCP-NeuRoc: Neural Adaptive TCP Congestion Control With Online Changepoint Detection," in IEEE Journal on Selected Areas in Communications, vol. 39, no. 8, pp. 2461-2475, Aug. 2021 (“Li”). Regarding claim 9, Ma in view of Kim disclose all the elements of the method of claim 7 and analogous claim 17. Ma and Kim are combined under the same rationale as set forth above with respect to claim 1 and analogous claim 12. Ma does not explicitly teach wherein the multi-staged process [includes a bootstrapping stage, an advancing stage, a fairness training stage]. However Li teaches wherein [the multi-staged process] includes a bootstrapping stage, an advancing stage, a fairness training stage. (Li Page 2468 C. Cold-Started Deployment Strategy The existing learning-based congestion control methods need to collect large amount data from the interaction between agent and network environment to train the model, which may cost several hours [16], [32] to several days [2], [38], hindering the model from being really deployed. In our implementation, we propose a deployment strategy to achieve cold-start (i.e. the multi-staged process).
1) Initialization [a bootstrapping stage]: the neural networks A and Q are initialized by random parameters, and we set the default action for congestion window adjustment by cwnd = at ∗ Thrputmax ∗ RTTmin with at randomly drawn from {0.75, 1, 1.25}, which is similar to the window adjustment policy of BBR [3]. 2) Exploration: in the beginning [an advancing stage], whenever an agent is triggered, with probability ρ it takes a default action (to mimic the behavior of BBR), and with probability 1 − ρ it takes the action return by the unoptimized policy network A (for exploration). We set ρ = 0.95 in our system. The transition experiences are stored in the replay buffer. 3) Training [a fairness training stage]: while the replay buffer is keeping filled by the observations, the agent is trained at runtime by a background daemon process using algorithm 2. 4) Execution: after training to convergence, we set the probability ρ = 0, so that the agent takes all actions generated by the policy network. The model parameters are periodically updated during execution.) (i.e. each segment associated with each stage of the multi-staged process). Ma and Li are considered to be analogous to the claimed invention because they are in the same field of invention of reinforcement learning methods. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Ma to incorporate the teachings of Li and disclose a multi-stage process that progressively allows agent training starting from an initial stage. Doing so allows the agents to collect the large amounts of data needed to train the model, which would normally take days or longer (Li, C.
Cold-Started Deployment Strategy, The existing learning-based congestion control methods need to collect large amount data from the interaction between agent and network environment to train the model, which may cost several hours [16], [32] to several days [2], [38], hindering the model from being really deployed. In our implementation, we propose a deployment strategy to achieve cold-start.). Claim(s) 10, 19, 20 are rejected under 35 U.S.C. 103 as being unpatentable over Ma in view of Kim and Matiisen, further in view of Li. Regarding claim 10, Ma as modified by Kim and Matiisen disclose all the elements of the method of claim 8. Ma, as modified by Kim and Matiisen, are combinable for the same rationale as set forth above with respect to claim 8. Ma as modified by Kim and Matiisen fail to disclose wherein the multi-staged process further includes an online stage. However Li discloses wherein the multi-staged process further includes an online stage (Li Page 2468-2469, C. Cold-Started Deployment Strategy The existing learning-based congestion control methods need to collect large amount data from the interaction between agent and network environment to train the model, which may cost several hours [16], [32] to several days [2], [38], hindering the model from being really deployed. In our implementation, we propose a deployment strategy to achieve cold-start. 4) Execution [an online stage]: after training to convergence, we set the probability ρ = 0, so that the agent takes all actions generated by the policy network. The model parameters are periodically updated during execution.). With such training strategy, the DRL agent can be deployed in a cold-start way, which performs like BBR at the beginning, and then gradually move to the well-trained policy network. V. Performance Evaluation A. Experiment Setup Para. 3, Network scenarios: We test the CC algorithms based on three network scenarios.
Scenario 1 (dynamic network): represents the situation with varying bandwidth due to dynamic network traffic patterns. The link latency is fixed to 20ms, the loss rate is set to 0.1%, and the available bandwidth switches randomly between 40Mbps and 80Mbps every 5 seconds (i.e. online)). A person of ordinary skill in the art would reasonably find the teaching of Li to be helpful in solving the problem of training deep reinforcement learning networks in Ma. In view of the teachings of Li it would have been obvious for a person of ordinary skill in the art to apply the teachings of Li to Ma before the effective filing date of the claimed invention to disclose a multi-stage process that progressively allows agents to learn in an online environment. Doing so allows the agents to collect the large amounts of data needed to train the model, which would normally take days or longer (Li Page 2468, C. Cold-Started Deployment Strategy, The existing learning-based congestion control methods need to collect large amount data from the interaction between agent and network environment to train the model, which may cost several hours [16], [32] to several days [2], [38], hindering the model from being really deployed. In our implementation, we propose a deployment strategy to achieve cold-start.). Regarding claim 19, Ma and Kim disclose all the elements of the method of claim 18. Ma and Kim are combined under the same rationale as set forth above with respect to claim 7 and analogous claim 18. Ma and Sartoretti do not explicitly teach wherein [the shared experience replay buffer] is divided into multiple segments, each segment associated with each stage of [the multi-staged process], and wherein the multi-staged process includes a bootstrapping stage, an advancing stage, and a fairness training stage.
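As a non-authoritative sketch of the four-stage cold-start deployment quoted from Li above (initialization, exploration with probability ρ, training, execution), the ρ-mixed action selection could look like the following. The stage names and ρ values mirror the quoted passage; the function and action placeholders are invented for illustration.

```python
import random

# rho per stage, per the quoted Li passage: the exploration stage mixes in
# the default (BBR-like) action with probability 0.95; execution sets rho = 0
STAGE_RHO = {"exploration": 0.95, "execution": 0.0}

def select_action(policy_action, default_action, stage):
    """With probability rho take the default action (mimicking BBR);
    otherwise take the action returned by the policy network."""
    rho = STAGE_RHO[stage]
    return default_action if random.random() < rho else policy_action
```

In the execution stage rho = 0, so the agent always follows the policy network, matching step 4 of the quoted deployment strategy.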
However Matiisen discloses wherein [the shared experience replay buffer] is divided into multiple segments, each segment associated with each stage of [the multi-staged process] (Matiisen Page 3-5 3 Algorithms Para 2, Figure 2 is a demonstration of the ideal training progress in a curriculum learning setting (i.e. the multi-staged process): 1. At first, the Teacher has no knowledge so it samples from all tasks uniformly. 2. When the Student starts making progress on task 1, the Teacher allocates more probability mass to this task. 3. When the Student masters task 1, its learning curve flattens and the Teacher samples the task less often. At this point Student also starts making progress on task 2, so the Teacher samples more from task 2. 4. This continues until the Student masters all tasks. As all task learning curves flatten in the end, the Teacher returns to uniform sampling of the tasks. 3.4 Sampling algorithm The previous algorithms require tuning of hyperparameters to balance exploration. To get rid of exploration hyperparameters, we take inspiration from Thompson sampling. The Sampling algorithm keeps a buffer (i.e. the shared experience replay buffer) of last K rewards for each task [divided into multiple segments]. To choose the next task, a recent reward is sampled from each task’s K-last-rewards buffer [each segment associated with each stage of the multi-staged process]. Then whichever task yielded the highest sampled reward is chosen. This makes exploration a natural part of the algorithm: tasks that have recently had high rewards are sampled more often.). Matiisen does not disclose wherein the multi-staged process includes a bootstrapping stage, an advancing stage, and a fairness training stage. However Li discloses wherein [the multi-staged process] includes a bootstrapping stage, an advancing stage, and a fairness training stage (Li Page 2468 C.
Cold-Started Deployment Strategy The existing learning-based congestion control methods need to collect large amount data from the interaction between agent and network environment to train the model, which may cost several hours [16], [32] to several days [2], [38], hindering the model from being really deployed. In our implementation, we propose a deployment strategy to achieve cold-start (i.e. the multi-staged process). 1) Initialization [a bootstrapping stage]: the neural networks A and Q are initialized by random parameters, and we set the default action for congestion window adjustment by cwnd = at ∗ Thrputmax ∗ RTTmin with at randomly drawn from {0.75, 1, 1.25}, which is similar to the window adjustment policy of BBR [3]. 2) Exploration: in the beginning [an advancing stage], whenever an agent is triggered, with probability ρ it takes a default action (to mimic the behavior of BBR), and with probability 1 − ρ it takes the action return by the unoptimized policy network A (for exploration). We set ρ = 0.95 in our system. The transition experiences are stored in the replay buffer. 3) Training [a fairness training stage]: while the replay buffer is keeping filled by the observations, the agent is trained at runtime by a background daemon process using algorithm 2. 4) Execution: after training to convergence, we set the probability ρ = 0, so that the agent takes all actions generated by the policy network. The model parameters are periodically updated during execution.). The motivation utilized in the combination of claim 10 equally applies to claim 19. Regarding claim 20, Ma in view of Kim and Matiisen and further in view of Li disclose all the elements of the method of claim 19. Ma as modified by Kim and Matiisen do not explicitly teach wherein [the multi-staged process] further includes an online stage. However Li further discloses wherein [the multi-staged process] further includes an online stage (Li Page 2468 C.
Cold-Started Deployment Strategy The existing learning-based congestion control methods need to collect large amount data from the interaction between agent and network environment to train the model, which may cost several hours [16], [32] to several days [2], [38], hindering the model from being really deployed. In our implementation, we propose a deployment strategy to achieve cold-start. 4) Execution [an online stage]: after training to convergence, we set the probability ρ = 0, so that the agent takes all actions generated by the policy network. The model parameters are periodically updated during execution.). With such training strategy, the DRL agent can be deployed in a cold-start way, which performs like BBR at the beginning, and then gradually move to the well-trained policy network. Network scenarios: We test the CC algorithms based on three network scenarios. Scenario 1 (dynamic network): represents the situation with varying bandwidth due to dynamic network traffic patterns. The link latency is fixed to 20ms, the loss rate is set to 0.1%, and the available bandwidth switches randomly between 40Mbps and 80Mbps every 5 seconds (i.e. online).). The motivation utilized in the combination of claim 10 equally applies to claim 20. Claim(s) 11 are rejected under 35 U.S.C. 103 as being unpatentable over Ma in view of Kim and further in view of Sartoretti and Schaul. Regarding claim 11, Ma in view of Kim disclose all the elements of the method of claim 1 and analogous claim 12. Ma, Kim, and Sartoretti are combined under the same rationale as set forth above with respect to claim 4 and analogous claim 15. Ma, Kim, and Schaul are combined under the same rationale as set forth above with respect to claim 6 and analogous claim 17. Ma does not explicitly teach wherein the multiple RL agents are deployed online and [the updated global policy] is distributed to the multiple RL agents at arbitrary intervals.
However Sartoretti discloses wherein the multiple RL agents are deployed online (Sartoretti Page 6 V. Results Para. 1 In this section, we present the results of an extensive set of simulations comparing PRIMAL against state-of-the-art MAPF planners in gridworlds. These tests are performed in environments with varying obstacle densities, grid sizes, and team sizes. Finally, we present experimental results for a scenario featuring both physical and simulated robots (i.e. multiple RL agents) planning paths online [are deployed online] in an indoor factory mockup). However Schaul discloses and [the updated global policy] is distributed to the multiple RL agents at arbitrary intervals (Schaul Page 1 1 INTRODUCTION Online reinforcement learning (RL) agents incrementally update their parameters (of the policy (i.e. the updated global policy), value function or model) while they observe a stream of experience. Page 4-5 Implementation: To efficiently sample from distribution (1), the complexity cannot depend on N. For the rank-based variant, we can approximate the cumulative density function with a piecewise linear function with k segments of equal probability. The segment boundaries can be precomputed (they change only when N or α change). At runtime, we sample a segment, and then sample uniformly among the transitions within it. This works particularly well in conjunction with a minibatch-based learning algorithm: choose k to be the size of the minibatch, and sample exactly one transition from each segment – this is a form of stratified sampling that has the added advantage of balancing out the minibatch (there will always be exactly one transition with high magnitude δ , one with medium magnitude, etc). The proportional variant is different, also admits an efficient implementation based on a ‘sum-tree’ data structure (where every node is the sum of its children, with the priorities as the leaf nodes), which can be efficiently updated and sampled from. 
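Purely as an illustration of the stratified, segment-based sampling that Schaul describes above, the following sketch partitions a priority-sorted replay memory into k segments and draws one transition from each. Note this simplifies Schaul's rank-based variant, which uses segments of equal probability mass rather than equal size; the names here are invented for illustration.

```python
import random

def stratified_minibatch(sorted_transitions, k):
    """Draw a minibatch of k transitions from a replay memory sorted by
    priority (highest first): split the memory into k equal-size segments
    and sample one transition uniformly from each, so every minibatch
    mixes high-, medium-, and low-priority transitions."""
    n = len(sorted_transitions)
    batch = []
    for seg in range(k):
        lo = seg * n // k
        hi = (seg + 1) * n // k
        batch.append(sorted_transitions[random.randrange(lo, hi)])
    return batch
```

Choosing k equal to the minibatch size gives exactly one draw per segment, the balancing property the quoted passage attributes to stratified sampling.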
See Appendix B.2.1 for additional details. [Schaul Page 5, Algorithm 1] (i.e. line 16, distributed to the multiple RL agents at arbitrary intervals). The motivation utilized in the combination of claim 6 equally applies to claim 11. Pertinent Prior Art The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Wu et al. (US20210334441A1) teaches a Multi-Agent Reinforcement Learning method operating in a decentralized manner without any communication. Conclusion Any inquiry concerning this communication or earlier communications from the examiner should be directed to ALFREDO CAMPOS whose telephone number is (571)272-4504. The examiner can normally be reached 7:00 - 4:00 pm M - F. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael J. Huntley, can be reached at (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/ALFREDO CAMPOS/ Examiner, Art Unit 2129 /MICHAEL J HUNTLEY/Supervisory Patent Examiner, Art Unit 2129

Prosecution Timeline

Dec 31, 2021
Application Filed
May 12, 2025
Non-Final Rejection — §103
Aug 13, 2025
Response Filed
Sep 03, 2025
Final Rejection — §103
Nov 11, 2025
Response after Non-Final Action
Dec 03, 2025
Request for Continued Examination
Dec 10, 2025
Response after Non-Final Action
Jan 06, 2026
Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12561407
ONE-PASS APPROACH TO AUTOMATED TIMESERIES FORECASTING
2y 5m to grant Granted Feb 24, 2026
Patent 12561559
Neural Network Training Method and Apparatus, Electronic Device, Medium and Program Product
2y 5m to grant Granted Feb 24, 2026
Patent 12554973
HIERARCHICAL DATA LABELING FOR MACHINE LEARNING USING SEMI-SUPERVISED MULTI-LEVEL LABELING FRAMEWORK
2y 5m to grant Granted Feb 17, 2026
Patent 12536260
SYSTEM, APPARATUS, AND METHOD FOR AUTOMATICALLY GENERATING NEGATIVE KEYSTROKE EXAMPLES AND TRAINING USER IDENTIFICATION MODELS BASED ON KEYSTROKE DYNAMICS
2y 5m to grant Granted Jan 27, 2026
Based on 4 most recent grants.

Prosecution Projections

3-4
Expected OA Rounds
83%
Grant Probability
99%
With Interview (+33.3%)
3y 9m
Median Time to Grant
High
PTA Risk
Based on 6 resolved cases by this examiner. Grant probability derived from career allow rate.
