Prosecution Insights
Last updated: April 19, 2026
Application No. 17/688,538

REMOTE AGENT IMPLEMENTATION OF REINFORCEMENT LEARNING POLICIES

Final Rejection §103
Filed: Mar 07, 2022
Examiner: CHUANG, SU-TING
Art Unit: 2146
Tech Center: 2100 — Computer Architecture & Software
Assignee: Microsoft Technology Licensing, LLC
OA Round: 2 (Final)

Grant Probability: 52% (Moderate)
OA Rounds: 3-4
To Grant: 4y 5m
With Interview: 91%

Examiner Intelligence

Career Allow Rate: 52% (52 granted / 101 resolved; -3.5% vs TC avg)
Interview Lift: +39.7% (strong) among resolved cases with an interview
Avg Prosecution: 4y 5m typical timeline; 28 applications currently pending
Career History: 129 total applications across all art units

Statute-Specific Performance

§101: 27.4% (-12.6% vs TC avg)
§103: 46.3% (+6.3% vs TC avg)
§102: 10.8% (-29.2% vs TC avg)
§112: 11.7% (-28.3% vs TC avg)

Tech Center averages are estimates • Based on career data from 101 resolved cases

Office Action

§103
DETAILED ACTION

This action is in response to the communications filed on 08/04/2025 in which claims 1-4, 10 and 19-20 are amended and claims 22-23 are added; therefore claims 1-23 are pending.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1-21 are rejected under 35 U.S.C.
103 as being unpatentable over Nair ("Massively Parallel Methods for Deep Reinforcement Learning" 20150716) in view of Python.Org ("asyncio - Asynchronous I/O >> Streams" 20211128).

In regard to claims 1 and 19, Nair teaches: A method comprising: performing two or more training iterations that involve updating a policy, individual training iterations comprising: (Nair, p. 5 "Algorithm 1 Distributed DQN Algorithm... for episode = 1 to M do [performing two or more training iterations]... for t = 1 to T do [individual training iterations]... "; p. 1 "We present the first massively distributed architecture for deep reinforcement learning… a distributed neural network to represent the value function or behaviour policy..."; p. 2 "A conceptually distinct set of distributed learners reads samples of stored experience from the experience replay memory, and updates the value function or policy according to a given RL algorithm. Specifically, we focus in this paper on a variant of the DQN algorithm, which applies ASGD updates to the parameters of the Q-network. [update a policy]"; the Q-network represents the policy, therefore updating the parameters of the Q-network is updating the policy)

[Image: media_image1.png] [Image: media_image2.png]

by a training process executing on a training computing device, obtaining experiences representing reactions of an environment to actions taken by a plurality of remote agent processes according to the policy, (Nair, p. 1 "the DistBelief framework (Dean et al., 2012) distributes the neural network parameters across many machines, [learners, executing on a training computing device (server and all learners/workers)] and parallelizes the training [by a training process] by using asynchronous stochastic gradient descent (ASGD)."; p.
1 "We present the first massively distributed architecture for deep reinforcement learning… parallel actors [a plurality of remote agent processes] that generate new behaviour; parallel learners [by a training process executing on a training computing device] that are trained from stored experience;..."; p. 5 "Algorithm 1 Distributed DQN Algorithm... Execute the action in the environment and observe the reward rt and the next state st+1 [reactions of an environment to actions]. Store (st, at, rt, st+1) in D. [obtaining experiences]"; p. 2 "In addition to generating more data, distributed actors can explore the state space more effectively, as each actor behaves according to a slightly different policy."; p. 4 "Each actor contains a replica of the Q-network, which is used to determine behavior, for example using an ε-greedy policy. [actions taken by a plurality of remote agent processes (actors) according to the policy]"; learners obtain experiences generated by the actors)

wherein the remote agent processes execute the policy on remote agent computing devices and the experiences are obtained from the remote agent computing devices over a network; (Nair, p. 1 "We present the first massively distributed architecture for deep reinforcement learning… parallel actors [the remote agent processes/remote agent computing devices] that generate new behaviour"; p. 1 "In order to generate more data, we deploy multiple agents running in parallel that interact with multiple instances of the same environment. Each such actor can store its own record of past experience, effectively providing a distributed experience replay memory with vastly increased capacity compared to a single machine implementation. Alternatively this experience can be explicitly aggregated into a distributed database [obtained from the remote agent computing devices (actors) over a network]... each actor behaves according to a slightly different policy."; p.
4 "Each actor contains a replica of the Q-network, which is used to determine behavior, for example using an ε-greedy policy. [the remote agent processes (actors) execute the policy]"; parallel actors/agents in a distributed architecture are [remote agent computing devices], and the actors send experiences to a global distributed replay memory)

by the training process, [using reinforcement learning,] updating the policy based on the reactions of the environment to obtain an updated policy; and (Nair, p. 1 "the DistBelief framework (Dean et al., 2012) distributes the neural network parameters across many machines, and parallelizes the training [learners, by the training process] by using asynchronous stochastic gradient descent (ASGD)."; p. 4 "In a single experiment, several learner processes exist and they continuously send the gradients to parameter server and receive updated parameters"; p. 4 "Learners. Gorila contains N_learn learner processes. Each learner contains a replica of the Q-network and its job is to compute desired changes to the parameters of the Q network. [updating the policy, obtain an updated policy] For each learner update k, a minibatch of experience tuple e = (s, a, r, s') [based on the reactions of the environment] is sampled from either a local or global experience replay memory D (see above). The learner applies an off-policy RL algorithm [using RL] such as DQN (Mnih et al., 2013) to this minibatch of experience, in order to generate a gradient vector gi."; learners do the training to update parameters of the Q-network, i.e. updating the policy)

by the training process, distributing the updated policy over the network to the plurality of remote agent processes… (Nair, p. 5 "The only communication between bundles is via parameters: the learners communicate their gradients to the parameter server; and the Q-networks in the actors and learners are periodically synchronized to the parameter server.
[sync: distributing the updated policy over the network to the plurality of remote agent processes]")

Nair does not teach, but Python.Org teaches: using temporary network connections that are closed after the updated policy is distributed. (Python.Org, p. 1 "reader... = await asyncio.open_connection('127.0.0.1', 8888) [using temporary network connections]... data = await reader.read(100) print(f'Received: {data.decode()!r}') [after the updated data/policy is received/distributed]... print('Close the connection') writer.close() await writer.wait_closed() [close the temporary connection]"; in light of spec. [0040] "the implementation shown in FIG. 3 does not necessarily require synchronous communication between the agent 102 and the trainer 302... the agent can open a temporary network connection to retrieve the current policy from the policy queue, close the network connection, and implement the current policy for a period of time to collect a group of experiences (e.g., a training batch)."; temporary network connections are asynchronous communications)

It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Nair to incorporate the teachings of Python.Org by including Python Streams - async/await-ready primitives. Doing so would allow sending and receiving data without using callbacks or low-level protocols and transports. (Python.Org, p. 1 "Streams are high-level async/await-ready primitives to work with network connections. Streams allow sending and receiving data without using callbacks or low-level protocols and transports.")

Claim 19 recites substantially the same limitation as claim 1; therefore the rejection applied to claim 1 also applies to claim 19.
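As context for the mapping above, the "temporary network connection" pattern that the rejection reads onto Python.Org's streams example can be sketched as below. This is an illustrative sketch only; the port, payload, and helper names are made up, and this is not code from the application or from the cited references:

```python
import asyncio

POLICY_BYTES = b"policy-v2"  # made-up stand-in for serialized policy parameters

async def serve_policy(reader, writer):
    # Trainer side: send the current policy, then close the connection.
    writer.write(POLICY_BYTES)
    await writer.drain()
    writer.close()
    await writer.wait_closed()

async def fetch_policy(host, port):
    # Agent side: open a temporary connection, read the policy, close it.
    reader, writer = await asyncio.open_connection(host, port)
    data = await reader.read(-1)  # read until the server closes the stream
    writer.close()
    await writer.wait_closed()  # connection closed after the policy arrives
    return data

async def main():
    server = await asyncio.start_server(serve_policy, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]  # OS-assigned free port
    policy = await fetch_policy("127.0.0.1", port)
    server.close()
    await server.wait_closed()
    return policy

if __name__ == "__main__":
    assert asyncio.run(main()) == POLICY_BYTES
```

The connection exists only for the duration of one policy transfer, which is the behavior the rejection attributes to the combination of Nair's periodic synchronization with Python.Org's open/close stream calls.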
In addition, Nair teaches: A system comprising: a training computing device comprising: a processor; and a storage medium storing instructions which, when executed by the processor, cause the training computing device to execute a training process configured to: (Nair, p. 2 "As in DistBelief, the parameters of the Q-network may also be distributed over many machines."; p. 2 "DistBelief (Dean et al., 2012) is a distributed system for training large neural networks on massive amounts of data efficiently by using two types of parallelism. Model parallelism, where different machines are responsible for storing and training different parts of the model..."; DistBelief distributed systems inherently teach machines with CPU cores and memory [a training computing device (server and all learners/workers), a processor, a storage medium, etc.])

In regard to claim 2, Nair teaches: wherein the experiences are obtained by the training process from an experience data store populated with the experiences by the plurality of remote agent processes (Nair, p. 4 "Experience replay memory. [an experience data store] The experience tuples eti = (sti, ati, rti, sti+1) generated by the actors are stored in a replay memory D.")

Nair does not teach, but Python.Org teaches: over other temporary network connections that are closed after the experiences are transmitted to the experience data store. (Python.Org, p. 1 "... writer = await asyncio.open_connection('127.0.0.1', 8888) [other temporary network connections]... print(f'Send: {message!r}') writer.write(message.encode()) await writer.drain() [after the updated data/experiences are sent out]... print('Close the connection') writer.close() await writer.wait_closed() [close the other temporary connection]"); in light of spec. [0040] "the implementation shown in FIG. 3 does not necessarily require synchronous communication between the agent 102 and the trainer 302...
Once the group of experiences has been collected, the agent can open another connection to the experience data store 306, publish the batch of experiences, and close the connection.") In regard to claim 3, Nair teaches: wherein distributing the updated policy comprises sending the updated policy to a policy data store accessible to the plurality of remote agent processes via the temporary network connections. (Nair, p. 4 "Parameter server. Like DistBelief, the Gorila architecture uses a central parameter server [a policy data store] to maintain a distributed representation of the Q-network Q(s,a;θ+)... The parameter server receives gradients from the learners, and applies these gradients to modify the parameter vector θ+, using an asynchronous stochastic gradient descent algorithm."; DistBelief uses an asynchronous stochastic gradient descent procedure, i.e. all the communications (of gradients or weight updates) between workers and parameter server are via asynchronous network connections [via the temporary network connections]) In regard to claim 4, Nair teaches: wherein the experience data store and the policy data store comprise one or more of a shared network folder, a persistent cloud queue, or a memory location on the training computing device, (Nair, p. 4 "Experience replay memory. [the experience data store] The experience tuples eti = (sti, ati, rti, sti+1) generated by the actors are stored in a replay memory D… a global replay memory aggregates the experience into a distributed database. "; p. 2 "The parameter server [the policy data store] can be sharded across many machines and different shards apply gradients independently of other shards."; a distributed replay memory can be considered as [one or more of a shared network folder, a persistent cloud queue, or a memory location]. A common implementation of a replay memory is a First-In, First-Out (FIFO) queue. 
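As an aside, the FIFO replay memory that the examiner's note describes can be sketched with a bounded deque. This is an illustrative sketch only; the class and method names are made up, and this is not code from Nair:

```python
import random
from collections import deque

class ReplayMemory:
    """Hypothetical FIFO replay memory D of (s_t, a_t, r_t, s_{t+1}) tuples."""

    def __init__(self, capacity):
        # A bounded deque: once full, the oldest experience drops out first.
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform random mini-batch, as in "Sample random mini-batch from D".
        return random.sample(self.buffer, batch_size)

memory = ReplayMemory(capacity=3)
for t in range(5):
    memory.store(t, t % 2, 1.0, t + 1)
# Capacity is 3, so only the experiences for t = 2, 3, 4 remain.
```

The `maxlen` bound gives the first-in, first-out eviction the note refers to without any explicit bookkeeping.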
Further, shards (separate databases) across machines in a parameter server can be considered as [one or more of a shared network folder or a memory location]) the experience data store and the policy data store being accessible to the remote agent computing devices. (Nair, p. 1 "In order to generate more data, we deploy multiple agents running in parallel that interact with multiple instances of the same environment. Each such actor can store its own record of past experience, effectively providing a distributed experience replay memory [the experience data store being accessible to the remote agent computing devices.] with vastly increased capacity compared to a single machine implementation. Alternatively this experience can be explicitly aggregated into a distributed database"; p. 5 "the Q-networks in the actors and learners are periodically synchronized to the parameter server. [the policy data store being accessible to the remote agent computing devices.]"; see Fig. 2 Actor can sync with Parameter Server and can store experiences to Replay Memory)

In regard to claim 5, Nair teaches: further comprising: completing training of the policy responsive to reaching a stopping condition. (Nair, p. 5 "Algorithm 1 Distributed DQN Algorithm... for episode = 1 to M do ... for t = 1 to T do... Every global N steps [M, T or N: reaching a stopping condition] sync θ- with parameters θ+ from the parameter server.")

In regard to claim 6, Nair teaches: further comprising: responsive to completion of the training, providing a final policy to the plurality of remote agent computing devices. (Nair, p. 5 "Algorithm 1 Distributed DQN Algorithm... for episode = 1 to M do ... for t = 1 to T do... Every global N steps [M, T or N: completion of the training] sync θ- with parameters θ+ from the parameter server.
[sync: (learners) providing a final policy to actors]")

In regard to claim 7, Nair teaches: wherein individual experiences obtained from the remote agent processes include rewards for corresponding actions taken by the remote agent processes in the environment, the rewards being determined according to a reward function. (Nair, p. 5 "Algorithm 1 Distributed DQN Algorithm... Execute the action in the environment and observe the reward rt and the next state st+1. Store (st, at, rt, st+1) in D."; p. 3 "The goal of an RL agent is to maximize its expected total reward, where the rewards are discounted by a factor γ... per time-step. Specifically, the return [the rewards] at time t is Rt = Σ_{t'=t}^{T} γ^{t'-t} rt' [a reward function]...)

In regard to claim 8, Nair teaches: wherein updating the policy involves adjusting internal parameters of a reinforcement learning model to obtain the updated policy. (Nair, p. 4 "The parameter server receives gradients from the learners, and applies these gradients to modify the parameter vector θ+, [(gradients used for) adjusting internal parameters] using an asynchronous stochastic gradient descent algorithm."; p. 5 "Algorithm 1 Distributed DQN Algorithm... Sample random mini-batch from D… Compute gradients… Send gradients to the parameter server… sync θ- with parameters θ+ from the parameter server. [sync: obtain the updated policy]")

In regard to claim 9, Nair teaches: wherein the policy maps environmental context describing states of the environment to probability distributions of potential actions and the remote agent processes randomly select actions according to the probability distributions. (Nair, p. 3 "The agent’s policy π(a|s) maps states to actions and defines its behavior. [the policy maps (s) environmental context describing states of the environment to (a) potential actions]"; p. 5 "Algorithm 1 Distributed DQN Algorithm... With probability ε take a random action at...
[randomly select actions according to (π(a|s)) the probability distributions]"; π(a|s) represents a conditional probability distribution over actions given the current state)

[Image: media_image3.png]

In regard to claim 10, Nair teaches: A method comprising: performing two or more experience-gathering iterations, individual experience-gathering iterations comprising: (Nair, p. 5 "Algorithm 1 Distributed DQN Algorithm... for episode = 1 to M do [performing two or more experience-gathering iterations]... for t = 1 to T do [individual experience-gathering iterations]... ")

by an agent process executing on an agent computing device… (Nair, p. 1 "We present the first massively distributed architecture for deep reinforcement learning… parallel actors [an agent computing device] that generate new behaviour"; actor is [an agent computing device])

wherein the training computing device is remote from the agent computing device and includes a training process that generates an updated policy; (Nair, p. 1 "We present the first massively distributed architecture for deep reinforcement learning… parallel actors [the agent computing devices] that generate new behaviour; parallel learners [the training computing device] that are trained from stored experience;"; p. 5 "The only communication between bundles is via parameters: the learners communicate their gradients to the parameter server; and the Q-networks in the actors and learners are periodically synchronized to the parameter server."; actors and learners in the distributed system mean they are remote from each other, and a training computing device (server and all learners/workers) generates an updated data/policy)

… by the agent process, taking actions in an environment by executing the updated policy locally on the agent computing device; and (Nair, p.
1 "We present the first massively distributed architecture for deep reinforcement learning… parallel actors [the agent process] that generate new behaviour…"; p. 5 "Algorithm 1 Distributed DQN Algorithm... Execute the action in the environment [taking actions in an environment] and observe the reward rt and the next state st+1..."; p. 2 "In addition to generating more data, distributed actors can explore the state space more effectively, as each actor behaves according to a slightly different policy."; p. 4 "Each actor contains a replica of the Q-network, which is used to determine behavior, for example using an ε-greedy policy. [by executing the updated policy locally on the agent computing device]")

by the agent process, publishing experiences representing reactions of the environment to the actions taken according to the updated policy, (Nair, p. 1 "We present the first massively distributed architecture for deep reinforcement learning… parallel actors [the agent process] that generate new behaviour…"; p. 5 "Algorithm 1 Distributed DQN Algorithm... Execute the action in the environment [taking actions in an environment] and observe the reward rt and the next state st+1. Store (st, at, rt, st+1) in D. [publishing experiences representing reactions of the environment to the actions]"; p. 2 "In addition to generating more data, distributed actors can explore the state space more effectively, as each actor behaves according to a slightly different policy."; p. 4 "Each actor contains a replica of the Q-network, which is used to determine behavior, for example using an ε-greedy policy. [the actions taken according to the updated policy]")

wherein the experiences are published to the training process to further update the policy for use in a subsequent experience-gathering iteration by the agent process. (Nair, p. 5 "Algorithm 1 Distributed DQN Algorithm... Store (st, at, rt, st+1) in D.
[publishing experiences]… Sample random mini-batch from D… Compute gradients… Send gradients… sync θ- with parameters θ+ from the parameter server. [update the policy for use]... for t = 1 to T do With probability ε take a random action at... [in a subsequent experience-gathering iteration by the agent process]")

Nair does not teach, but Python.Org teaches: by an agent process... opening a temporary network connection with a training computing device, (Python.Org, p. 1 "await asyncio.open_connection('127.0.0.1', 8888)"; Nair teaches parallel actors and learners are periodically synced to the parameter server [a training computing device (server and all learners/workers)])

by the agent process, receiving the updated policy from the training computing device over the temporary network connection; (Python.Org, p. 1 "data = await reader.read(100) print(f'Received: {data.decode()!r}')")

by the agent process, closing the temporary network connection after the updated policy has been received from the training computing device; (Python.Org, p. 1 "print('Close the connection') writer.close() await writer.wait_closed()")

[Image: media_image4.png]

The rationale for combining the teachings of Nair and Python.Org is the same as set forth in the rejection of claim 1.

In regard to claim 11, Nair teaches: wherein the experiences are published to an experience data store that is populated with other experiences by one or more other agent processes that are also remote from the training computing device, and (Nair, p. 4 "Experience replay memory. [an experience data store] The experience tuples eti = (sti, ati, rti, sti+1) generated by the actors are stored in a replay memory D."; p. 3 "The Gorila architecture contains N_act different actor processes [one or more other agent processes (that are remote from learners, the training computing device)]… Each actor i generates its own trajectories of experiences si1, ai1, ri1,..., siT...
within the environment...") the updated policy is updated by the training process based on the experiences and the other experiences. (Nair, p. 4 "Learners. Gorila contains N_learn learner processes. Each learner contains a replica of the Q-network and its job is to compute desired changes to the parameters of the Q network. [the updated policy] For each learner update k, a minibatch of experience tuple e = (s, a, r, s') [based on the experiences and the other experiences] is sampled from either a local or global experience replay memory D (see above). The learner applies an off-policy RL algorithm such as DQN (Mnih et al., 2013) to this minibatch of experience, in order to generate a gradient vector gi.") In regard to claim 12, Nair teaches: wherein the updated policy is obtained from a policy data store that is accessible by the one or more other agent processes to obtain the updated policy. (Nair, p. 4 "Parameter server. Like DistBelief, the Gorila architecture uses a central parameter server [a policy data store] to maintain a distributed representation of the Q-network Q(s,a;θ+)") In regard to claim 13, Nair teaches: wherein taking the actions comprises: inputting context features describing the environment into the updated policy; and (Nair, p. 3 "The agent’s policy π (a|s) maps states to actions and defines its behavior."; p. 5 "Algorithm 1 Distributed DQN Algorithm... With probability ε take a random action at or else at = argmax_a Q(s, a;θ)... [inputting context features describing the environment (state s) into the updated policy (Q-network or θ)]"; p. 5 "We preprocessed the 210x160 RGB images by downsampling them to 84x84 and extracting the luminance channel."; also see figure 2, Actor, s is inputted into the Q-network; preprocessed or down sampled images can also be [context features]) selecting the actions based at least on output determined by the updated policy according to the context features. (Nair, p. 
3 "The agent’s policy π(a|s) maps states to actions and defines its behavior."; p. 5 "Algorithm 1 Distributed DQN Algorithm... With probability ε take a random action at or else at = argmax_a Q(s, a;θ)... [selecting the actions (at) based at least on output Q(s, a;θ) determined by the updated policy (Q-network or θ) according to the context features (s)]"; p. 5 "We preprocessed the 210x160 RGB images by downsampling them to 84x84 and extracting the luminance channel."; see figure 2, Actor, at = argmax_a Q(s, a;θ) is output of the Q-network; preprocessed or downsampled images can also be [context features])

In regard to claim 14, Nair teaches: the output of the updated policy comprising a probability distribution over available actions, the actions being selected randomly from the probability distribution. (Nair, p. 3 "The agent’s policy π(a|s) maps states to actions and defines its behavior."; p. 5 "Algorithm 1 Distributed DQN Algorithm... With probability ε take a random action at... [the actions (at) being selected randomly from the probability distribution π(a|s)]"; π(a|s) represents a conditional probability distribution over actions given the current state)

In regard to claim 15, Nair teaches: further comprising: receiving a final policy from the training process after the two or more experience-gathering iterations; and taking further actions in the environment based at least on the final policy. (Nair, p. 5 "Algorithm 1 Distributed DQN Algorithm... for episode = 1 to M [after the two or more experience-gathering iterations] do ... for t = 1 to T do... Execute the action in the environment [taking further actions] and observe the reward rt and the next state st+1... Every global N steps sync θ- with parameters θ+ from the parameter server. [sync: (actors) receiving a final policy]..."; p.
2 "In addition to generating more data, distributed actors can explore the state space more effectively, as each actor behaves according to a slightly different policy."; p. 4 "Each actor contains a replica of the Q-network, which is used to determine behavior, for example using an ε-greedy policy. [(actors) taking further actions in the environment based at least on the final policy]")

In regard to claim 16, Nair teaches: further comprising: performing the two or more experience-gathering iterations in a training mode and entering inference mode when using the final policy. (Nair, p. 5 "Algorithm 1 Distributed DQN Algorithm... for episode = 1 to M [performing the two or more experience-gathering iterations] do ... for t = 1 to T do... Execute the action in the environment [entering inference mode when using the final policy] and observe the reward rt and the next state st+1... Every global N steps sync θ- with parameters θ+ from the parameter server..."; p. 2 "In addition to generating more data, distributed actors can explore the state space more effectively, as each actor behaves according to a slightly different policy."; p. 4 "Each actor contains a replica of the Q-network, which is used to determine behavior, for example using an ε-greedy policy. [(actors) entering inference mode when using the final policy]")

In regard to claim 17, Nair teaches: further comprising: computing rewards for the reactions of the environment to the actions taken by the agent; and publishing the rewards with the experiences. (Nair, p. 5 "Algorithm 1 Distributed DQN Algorithm... Execute the action in the environment and observe the reward rt and the next state st+1. Store (st, at, rt, st+1) in D. [publishing the rewards with the experiences]"; p. 3 "The goal of an RL agent is to maximize its expected total reward, where the rewards are discounted by a factor γ... per time-step. Specifically, the return [computing rewards] at time t is Rt = Σ_{t'=t}^{T} γ^{t'-t} rt'...
The action-value function Qπ(s, a) is the expected return after observing state st and taking an action under a policy π...")

In regard to claim 18, Nair teaches: the updated policy comprising a neural network having a convolutional layer, the environment comprising video from an application, wherein taking the actions involves inputting the video to the neural network and selecting the actions based on output of the neural network, and the actions involve providing control inputs to the application. (Nair, p. 5 "The Q-network Q(s, a;θ) had 3 convolutional layers [a convolutional layer] followed by a fully-connected hidden layer."; p. 5 "We evaluated Gorila by conducting experiments on 49 Atari 2600 games [an application] using the Arcade Learning Environment (Bellemare et al., 2012)... where an agent must learn to play the games [selecting actions, providing control inputs to the applications/games] directly from 210x160 RGB video input [inputting the video] with only the changes in the score provided as rewards.")

In regard to claim 20, Nair teaches: further comprising the remote agent computing devices, wherein the remote agent processes are configured to: perform two or more iterations of an experience-gathering process in a training mode by gathering the experiences according to at least two corresponding iterations of the updated policy provided by the training process to the plurality of remote agent processes, (Nair, p. 5 "Algorithm 1 Distributed DQN Algorithm... for episode = 1 to M do [performing two or more training iterations in a training mode]... for t = 1 to T do... Execute the action in the environment and observe the reward rt and the next state st+1. Store (st, at, rt, st+1) in D. [gather the experiences]... Every global N steps [according to at least two corresponding iterations] sync θ- with parameters θ+ from the parameter server.
[sync: the updated policy provided by the training process (learners) to the plurality of remote agent processes (actors)]") … responsive to receiving a final policy from the training process, enter inference mode and take further actions in the environment by executing the final policy. (Nair, p. 5 "Algorithm 1 Distributed DQN Algorithm... for episode = 1 to M do ... for t = 1 to T do... Every global N steps sync θ- with parameters θ+ from the parameter server. [sync: (actors) receiving a final policy]"; p. 2 "In addition to generating more data, distributed actors can explore the state space more effectively, as each actor behaves according to a slightly different policy."; p. 4 "Each actor contains a replica of the Q-network, which is used to determine behavior, for example using an ε-greedy policy. [(actors) entering inference mode and taking actions by executing the final policy]") Nair does not teach, but Python.Org teaches: the experiences being gathered over other temporary network connections that are closed after the experiences are transmitted by the remote agent computing devices to the training computing device; and (Python.Org, p. 1 "... writer = await asyncio.open_connection('127.0.0.1', 8888) [other temporary network connections]... print(f'Send: {message!r}') writer.write(message.encode()) await writer.drain() [after the updated data/experiences are sent out]... print('Close the connection') writer.close() await writer.wait_closed() [close the other temporary connection]"); in light of spec. [0040] "the implementation shown in FIG. 3 does not necessarily require synchronous communication between the agent 102 and the trainer 302... 
Once the group of experiences has been collected, the agent can open another connection to the experience data store 306, publish the batch of experiences, and close the connection.")

In regard to claim 21, Nair teaches: the two or more training iterations being performed in the absence of a persistent network connection with the plurality of remote agent computing devices. (Nair, p. 4 "The Gorila architecture provides considerable flexibility in the number of ways an RL agent may be parallelized... the Gorila architecture in general allows for arbitrary numbers of actors, learners, and parameter servers to both generate data, learn from that data, and update the model in a scalable and fully distributed fashion. [training being performed in a fully parallel distributed manner]"; p. 1 "We present the first massively distributed architecture for deep reinforcement learning. This architecture uses four main components: parallel actors that generate new behaviour; parallel learners that are trained from stored experience; a distributed neural network to represent the value function or behaviour policy; and a distributed store of experience."; in light of specification [0086] "by storing policies and experiences at network locations accessible to both the agents and the trainer, the trainer and agents can act in parallel without explicit coordination. This further facilitates debugging... in the absence of persistent network connections."; because all parallel connections between the multiple actors, learners, and the distributed data store are available over a network, training can be done without persistent network connections, i.e. [in the absence of a persistent network connection])

Claims 22-23 are rejected under 35 U.S.C. 103 as being unpatentable over Nair and Python.Org as applied to claim 11, and in further view of Firoiu ("Beating the World’s Best at Super Smash Bros. 
Melee with Deep Reinforcement Learning" 20170508)

In regard to claim 22, Nair and Python.Org do not teach, but Firoiu teaches: further comprising, by the agent process in a single experience-gathering iteration: (Firoiu, p. 3 "The many parallel agents periodically send their experiences to a trainer, which maintains a circular queue of the most recent experiences."; the most recent multiple experiences are saved in a queue and sent out to the trainer in a single iteration) taking multiple actions by generating video game control inputs at a frame rate associated with a video game; and (Firoiu, p. 1 "We focus on Super Smash Bros. Melee (SSBM), [a video game] a fast-paced multi-player fighting game released in 2001 for the Nintendo Gamecube."; p. 2 "The game runs natively at 60 frames per second, which we lower to 30 [at a frame rate FPS associated with a video game] by skipping every other frame... The GameCube controller has two analog sticks, five buttons, two triggers, and a directional pad, all of which are relevant in SSBM. To make things easier, we eliminate most of the inputs, leaving only 9 discrete positions on the main analog stick and 5 buttons (at most one of which may be pressed at a time), for a total of 54 discrete actions. [multiple actions by generating control inputs]") publishing, to the training process, multiple reactions of the video game in a single network communication over the temporary network connection. (Firoiu, p. 3 "Overall our approach most closely resembles DeepMind’s Asynchronous Advantage Actor-Critic [Mnih et al., 2016], although we do not perform asynchronous gradient updates (merely asynchronous experience generation)"; p. 3 "The many parallel agents periodically send their experiences to a trainer, [publishing, to the training process, multiple reactions of the video game in a single network communication] which maintains a circular queue of the most recent experiences... This asynchronous setup... 
The upside is that no time is wasted waiting on the part of either the agents or the trainer"; p. 3 "experiences (state-action-reward sequences)"; p. 2 "In the RL literature, our approach might be referred to as n-step SARSA."; the most recent multiple experiences (which include reactions s' of the video game) are saved in a queue and sent out to the trainer in one communication [in a single network communication], i.e. as opposed to not using a queue and sending out a single experience every time (as a result, multiple communications are used); experiences at the trainer side are updated asynchronously, i.e. communications between the agents and the trainer are via asynchronous network connections [over the temporary network connections]; experiences are sequences of (s, a, r, s', a', r'))

It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Nair and Python.Org to incorporate the teachings of Firoiu by including asynchronous generation of multiple experiences. Doing so would ensure that no time is wasted waiting on the part of either the agents or the trainer. (Firoiu, p. 3 "The many parallel agents periodically send their experiences to a trainer... This asynchronous setup technically breaks the assumption of the REINFORCE learning rule that the data is generated from the current policy network (in reality the network has since been updated by a few gradient steps), but in practice this does not appear to be a problem, likely because the gradient steps are sufficiently small to not change the policy significantly in the time that an experience sits in the queue. The upside is that no time is wasted waiting on the part of either the agents or the trainer.")

In regard to claim 23, Nair and Python.Org do not teach, but Firoiu teaches: further comprising: determining a frame rate at which a video game is set to accommodate human users; (Firoiu, p. 
2 "The game runs natively at 60 frames per second, which we lower to 30 by skipping every other frame… To better match human play, we would lower this further by skipping more frames… [determining a frame rate FPS to accommodate human users]") increasing the frame rate of the video game to an increased frame rate, the increased frame rate being based at least on hardware capabilities of the agent computing device; and (Firoiu, p. 3 "Despite being 15 years old, SSBM is not trivial to emulate 1. Empirically, we found that, while a modern CPU can reach framerates of about 5x real time… This is quite slow compared to the performance-engineered Atari Learning Environment... We remedy this by running many different emulators in parallel, typically 50 or more per experiment. 2 [increasing the frame rate based on hardware capabilities]"; p. 3 "2Computing resources were provided by the Mass. Green High-Performance Computing Center") by the agent process in a single experience-gathering iteration: (Firoiu, p. 3 "The many parallel agents periodically send their experiences to a trainer, which maintains a circular queue of the most recent experiences."; the most recent multiple experiences are saved in a queue and sent out to the trainer in a single iteration) taking multiple actions by generating video game control inputs at the increased frame rate of the video game; and (Firoiu, p. 3 "We remedy this by running many different emulators in parallel, typically 50 or more per experiment. 2 [emulators run at the increased frame rate] The many parallel agents periodically send their experiences to a trainer, which maintains a circular queue of the most recent experiences. [multiple experiences including multiple actions]"; p. 3 "experiences (state-action-reward sequences)"; p. 
2 "In the RL literature, our approach might be referred to as n-step SARSA."; experiences are sequences of (s, a, r, s', a', r')) The rationale for combining the teachings of Nair, Python.Org and Firoiu is the same as set forth in the rejection of claim 22.

Response to Arguments

Applicant's arguments with respect to the rejection of the claims under 35 U.S.C. 101 have been fully considered, and the 101 Rejection has been withdrawn. Applicant's arguments with respect to the rejection of the claims under 35 U.S.C. 102 have been fully considered but they are moot:

Applicant argues: (p. 15) However, Nair is at best silent as to whether the synchronization of the Q-networks to the actors is over temporary network connections. Thus, Nair does not teach or suggest at least "by the agent process, closing the temporary network connection after the updated policy has been received from the training computing device," as recited by amended independent claim 10… Nair is at best silent as to the use of temporary network connections that are closed after experiences are transmitted by Nair's actors…

Examiner answers: the arguments do not apply to the references (Python.Org) being used in the current rejection.

Applicant argues: (p. 16) Nair is at best silent as to the use of a single network communication to publish multiple reactions of a video game by Nair's actors. Thus, Nair does not teach or suggest at least the above-emphasized features of new dependent claim 22… Nair is at best silent as to increasing the frame rate of a video game. Thus, Nair does not teach or suggest at least the above-emphasized features of new dependent claim 23.

Examiner answers: the arguments do not apply to the references (Firoiu) being used in the current rejection.

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Samsami ("Distributed Deep Reinforcement Learning: An Overview" 20201122) teaches (Samsami, p. 
4 "The General Reinforcement Learning Architecture, GORILA (Nair et al., 2015), is an asynchronous distributed reinforcement learning architecture.") Dean ("Large Scale Distributed Deep Networks" 2012) teaches (Dean, p. 3 "Figure 1: An example of model parallelism in DistBelief… computation for individual nodes will be parallelized across all available CPU cores."; p. 3 "The performance benefits of distributing a deep network across multiple machines... Models with a large number of parameters or high computational demands typically benefit from access to more CPUs and memory")

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to SU-TING CHUANG whose telephone number is (408)918-7519. The examiner can normally be reached Monday - Thursday 8-5 PT. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. 
To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew J. Jung can be reached on (571) 270-3779. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /S.C./Examiner, Art Unit 2146 /ANDREW J JUNG/Supervisory Patent Examiner, Art Unit 2146
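The temporary-connection pattern at the center of the amended claims (open a connection, publish a batch of experiences, close the connection) can be sketched with the same asyncio streams API the Examiner cites from Python.Org. This is an illustrative sketch only, not code from the application or the references; the experience field names and the in-process trainer server are assumptions made for the example:

```python
import asyncio
import json

# Hypothetical experience batch; the field names (s, a, r, s_next) are
# illustrative only and are not taken from the record.
EXPERIENCES = [
    {"s": 0, "a": 1, "r": 0.5, "s_next": 1},
    {"s": 1, "a": 0, "r": 1.0, "s_next": 2},
]

async def publish_batch(host, port, batch):
    """Agent side: open a temporary connection, publish the batch, close."""
    reader, writer = await asyncio.open_connection(host, port)
    writer.write(json.dumps(batch).encode())
    writer.write_eof()          # mark the end of the batch
    await writer.drain()
    writer.close()              # connection is closed after the experiences are sent
    await writer.wait_closed()

async def main():
    received = []
    done = asyncio.Event()

    async def trainer(reader, writer):
        # Trainer side: read until the agent signals EOF, then record the batch.
        data = await reader.read()
        received.extend(json.loads(data.decode()))
        writer.close()
        await writer.wait_closed()
        done.set()

    # Port 0 lets the OS pick a free port for this self-contained demo.
    server = await asyncio.start_server(trainer, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    async with server:
        await publish_batch("127.0.0.1", port, EXPERIENCES)
        await done.wait()
    return received

batch = asyncio.run(main())
print(len(batch))  # 2
```

No connection survives the transfer: the agent's socket exists only for the duration of `publish_batch`, which is the behavior the rejection maps to the "temporary network connections that are closed after the experiences are transmitted" limitation.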

Prosecution Timeline

Mar 07, 2022
Application Filed
May 31, 2025
Non-Final Rejection — §103
Jun 25, 2025
Applicant Interview (Telephonic)
Jun 25, 2025
Examiner Interview Summary
Aug 04, 2025
Response Filed
Oct 21, 2025
Final Rejection — §103
Nov 21, 2025
Applicant Interview (Telephonic)
Nov 21, 2025
Examiner Interview Summary

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12561600
LINEAR TIME ALGORITHMS FOR PRIVACY PRESERVING CONVEX OPTIMIZATION
2y 5m to grant Granted Feb 24, 2026
Patent 12518154
TRAINING MULTIMODAL REPRESENTATION LEARNING MODEL ON UNANNOTATED MULTIMODAL DATA
2y 5m to grant Granted Jan 06, 2026
Patent 12481725
SYSTEMS AND METHODS FOR DOMAIN-SPECIFIC ENHANCEMENT OF REAL-TIME MODELS THROUGH EDGE-BASED LEARNING
2y 5m to grant Granted Nov 25, 2025
Patent 12468951
Unsupervised outlier detection in time-series data
2y 5m to grant Granted Nov 11, 2025
Patent 12412095
COOPERATIVE LEARNING NEURAL NETWORKS AND SYSTEMS
2y 5m to grant Granted Sep 09, 2025
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

3-4
Expected OA Rounds
52%
Grant Probability
91%
With Interview (+39.7%)
4y 5m
Median Time to Grant
Moderate
PTA Risk
Based on 101 resolved cases by this examiner. Grant probability derived from career allow rate.
