Prosecution Insights
Last updated: April 19, 2026
Application No. 18/146,061

DISTRIBUTED REINFORCEMENT LEARNING SYSTEM AND DISTRIBUTED REINFORCEMENT LEARNING METHOD

Status: Non-Final OA (§103)
Filed: Dec 23, 2022
Examiner: CHOI, DAVID E
Art Unit: 2148
Tech Center: 2100 — Computer Architecture & Software
Assignee: Preferred Networks Inc.
OA Round: 1 (Non-Final)
Grant Probability: 75% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 2y 11m
With Interview: 88%

Examiner Intelligence

Career Allow Rate: 75% (448 granted / 595 resolved); +20.3% vs TC average (above average)
Interview Lift: +12.4% (moderate), measured on resolved cases with interview
Typical Timeline: 2y 11m average prosecution; 18 applications currently pending
Career History: 613 total applications across all art units

Statute-Specific Performance

§101: 6.6% (-33.4% vs TC avg)
§103: 65.9% (+25.9% vs TC avg)
§102: 17.8% (-22.2% vs TC avg)
§112: 1.9% (-38.1% vs TC avg)
TC avg = Tech Center average estimate • Based on career data from 595 resolved cases

Office Action

§103
DETAILED ACTION

1. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

2. This action is responsive to the following communication: original claims filed 12/23/22. This action is made non-final.

3. Claims 1-20 are pending in the case. Claims 1, 19 and 20 are independent claims.

35 USC § 112(f)

4. The following is a quotation of 35 U.S.C. 112(f):

(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph:

An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

5. With regard to claim 1, claim limitations “actor device configured to”, “replay buffer configured to” and “learner device configured to” have been interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because they use a generic placeholder (e.g. “device”) coupled with functional language (e.g. “configured to”, etc.) without reciting sufficient structure to achieve the function. Furthermore, the generic placeholder is not preceded by a structural modifier.

Absence of the word “means” (or “step for”) in a claim creates a rebuttable presumption that the claim element is not to be treated in accordance with 35 U.S.C. 112(f) (pre-AIA 35 U.S.C. 112, sixth paragraph). The presumption that 35 U.S.C. 112(f) (pre-AIA 35 U.S.C. 112, sixth paragraph) is not invoked is rebutted when the claim element recites function but fails to recite sufficiently definite structure, material or acts to perform that function. Claim elements in this application that use the word “means” (or “step for”) are presumed to invoke 35 U.S.C. 112(f) except as otherwise indicated in an Office action. Similarly, claim elements that do not use the word “means” (or “step for”) are presumed not to invoke 35 U.S.C. 112(f) except as otherwise indicated in an Office action.

Since the claim limitations invoke 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, claims 2-18 have been interpreted to cover the corresponding structure described in the specification that achieves the claimed function, and equivalents thereof.

If applicant wishes to provide further explanation or dispute the examiner’s interpretation of the corresponding structure, applicant must identify the corresponding structure with reference to the specification by page and line number, and to the drawing, if any, by reference characters in response to this Office action. If applicant does not intend to have the claim limitations treated under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may amend the claims so that they will clearly not invoke 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, or present a sufficient showing that the claims recite sufficient structure, material, or acts for performing the claimed function to preclude application of 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. For more information, see MPEP § 2173 et seq. and Supplementary Examination Guidelines for Determining Compliance With 35 U.S.C. 112 and for Treatment of Related Issues in Patent Applications, 76 FR 7162, 7167 (Feb. 9, 2011).

Claim Rejections - 35 USC § 103

7. The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

8. Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Pietquin (US 20200151562) in view of Eberle (US 20200145725).

Regarding claim 1, Pietquin discloses a distributed reinforcement learning system comprising: one or more actor devices configured to acquire experience data, the experience data being used for reinforcement learning and corresponding to an action determined based on a model to be trained (the operation transition data may comprise experience tuples, each comprising state data, action data, new state data, and reward data, for example (s, a, s′, r); the stored experience tuples are used for training the reinforcement learning system, paragraph 0052); a plurality of replay buffers configured to store the experience data acquired from the one or more actor devices (the reinforcement learning system 100 includes a replay buffer 150, that is, memory which stores reinforcement learning transitions as operation transition data, generated as a consequence of the agent 102a interacting with the environment 104 during operation of the RL system, paragraph 0052); and one or more learner devices configured to train the model in the reinforcement learning, the reinforcement learning using the experience data stored in the plurality of replay buffers (the demonstrator 102b interacting with the environment 104 generates reinforcement learning transitions comprising demonstration transition data, which may similarly comprise demonstration experience tuples of state data, action data, new state data, and reward data, for example (s, a, s′, r); these are likewise stored in the replay buffer 150 for use in training the reinforcement learning system; storing the demonstration experience tuples helps to maintain transitions which may include sparse rewards, and thus facilitates propagating the sparse rewards during training, paragraph 0055).

Pietquin does not disclose wherein the plurality of replay buffers are distributed and arranged in a plurality of nodes. However, Eberle discloses how the reorder and replay buffers are accessed with the help of the hash table 306, in one embodiment, as depicted in the transceiver arrangement 400 of FIG. 4 (see paragraph 0088). The combination of Pietquin and Eberle would have resulted in the replay buffers further incorporating arrangement in a node. One would have been motivated to utilize Pietquin and combine it with Eberle, as a user of Pietquin is already interested in creating replay buffers and reinforcement learning on the same neural network. As such, the combination of references would have yielded a predictable invention, as the teachings were obvious to one of ordinary skill in the art and their combination would have been common in the art.

Regarding claim 2, Pietquin does not disclose wherein each node of the plurality of nodes is a single computer. However, Eberle discloses, in FIG. 4, wherein the buffers/nodes are on a single arrangement. The combination of Pietquin and Eberle would have resulted in the replay buffers further incorporating arrangement in a node. One would have been motivated to utilize Pietquin and combine it with Eberle, as a user of Pietquin is already interested in creating replay buffers and reinforcement learning on the same neural network. As such, the combination of references would have yielded a predictable invention, as the teachings were obvious to one of ordinary skill in the art and their combination would have been common in the art.

Regarding claim 3, Pietquin discloses wherein the experience data stored by each replay buffer of the plurality of replay buffers is different from the experience data stored by other replay buffers of the plurality of replay buffers (when the replay buffer is full the oldest operation transition data experience tuples may be discarded; however, in some implementations some or all of the demonstration experience tuples are maintained in the replay buffer throughout training of the RL system, discarding, for example overwriting, the operation transition data in preference to the demonstration transition data; the one or more demonstrations may be performed before beginning to train the RL system, see paragraph 0056).

Regarding claim 4, Pietquin discloses wherein each replay buffer of the plurality of replay buffers is associated with one or more learner devices (paragraph 0055; see the passage quoted above for claim 1).

Regarding claim 5, Pietquin discloses wherein a first learner device of the one or more learner devices acquires the experience data that is used for the reinforcement learning from a replay buffer that is associated with the first learner device among the plurality of replay buffers (paragraphs 0052-0055; see the passages quoted above for claim 1).

Regarding claim 6, Pietquin discloses wherein a first learner device of the one or more learner devices does not acquire the experience data that is used for the reinforcement learning from a replay buffer that is not associated with the first learner device among the plurality of replay buffers (see paragraphs 0052-0055, quoted above for claim 1).

Regarding claim 7, Pietquin discloses further comprising one or more controllers configured to acquire the experience data from the one or more actor devices and distribute the acquired experience data to the plurality of replay buffers (see FIG. 1, wherein the experience data is captured from the replay buffers).

Regarding claim 8, Pietquin discloses wherein each learner device of the one or more learner devices includes the model that is identical and updates parameters of the included model by using a gradient that is common with another learner device (the system may be configured to update the learning actor neural network using a deterministic policy gradient comprising a product of a gradient of the output of the learning critic neural network and a gradient of the output of the learning actor neural network evaluated using the stored tuples of both the operation transition data and the demonstration transition data; the system may be configured to, at intervals, update weights of the target actor neural network using the learning actor neural network and to update weights of the target critic neural network using the learning critic neural network; the at-intervals updating may involve the weights of the target actor and critic neural networks slowly tracking those of the learning actor and critic neural networks, see paragraph 0014).

Regarding claim 9, Pietquin discloses wherein each learner device of the one or more learner devices calculates the gradient of the model based on the experience data and transmits the gradient to another learner device (the learning actor network may be updated using a deterministic policy gradient, integrating (averaging) over just the state space rather than both the state and action space, because the action policy is deterministic based upon the state; this policy gradient defines the performance of a policy function mapping the state data to the action data; the policy gradient may be approximated from a product of the gradient of the output of the (learning) critic neural network with respect to actions and the gradient of the output of the (learning) actor neural network with respect to its weights; these gradients may again be evaluated on a sample or minibatch taken from the replay buffer, see paragraph 0026).

Regarding claim 10, Pietquin discloses wherein the one or more actor devices repeatedly acquire information related to the model from the one or more learner devices at periodic intervals (the RL system then learns through interaction with environment 104; thus the RL system repeatedly selects an action using the actor neural network 110 and updates the replay buffer (step 206), updates the actor and critic neural networks using the target actor and critic neural networks and data from the replay buffer (step 208), and updates the target actor and critic neural networks, either at intervals or by tracking the learned networks as previously described (step 210); the learning steps 206-210 are performed repeatedly as the RL system is trained; in some implementations the update step 208 is performed multiple times for each environment interaction step; this facilitates efficient use of the data stored in the replay buffer, although at the risk of learning from stale data, which can result in incorrect Q-values and unstable learning; in practice >10, for example 10-100, update steps may be performed for each environment interaction step, paragraph 0066).

Regarding claim 11, Pietquin discloses wherein each learner device of the one or more learner devices is associated with a corresponding replay buffer among the plurality of replay buffers on a one-to-one basis (paragraph 0066; see the passage quoted above for claim 10).

Regarding claim 12, Pietquin discloses wherein a plurality of learner devices of the one or more learner devices are associated with a single replay buffer of the plurality of replay buffers (paragraph 0066; see the passage quoted above for claim 10).

Regarding claim 13, Pietquin does not disclose wherein a plurality of learner devices of the one or more learner devices and one or more replay buffers of the plurality of replay buffers are implemented in a single computer. However, Eberle discloses, in FIG. 4, wherein the buffers/nodes are on a single arrangement. The combination of Pietquin and Eberle would have resulted in the replay buffers further incorporating arrangement in a node. One would have been motivated to utilize Pietquin and combine it with Eberle, as a user of Pietquin is already interested in creating replay buffers and reinforcement learning on the same neural network. As such, the combination of references would have yielded a predictable invention, as the teachings were obvious to one of ordinary skill in the art and their combination would have been common in the art.

Regarding claim 14, Pietquin discloses wherein each learner device of the one or more learner devices is implemented by a graphics processing unit (the processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output; the processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit); for example, the processes and logic flows can be performed by, and apparatus can also be implemented as, a graphics processing unit (GPU), paragraph 0082).

Regarding claim 15, Pietquin discloses wherein the experience data that is acquired by a first actor device among the one or more actor devices includes data related to a state of an environment observed by the first actor device, data related to the action performed by the first actor device based on the state of the observed environment and the model, data related to a reward obtained by the first actor device as a result of the action, and data related to the state of the observed environment (in broad terms a critic neural network receives state, action, and reward/return data and defines a value function for an error signal, more particularly a TD error signal, which is used to train both actor and critic neural networks; an actor neural network receives the state data and has an action data output defining one or more actions in a continuous action space; the actor-critic system may be implemented in an asynchronous, multi-threaded manner comprising a set of worker agents, each with a separate thread, a copy of the environment and experience, and with respective network parameters used to update a global network, see paragraph 0023).

Regarding claim 16, Pietquin discloses wherein each of the plurality of replay buffers is associated with any one of groups of one or more actor devices to store the experience data acquired from an associated group of one or more actor devices (a replay buffer stores reinforcement learning transitions comprising operation transition data from operation of the system; the operation transition data may comprise tuples, that is, data groups, comprising the state data, the action data, the reward data, and new state data representing the new state; a second input, which may be the same input as the first input, receives training data defining demonstration transition data for demonstration transitions; the demonstration transition data may comprise a set of the tuples from a demonstration of the task, such as a control task, within the environment; thus reinforcement learning transitions stored in the replay buffer include the demonstration transition data; the neural network system may be configured to train the at least one actor neural network and the at least one critic neural network off-policy; the off-policy training may use the error signal and stored tuples from the replay buffer comprising tuples from both the operation transition data and the demonstration transition data, paragraph 0030).

Regarding claim 17, Pietquin discloses wherein the one or more learner devices are configured to transmit the trained model to the one or more actor devices, the transmitted trained model being used by the one or more actor devices in a next episode (the reinforcement learning system 100 selects actions to be performed by a reinforcement learning agent 102a interacting with an environment 104; that is, at each of a plurality of internal time steps, t, the reinforcement learning system 100 receives an observation characterizing a respective state s of the environment 104; in response to the observation the reinforcement learning system 100 selects an action a from a continuous action space, to be performed in response to the observation, and then instructs or otherwise causes the agent 102a to perform the selected action; after the agent 102a performs a selected action, the environment 104 transitions to a new state s′ and the system 100 receives another observation characterizing the new state s′ and a scalar reward r=R(s, a); the reward may be a numeric value that results from completing the task being performed by the agent 102a, see paragraph 0051).

Regarding claim 18, Pietquin discloses wherein: the one or more actor devices are configured to take actions determined based on the model repeatedly in a first episode; the distributed plurality of replay buffers are configured to store the experience data corresponding to the actions taken by the one or more actor devices in the first episode; and the one or more learner devices are configured to train the model used in the first episode using the experience data stored in the plurality of replay buffers, wherein the trained model is used for the one or more actor devices in a second episode (see paragraph 0066, quoted above for claim 10).

Regarding claim 19, the subject matter of the claim is substantially similar to claim 1 and as such the same rationale of rejection applies.

Regarding claim 20, the subject matter of the claim is substantially similar to claim 1 and as such the same rationale of rejection applies.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to DAVID E CHOI, whose telephone number is (571) 270-3780. The examiner can normally be reached on M-F: 7-2, 7-10 (PST). If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bechtold, Michelle T., can be reached on (571) 431-0762. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/DAVID E CHOI/
Primary Examiner, Art Unit 2148
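The §103 rejection turns on the claimed topology: actor devices that generate (s, a, s′, r) experience tuples, a plurality of replay buffers distributed across nodes, and learner devices that train the shared model from the buffered data. For orientation, that topology can be sketched in a few lines of Python. Everything below (the class names, the round-robin sharding, the toy update rule) is an illustrative assumption, not the applicant's implementation and not anything disclosed by Pietquin or Eberle.

```python
import random
from collections import deque

class ReplayBuffer:
    """One buffer, standing in for a buffer hosted on a separate node."""
    def __init__(self, capacity=1000):
        self.data = deque(maxlen=capacity)  # oldest tuples discarded when full

    def add(self, tup):
        self.data.append(tup)

    def sample(self, k):
        return random.sample(list(self.data), min(k, len(self.data)))

def actor_step(state, model):
    """Pick an action from the model and return an (s, a, s', r) tuple."""
    action = max(model, key=model.get)   # greedy action from current parameters
    next_state, reward = state + 1, 1.0  # toy environment transition
    return (state, action, next_state, reward)

def learner_step(model, buffers, batch_size=4):
    """Sample across the distributed buffers and nudge model parameters."""
    per_buffer = batch_size // len(buffers) + 1
    batch = [t for b in buffers for t in b.sample(per_buffer)]
    for _, action, _, reward in batch:
        model[action] += 0.1 * reward    # stand-in for a gradient update
    return model

# Wire it together: two buffers stand in for buffers on two nodes,
# and experience is distributed round-robin across them.
model = {"left": 0.0, "right": 0.0}
buffers = [ReplayBuffer(), ReplayBuffer()]
for step in range(10):
    buffers[step % len(buffers)].add(actor_step(step, model))
model = learner_step(model, buffers)
```

In a real distributed system the round-robin assignment would be replaced by a controller or sharding policy, and the learner would pull only from the buffer(s) associated with it, as the dependent claims describe.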

Prosecution Timeline

Dec 23, 2022: Application Filed
Jan 31, 2026: Non-Final Rejection under §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602396: "TRANSFORMING MODEL DATA" (granted Apr 14, 2026; 2y 5m to grant)
Patent 12585995: "Capturing Data Properties to Recommend Machine Learning Models for Datasets" (granted Mar 24, 2026; 2y 5m to grant)
Patent 12585957: "SYSTEM AND METHOD FOR EFFICIENT ESTIMATION OF CUMULATIVE DISTRIBUTION FUNCTION" (granted Mar 24, 2026; 2y 5m to grant)
Patent 12580878: "METHOD, APPARATUS, DEVICE AND STORAGE MEDIUM FOR PRESENTING SESSION MESSAGE" (granted Mar 17, 2026; 2y 5m to grant)
Patent 12572836: "INTELLIGENT PROVISIONING OF QUANTUM PROGRAMS TO QUANTUM HARDWARE" (granted Mar 10, 2026; 2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 75%
With Interview: 88% (+12.4%)
Median Time to Grant: 2y 11m
PTA Risk: Low
Based on 595 resolved cases by this examiner. Grant probability derived from career allow rate.
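The projection figures reconcile with the examiner statistics: 448 grants out of 595 resolved cases is about 75%, and adding the reported +12.4% interview lift gives about 88%. A quick sanity check, assuming the with-interview figure is simply the base allow rate plus the additive lift (an assumption; the tool may compute it differently):

```python
granted, resolved = 448, 595
base = granted / resolved * 100  # career allow rate, in percent
with_interview = base + 12.4     # reported interview lift, assumed additive

print(round(base))               # matches the "75% Grant Probability" card
print(round(with_interview))     # matches the "88% With Interview" card
```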
