DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This action is responsive to the application filed on January 17, 2023. Claims 1-18 are pending in the application; claims 1 and 11 are independent.
Receipt is acknowledged of certified copies of papers required by 37 CFR 1.55. The information disclosure statement (IDS) submitted on January 17, 2023 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Specification
The use of the term MuJoCo in ¶¶76 and 79, which is a trade name or a mark used in commerce, has been noted in this application. The term should be accompanied by the generic terminology; furthermore, the term should be capitalized wherever it appears or, where appropriate, include a proper symbol indicating use in commerce, such as ™, SM, or ®, following the term.
Although the use of trade names and marks used in commerce (i.e., trademarks, service marks, certification marks, and collective marks) is permissible in patent applications, the proprietary nature of the marks should be respected and every effort made to prevent their use in any manner which might adversely affect their validity as commercial marks.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 1-10 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or, for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.
Claim 1 recites the limitation “transmit the data set to the slave agent” in line 14. There is insufficient antecedent basis for this limitation in the claim. It is unclear whether “the data set” refers to “a data set collected in each state of a first environment” in line 5, “a data set received from a master agent” in line 7, or “a data set collected in each state of a second environment” in line 12. For examination purposes, this limitation has been interpreted as “transmit the data set collected in each state of a second environment to the slave agent”.
Claims 2-10 are rejected for being dependent on a rejected base claim without curing any of the deficiencies.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-18 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Regarding claim 1:
Step 1: Claim 1 is directed to “a multi-agent-based reinforcement learning system”; therefore, it falls under the statutory category of a machine.
Step 2A Prong 1: The claim recites, in part:
“learn a Q-function based on the data set stored in the first buffer” this encompasses the mental learning of a Q-function based on an observed data set. Further, this limitation is a mathematical concept.
“update a Q-function matched with the slave agent among a plurality of Q-functions” this encompasses the mental updating of an observed Q-function. Further, this limitation is a mathematical concept.
“perform reinforcement learning based on the data set stored in the second buffer” this limitation is a mathematical concept.
Step 2A Prong 2: The judicial exception is not integrated into a practical application; the remaining limitations of the claim are as follows: “store a data set collected in each state of a first environment in a first buffer”, “store a data set received from a master agent in the first buffer”, “store a data set collected in each state of a second environment in a second buffer”, and “transmit the data set to the slave agent”. These limitations are additional elements that amount to adding insignificant extra-solution activity to the judicial exception. See MPEP § 2106.05(g).
Step 2B: The additional elements, taken individually and in combination, do not provide an inventive concept of significantly more than the abstract idea itself for the reasons set forth in step 2A prong 2 above. Furthermore, “store a data set collected in each state of a first environment in a first buffer”, “store a data set received from a master agent in the first buffer”, “store a data set collected in each state of a second environment in a second buffer”, and “transmit the data set to the slave agent” are additional elements that amount to adding insignificant extra-solution activity to the judicial exception. See MPEP § 2106.05(g). Furthermore, the additional elements are directed to receiving or transmitting data over a network, e.g., using the Internet to gather data, Symantec, 838 F.3d at 1321, 120 USPQ2d at 1362, as well as to storing and retrieving information in memory, Versata Dev. Group, Inc. v. SAP Am., Inc., 793 F.3d 1306, 1334, 115 USPQ2d 1681, 1701 (Fed. Cir. 2015). See MPEP § 2106.05(d)(II). Therefore, the claim is ineligible.
Regarding claim 2, the rejection of claim 1 is incorporated and further:
Step 2A Prong 1: The claim does not recite any further abstract limitations beyond those identified in connection with claim 1.
Step 2A Prong 2: The judicial exception is not integrated into a practical application; the remaining limitations of the claim are as follows: “the master agent is configured to transmit the data set to the slave agent with a preset probability” the limitation is an additional element that amounts to adding insignificant extra-solution activity to the judicial exception. See MPEP § 2106.05(g).
Step 2B: The additional elements, taken individually and in combination, do not provide an inventive concept of significantly more than the abstract idea itself for the reasons set forth in step 2A prong 2 above. Furthermore, “the master agent is configured to transmit the data set to the slave agent with a preset probability” is an additional element that amounts to adding insignificant extra-solution activity to the judicial exception. See MPEP § 2106.05(g). Furthermore, the additional element is directed to receiving or transmitting data over a network, e.g., using the Internet to gather data, Symantec, 838 F.3d at 1321, 120 USPQ2d at 1362. Therefore, the claim is ineligible.
Regarding claim 3, the rejection of claim 2 is incorporated and further:
Step 2A Prong 1: The claim recites, in part:
“the preset probability is configured to decrease in proportion to a number of slave agents” this encompasses the mental decreasing of a preset probability depending on a number of observed slave agents.
Step 2A Prong 2: The claim does not recite any additional limitations, and thus does not recite any additional elements that integrate the judicial exception into a practical application or amount to significantly more.
Regarding claim 4, the rejection of claim 2 is incorporated and further:
Step 2A Prong 1: The claim recites, in part:
“update the Q-function matched with the slave agent among the plurality of Q-functions with a Q-function obtained from the slave agent” this encompasses the mental updating of an observed Q-function with a further observed Q-function. Further, this limitation is a mathematical concept.
Step 2A Prong 2: The judicial exception is not integrated into a practical application; the remaining limitations of the claim are as follows: the limitation “the master agent is configured to” is an additional element that amounts to adding the words “apply it” (or an equivalent) to the judicial exception, or merely uses a computer in its ordinary capacity as a tool to perform an existing process. See MPEP § 2106.05(f)(2).
Step 2B: The additional elements, taken individually and in combination, do not provide an inventive concept of significantly more than the abstract idea itself for the reasons set forth in step 2A prong 2 above. Therefore, the claim is ineligible.
Regarding claim 5, the rejection of claim 1 is incorporated and further:
Step 2A Prong 1: The claim recites, in part:
“extract a preset number of Q-functions randomly from among the plurality of Q-functions and learn the extracted Q-functions” this encompasses the mental extraction, at random, of a preset number of Q-functions from among an observed plurality of Q-functions, and the mental learning of the extracted Q-functions.
Step 2A Prong 2: The judicial exception is not integrated into a practical application; the remaining limitations of the claim are as follows: the limitation “the master agent is configured to” is an additional element that amounts to adding the words “apply it” (or an equivalent) to the judicial exception, or merely uses a computer in its ordinary capacity as a tool to perform an existing process. See MPEP § 2106.05(f)(2).
Step 2B: The additional elements, taken individually and in combination, do not provide an inventive concept of significantly more than the abstract idea itself for the reasons set forth in step 2A prong 2 above. Therefore, the claim is ineligible.
Regarding claim 6, the rejection of claim 1 is incorporated and further:
Step 2A Prong 1: The claim recites, in part:
“perform randomized ensembled double Q-learning based on the data set stored in the second buffer” this limitation is a mathematical concept.
Step 2A Prong 2: The judicial exception is not integrated into a practical application; the remaining limitations of the claim are as follows: the limitation “the master agent is configured to” is an additional element that amounts to adding the words “apply it” (or an equivalent) to the judicial exception, or merely uses a computer in its ordinary capacity as a tool to perform an existing process. See MPEP § 2106.05(f)(2).
Step 2B: The additional elements, taken individually and in combination, do not provide an inventive concept of significantly more than the abstract idea itself for the reasons set forth in step 2A prong 2 above. Therefore, the claim is ineligible.
Regarding claim 7, the rejection of claim 1 is incorporated and further:
Step 2A Prong 1: The claim recites a continuation of the abstract idea identified in the parent claim.
Step 2A Prong 2: The judicial exception is not integrated into a practical application; the remaining limitations of the claim are as follows: “the master agent is configured to be installed in a cloud server” the limitation is an additional element that generally links the use of the judicial exception to a particular technological environment or field of use. See MPEP § 2106.05(h).
Step 2B: The additional elements, taken individually and in combination, do not provide an inventive concept of significantly more than the abstract idea itself for the reasons set forth in step 2A prong 2 above. Therefore, the claim is ineligible.
Regarding claim 8, the rejection of claim 1 is incorporated and further:
Step 2A Prong 1: The claim recites, in part:
“perform double Q-learning based on the data set stored in the first buffer” this limitation is a mathematical concept.
Step 2A Prong 2: The judicial exception is not integrated into a practical application; the remaining limitations of the claim are as follows: the limitation “the slave agent is configured to” is an additional element that amounts to adding the words “apply it” (or an equivalent) to the judicial exception, or merely uses a computer in its ordinary capacity as a tool to perform an existing process. See MPEP § 2106.05(f)(2).
Step 2B: The additional elements, taken individually and in combination, do not provide an inventive concept of significantly more than the abstract idea itself for the reasons set forth in step 2A prong 2 above. Therefore, the claim is ineligible.
Regarding claim 9, the rejection of claim 1 is incorporated and further:
Step 2A Prong 1: The claim recites a continuation of the abstract idea identified in the parent claim.
Step 2A Prong 2: The judicial exception is not integrated into a practical application; the remaining limitations of the claim are as follows: “the slave agent is configured to be installed in a vehicle terminal” the limitation is an additional element that generally links the use of the judicial exception to a particular technological environment or field of use. See MPEP § 2106.05(h).
Step 2B: The additional elements, taken individually and in combination, do not provide an inventive concept of significantly more than the abstract idea itself for the reasons set forth in step 2A prong 2 above. Therefore, the claim is ineligible.
Regarding claim 10, the rejection of claim 1 is incorporated and further:
Step 2A Prong 1: The claim recites, in part:
“the data set includes a state (st) at a time (t), an action (at) selected in the state (st), a reward (rt) for the action (at), and a new state (st+1) changed by the action (at)”, which is a continuation of the abstract idea identified in the parent claim.
Step 2A Prong 2: The claim does not recite any additional limitations, and thus does not recite any additional elements that integrate the judicial exception into a practical application or amount to significantly more.
Regarding claim 11:
Step 1: Claim 11 is directed to “a multi-agent-based reinforcement learning method”; therefore, it falls under the statutory category of a process.
Step 2A Prong 1: The claim recites, in part:
“learning…a Q-function based on the data set stored in the first buffer” this encompasses the mental learning of a Q-function based on an observed data set. Further, this limitation is a mathematical concept.
“updating…a Q-function matched with the slave agent among a plurality of Q-functions” this encompasses the mental updating of an observed Q-function. Further, this limitation is a mathematical concept.
“performing…reinforcement learning based on the data set stored in the second buffer” this limitation is a mathematical concept.
Step 2A Prong 2: The judicial exception is not integrated into a practical application; the remaining limitations of the claim are as follows: “storing…a data set collected in each state of a second environment in a second buffer”, “transmitting…the data set to a slave agent”, and “storing…a data set collected in each state of a first environment and the data set received from the master agent in a first buffer” are additional elements that amount to adding insignificant extra-solution activity to the judicial exception. See MPEP § 2106.05(g). “by a master agent”, “by the master agent” in line 14, “by the slave agent” in line 16, “by the slave agent” in line 19, “by the master agent” in line 21, and “by the master agent” in line 23 are additional elements that amount to adding the words “apply it” (or an equivalent) to the judicial exception, or merely use a computer in its ordinary capacity as a tool to perform an existing process. See MPEP § 2106.05(f)(2).
Step 2B: The additional elements “by a master agent”, “by the master agent” in line 14, “by the slave agent” in line 16, “by the slave agent” in line 19, “by the master agent” in line 21, and “by the master agent” in line 23, taken individually and in combination, do not provide an inventive concept of significantly more than the abstract idea itself for the reasons set forth in step 2A prong 2 above. Furthermore, “storing…a data set collected in each state of a second environment in a second buffer”, “transmitting…the data set to a slave agent”, and “storing…a data set collected in each state of a first environment and the data set received from the master agent in a first buffer” are additional elements that amount to adding insignificant extra-solution activity to the judicial exception. See MPEP § 2106.05(g). Furthermore, the additional elements are directed to receiving or transmitting data over a network, e.g., using the Internet to gather data, Symantec, 838 F.3d at 1321, 120 USPQ2d at 1362, as well as to storing and retrieving information in memory, Versata Dev. Group, Inc. v. SAP Am., Inc., 793 F.3d 1306, 1334, 115 USPQ2d 1681, 1701 (Fed. Cir. 2015). See MPEP § 2106.05(d)(II). Therefore, the claim is ineligible.
Regarding claims 12-18:
The rejection of claim 11 is incorporated; the rejections of claims 2-6, 8, and 10 are applicable to claims 12-18, respectively.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-7, 9-16, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Hu et al. (US 20230214725 A1), hereinafter Hu, in view of Kunert et al. (“Accelerating Reinforcement Learning for Robot Controls Using Interim Rewards and Master/Slave Computing”, 2019), hereinafter Kunert, in view of Chen et al. (“Randomized Ensembled Double Q-Learning: Learning Fast Without A Model”, 18 March 2021), as cited in the IDS, hereinafter Chen, and further in view of Kong et al. (“Revisiting The Master-Slave Architecture In Multi-Agent Deep Reinforcement Learning”, 20 December 2017), hereinafter Kong.
Regarding claim 1:
Hu teaches A multi-agent-based reinforcement learning system comprising (Hu, claim 1 “A method for training multiple reinforcement learning (RL) agents deployed in a shared environment, the method comprising”):
store a data set received from a master agent in the first buffer (Hu, ¶84 “The method further includes updating 815 the global policy based on sampled experience tuples drawn from the shared experience replay buffer and distributing 820 the updated global policy to the multiple RL agents, wherein each of the multiple RL agents behave based at least in part on the updated global policy in the next episode.”), and
store a data set collected in each state of a second environment in a second buffer (Hu, ¶57 “The experience tuple of each agent is saved on a shared experience replay buffer.” Here, each agent's environment can be considered a second environment, and the shared experience replay buffer can be considered the second buffer),
perform reinforcement learning based on the data set stored in the second buffer (Hu, ¶37 “The shared experience replay buffer is utilized to train a single global unified policy to which each agent adheres when each RL agent interacts in the environment.”).
Hu does not teach “a slave agent configured to:
learn a Q-function based on the data set stored in the first buffer;
update a Q-function matched with the slave agent among a plurality of Q-functions;”
However, Kunert teaches a slave agent configured to (Kunert, page 1, abstract “In this regard, we revisit the idea of the master-slave architecture by incorporating both perspectives within one framework”):
learn a Q-function based on the data set stored in the first buffer (Kunert, page 2, col 1, ¶3 “Then, the agent learns with (1) and accumulates its knowledge in the Q-matrix. Q(st , a) = r(st, a) + γ ∗ (max(Q(st+1, ∀a ∈ A))) (1)”);
update a Q-function matched with the slave agent among a plurality of Q-functions (Kunert: “After completion the merging process, AM contains the knowledge of all slave agents. Using the data structures of AM, a Q-matrix for AM is computed by means of the post optimization according to section II. Thus, a behavior strategie a = Q(st) can be derived, which has to be transformed into a robot control as described in [2].”)
Hu and Kunert are analogous art because both references concern methods for multi-agent reinforcement learning. Accordingly, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Hu’s multi-agent reinforcement learning system to incorporate the slave agents taught by Kunert. The motivation for doing so would have been to accelerate learning time, as stated in Kunert, page 3, col 2, section V, ¶1: “To accelerate the learning time, a method for distributed learning with a master/slave approach is introduced.”
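For orientation only, the update rule quoted from Kunert’s eq. (1), Q(st, a) = r(st, a) + γ ∗ max(Q(st+1, ∀a ∈ A)), can be illustrated with a minimal tabular sketch. The class and method names below are illustrative assumptions and are not drawn from Hu, Kunert, or the claims.

```python
# Minimal sketch of the tabular Q-update quoted from Kunert eq. (1).
# Hypothetical illustration only; names are not from any cited reference.
from collections import defaultdict

class QTable:
    def __init__(self, actions, gamma=0.99):
        self.q = defaultdict(float)  # (state, action) -> value, default 0.0
        self.actions = actions
        self.gamma = gamma

    def update(self, s, a, r, s_next):
        # Kunert's eq. (1) assigns the bootstrapped return directly;
        # a learning-rate variant would blend old and new values instead.
        best_next = max(self.q[(s_next, a2)] for a2 in self.actions)
        self.q[(s, a)] = r + self.gamma * best_next
```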
Hu in view of Kunert does not teach “store a data set collected in each state of a first environment in a first buffer”
However, Chen teaches store a data set collected in each state of a first environment in a first buffer (Chen, Page 3, algorithm 1, line 4 “Add data to buffer: D ← D ∪ {(st, at, rt, st+1)}”),
Hu in view of Kunert and Chen are analogous art because both references concern methods for reinforcement Q-learning. Accordingly, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Hu/Kunert’s multi-agent reinforcement learning system to incorporate the model free environment taught by Chen. The motivation for doing so would have been to attain the performance of model-based algorithms as stated in Chen, page 1, section 1, ¶2 “The result indicates, that at least for the MuJoCo benchmark, simple model-free algorithms can attain the performance of current state-of-the-art model-based algorithms.”
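For reference, the buffer step quoted from Chen’s Algorithm 1, line 4 amounts to appending (st, at, rt, st+1) tuples to a replay store. A minimal sketch follows; the class and method names are assumptions, not taken from Chen or the claims.

```python
# Illustrative replay buffer holding (s_t, a_t, r_t, s_{t+1}) data sets.
# A sketch under assumed names; Chen's Algorithm 1 only specifies the
# set-union step D <- D ∪ {(s_t, a_t, r_t, s_{t+1})}.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)  # oldest entries evicted first

    def add(self, s, a, r, s_next):
        self.storage.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(list(self.storage), batch_size)
```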
Hu in view of Kunert in further view of Chen does not teach “the master agent configured to:
transmit the data set to the slave agent”
However, Kong teaches the master agent configured to (Kong, page 4, section 3, ¶1 “We start by reiterating that the key idea is to facilitate both an explicit master controller that takes the centralized perspective and organize agents in a global or high level manner and all actual slave controllers work as the decentralized agents and optimize their specific actions relatively locally while depending on information from the master controller.”):
transmit the data set to the slave agent (Kong, page 4, section 3.1, ¶2 “Meanwhile the master agent also take as input some information from each slave agent ci and broadcasts back its action output to all agents to help forming their final actions.”),
Hu in view of Kunert in further view of Chen and Kong are analogous art because both references concern methods for multi-agent reinforcement learning with a master/slave architecture. Accordingly, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Hu/Kunert/Chen’s multi-agent reinforcement learning to incorporate the transmission of data taught by Kong. The motivation for doing so would have been to have more global states for training as stated in Kong, page 4, section 3.1, ¶1 “This design has two benefits: 1) one can now input independent and potentially more global states to the master agent…”.
Regarding claim 2:
Hu in view of Kunert in view of Chen in further view of Kong teaches The multi-agent-based reinforcement learning system of claim 1, wherein the master agent is configured to transmit the data set to the slave agent with a preset probability (Hu, ¶50 “In order to utilize more unexpected samples, each experience tuple sample in the experience replay buffer is characterized in terms of a sampling probability based on its unexpectedness.” Here, the sampling probability can be considered the preset probability).
Regarding claim 3:
Hu in view of Kunert in view of Chen in further view of Kong teaches The multi-agent-based reinforcement learning system of claim 2, wherein the preset probability is configured to decrease in proportion to a number of slave agents (Hu, ¶50 “In order to utilize more unexpected samples, each experience tuple sample in the experience replay buffer is characterized in terms of a sampling probability based on its unexpectedness.” Here, as the number of agents increases, the sampling probability of any experience tuple will decrease).
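Neither Hu nor the claims specify how such a probability would be computed. Purely as a hypothetical sketch of a preset transmission probability that decreases in proportion to the number of slave agents (claims 2-3), one possible reading is:

```python
import random

def maybe_transmit(data_set, slaves, base_probability=1.0):
    # Hypothetical illustration only: the transmission probability shrinks
    # as the number of slave agents grows. The receive() hook on each slave
    # is assumed for this sketch and is not from any cited reference.
    p = base_probability / max(len(slaves), 1)
    for slave in slaves:
        if random.random() < p:
            slave.receive(data_set)
```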
Regarding claim 4:
Hu in view of Kunert in view of Chen in further view of Kong teaches The multi-agent-based reinforcement learning system of claim 2, wherein the master agent is configured to update the Q-function matched with the slave agent among the plurality of Q-functions with a Q-function obtained from the slave agent (Kunert: “After completion the merging process, AM contains the knowledge of all slave agents. Using the data structures of AM, a Q-matrix for AM is computed by means of the post optimization according to section II. Thus, a behavior strategie a = Q(st) can be derived, which has to be transformed into a robot control as described in [2].”).
Hu and Kunert are analogous art because both references concern methods for multi-agent reinforcement learning. Accordingly, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Hu’s multi-agent reinforcement learning system to incorporate the matched Q-functions taught by Kunert. The motivation for doing so would have been to accelerate learning time, as stated in Kunert, page 3, col 2, section V, ¶1: “To accelerate the learning time, a method for distributed learning with a master/slave approach is introduced.”
Regarding claim 5:
Hu in view of Kunert in view of Chen in further view of Kong teaches The multi-agent-based reinforcement learning system of claim 1, wherein the master agent is configured to extract a preset number of Q-functions randomly from among the plurality of Q-functions and learn the extracted Q-functions (Chen, page 3, algorithm 1, lines 7-8: “Sample a set M of M distinct indices from {1, 2, . . . , N}. Compute the Q target y (same for all of the N Q-functions)”; the Q-target equation itself is reproduced in the record as an image).
Hu in view of Kunert and Chen are analogous art because both references concern methods for reinforcement Q-learning. Accordingly, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Hu/Kunert’s multi-agent reinforcement learning system to incorporate the Q-function extraction taught by Chen. The motivation for doing so would have been to attain the performance of model-based algorithms as stated in Chen, page 1, section 1, ¶2 “The result indicates, that at least for the MuJoCo benchmark, simple model-free algorithms can attain the performance of current state-of-the-art model-based algorithms.”
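For context, the randomized-ensemble step cited above (drawing M distinct Q-functions from the ensemble of N and bootstrapping from the minimum of their estimates) can be sketched as follows. This is a simplified reading of Chen’s Algorithm 1 that omits the entropy term appearing in Chen’s full Q-target; all names are assumptions.

```python
import random

def redq_style_target(q_functions, r, s_next, a_next, gamma=0.99, m=2):
    # Simplified sketch: sample m distinct Q-functions from the ensemble
    # and take the minimum of their estimates as the bootstrap value.
    # Chen's actual target also subtracts an entropy term (SAC-style),
    # omitted here for brevity.
    subset = random.sample(q_functions, m)
    min_estimate = min(q(s_next, a_next) for q in subset)
    return r + gamma * min_estimate
```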
Regarding claim 6:
Hu in view of Kunert in view of Chen in further view of Kong teaches The multi-agent-based reinforcement learning system of claim 1, wherein the master agent is configured to perform randomized ensembled double Q-learning based on the data set stored in the second buffer (Chen, page 2, section 2, ¶2 “We propose Randomized Ensembled Double Q-learning (REDQ), a novel model-free algorithm whose sample-efficiency performance is just as good as, if not better than, the state-of-the-art model-based algorithm for the MuJoCo benchmark. The pseudocode for REDQ is shown in Algorithm 1. REDQ can be used with any standard off-policy model-free algorithm, such as SAC (Haarnoja et al., 2018b), SOP (Wang et al., 2019), TD3 (Fujimoto et al., 2018), or DDPG (Lillicrap et al., 2015).”).
It would have been obvious to combine the teachings of Hu/Kunert/Kong and Chen for the reasons set forth in connection with claim 1 above.
Regarding claim 7:
Hu in view of Kunert in view of Chen in further view of Kong teaches The multi-agent-based reinforcement learning system of claim 1, wherein the master agent is configured to be installed in a cloud server (Kong, page 2, ¶3 “One can consider the master agent as the central control of some organized traffic systems and the slave agents as each actual vehicles” here, the central control can be considered the cloud server).
Hu in view of Kunert in further view of Chen and Kong are analogous art because the references concern methods for multi-agent reinforcement learning with a master/slave architecture. Accordingly, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Hu/Kunert/Chen’s multi-agent reinforcement learning to incorporate the client/server architecture for vehicles taught by Kong. The motivation for doing so would have been to incorporate the system into real-world applications, as stated in Kong, page 2, ¶3: “That is, the master agent tends to plan in a global manner without focusing on potentially distracting details from each slave agent and meanwhile the slave agents often locally optimize their actions with respect to both their local state and the guidance coming from the master agent. Such idea can be well motivated from many real world systems. One can consider the master agent as the central control of some organized traffic systems and the slave agents as each actual vehicles.”
Regarding claim 9:
Hu in view of Kunert in view of Chen in further view of Kong teaches The multi-agent-based reinforcement learning system of claim 1, wherein the slave agent is configured to be installed in a vehicle terminal (Kong, page 2, ¶3 “One can consider the master agent as the central control of some organized traffic systems and the slave agents as each actual vehicles”).
Hu in view of Kunert in further view of Chen and Kong are analogous art because the references concern methods for multi-agent reinforcement learning with a master/slave architecture. Accordingly, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Hu/Kunert/Chen’s multi-agent reinforcement learning to incorporate the client/server architecture for vehicles taught by Kong. The motivation for doing so would have been to incorporate the system into real-world applications, as stated in Kong, page 2, ¶3: “That is, the master agent tends to plan in a global manner without focusing on potentially distracting details from each slave agent and meanwhile the slave agents often locally optimize their actions with respect to both their local state and the guidance coming from the master agent. Such idea can be well motivated from many real world systems. One can consider the master agent as the central control of some organized traffic systems and the slave agents as each actual vehicles.”
Regarding claim 10:
Hu in view of Kunert in view of Chen in further view of Kong teaches The multi-agent-based reinforcement learning system of claim 1, wherein the data set includes a state (st) at a time (t), an action (at) selected in the state (st), a reward (rt) for the action (at), and a new state (st+1) changed by the action (at) (Chen, Page 3, algorithm 1, line 4 “Add data to buffer: D ← D ∪ {(st, at, rt, st+1)}”).
It would have been obvious to combine the teachings of Hu/Kunert/Kong and Chen for the reasons set forth in connection with claim 1 above.
Regarding claim 11:
Hu teaches A multi-agent-based reinforcement learning method comprising (Hu, claim 1 “A method for training multiple reinforcement learning (RL) agents deployed in a shared environment, the method comprising”):
storing, by a master agent, a data set collected in each state of a second environment in a second buffer (Hu, ¶57 “The experience tuple of each agent is saved on a shared experience replay buffer.” Here, each agents environment can be considered a second environment, and the shared experience replay buffer can be considered the second buffer),
storing, by the slave agent, a data set collected in each state of a first environment and the data set received from the master agent in a first buffer (Hu, ¶84 “The method further includes updating 815 the global policy based on sampled experience tuples drawn from the shared experience replay buffer and distributing 820 the updated global policy to the multiple RL agents, wherein each of the multiple RL agents behave based at least in part on the updated global policy in the next episode.”), and
performing, by the master agent, reinforcement learning based on the data set stored in the second buffer (Hu, ¶37 “The shared experience replay buffer is utilized to train a single global unified policy to which each agent adheres when each RL agent interacts in the environment.”).
Hu does not teach “learning, by the slave agent, a Q-function based on the data set stored in the first buffer;
updating, by the master agent, a Q-function matched with the slave agent among a plurality of Q-functions;”
However, Kunert teaches learning, by the slave agent, a Q-function based on the data set stored in the first buffer (Kunert, page 2, col 1, ¶3 “Then, the agent learns with (1) and accumulates its knowledge in the Q-matrix. Q(st , a) = r(st, a) + γ ∗ (max(Q(st+1, ∀a ∈ A))) (1)”);
updating, by the master agent, a Q-function matched with the slave agent among a plurality of Q-functions (Kunert: “After completion the merging process, AM contains the knowledge of all slave agents. Using the data structures of AM, a Q-matrix for AM is computed by means of the post optimization according to section II. Thus, a behavior strategie a = Q(st) can be derived, which has to be transformed into a robot control as described in [2].”)
Hu and Kunert are analogous art because both references concern methods for multi-agent reinforcement learning. Accordingly, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Hu’s multi-agent reinforcement learning system to incorporate the slave agents taught by Kunert. The motivation for doing so would have been to accelerate learning time, as stated in Kunert, page 3, col 2, section V, ¶1: “To accelerate the learning time, a method for distributed learning with a master/slave approach is introduced.”
Hu in view of Kunert does not teach “storing, by the slave agent, a data set collected in each state of a first environment and the data set received from the master agent in a first buffer”
However, Chen teaches storing, by the slave agent, a data set collected in each state of a first environment and the data set received from the master agent in a first buffer (Chen, Page 3, algorithm 1, line 4 “Add data to buffer: D ← D ∪ {(st, at, rt, st+1)}”),
Hu in view of Kunert and Chen are analogous art because both references concern methods for reinforcement Q-learning. Accordingly, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Hu/Kunert’s multi-agent reinforcement learning system to incorporate the model free environment taught by Chen. The motivation for doing so would have been to attain the performance of model-based algorithms as stated in Chen, page 1, section 1, ¶2 “The result indicates, that at least for the MuJoCo benchmark, simple model-free algorithms can attain the performance of current state-of-the-art model-based algorithms.”
Hu in view of Kunert in further view of Chen does not teach “transmitting, by the master agent, the data set to a slave agent”
However, Kong teaches transmitting, by the master agent, the data set to a slave agent (Kong, page 4, section 3.1, ¶2 “Meanwhile the master agent also take as input some information from each slave agent ci and broadcasts back its action output to all agents to help forming their final actions.”),
Hu in view of Kunert in further view of Chen and Kong are analogous art because both references concern methods for multi-agent reinforcement learning with a master/slave architecture. Accordingly, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Hu/Kunert/Chen’s multi-agent reinforcement learning to incorporate the transmission of data taught by Kong. The motivation for doing so would have been to have more global states for training as stated in Kong, page 4, section 3.1, ¶1 “This design has two benefits: 1) one can now input independent and potentially more global states to the master agent…”.
Regarding claims 12-16 and 18:
Claims 12-16 and 18 are rejected under the same rationale as claims 2-6 and 10, respectively.
Claims 8 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Hu in view of Kunert in view of Chen in view of Kong, and further in view of Hasselt (“Double Q-learning”, 2010).
Regarding claim 8:
Hu in view of Kunert in view of Chen in further view of Kong teaches The multi-agent-based reinforcement learning system of claim 1,
Hu in view of Kunert in view of Chen in further view of Kong does not teach “wherein the slave agent is configured to perform double Q-learning based on the data set stored in the first buffer”
However, Hasselt teaches wherein the slave agent is configured to perform double Q-learning based on the data set stored in the first buffer (Hasselt, page 5, algorithm 1; the double Q-learning pseudocode is reproduced in the record as an image).
Hu in view of Kunert in view of Chen in further view of Kong and Hasselt are analogous art because the references concern methods for Q-learning. Accordingly, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Hu/Kunert/Chen/Kong’s multi-agent Q-learning system to incorporate the double Q-learning taught by Hasselt. The motivation for doing so would have been to converge to the optimal policy, as stated in Hasselt, page 4, section 3.1, ¶1: “In this subsection we show that in the limit Double Q-learning converges to the optimal policy.”
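For context, Hasselt’s tabular double Q-learning (Algorithm 1, reproduced as an image above) maintains two estimators and decouples action selection from action evaluation. A minimal sketch under assumed names follows.

```python
import random
from collections import defaultdict

def double_q_update(qa, qb, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    # Sketch of one tabular double Q-learning step (after Hasselt 2010):
    # with probability 0.5 the roles of the two tables are swapped, the
    # greedy action is selected with one table, and evaluated with the other.
    if random.random() < 0.5:
        qa, qb = qb, qa
    a_star = max(actions, key=lambda act: qa[(s_next, act)])
    qa[(s, a)] += alpha * (r + gamma * qb[(s_next, a_star)] - qa[(s, a)])

# Example: tables default to 0.0 for unseen (state, action) pairs.
qa, qb = defaultdict(float), defaultdict(float)
double_q_update(qa, qb, s=0, a=1, r=1.0, s_next=2, actions=[0, 1])
```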
Regarding claim 17:
Claim 17 is rejected under the same rationale as claim 8.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Xue et al. (“A Resource-Constrained and Privacy-Preserving Edge-Computing-Enabled Clinical Decision System: A Federated Reinforcement Learning Approach”, 8 February 2021) teaches a federated framework (FDFF) for aggregating models of DDQN and extracting the knowledge from EMRs across all ENs.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JACOB Z SUSSMAN MOSS whose telephone number is (571) 272-1579. The examiner can normally be reached Monday - Friday, 9 a.m. - 5 p.m. ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki can be reached on (571) 272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/J.S.M./Examiner, Art Unit 2122
/KAKALI CHAKI/Supervisory Patent Examiner, Art Unit 2122