Notice of AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 12/11/2025 has been entered.
Response to Amendment
Applicant’s Amendment and remarks dated 12/11/2025 have been considered. Claims 17-18 are cancelled. Claims 1-16 and 19-20 are pending.
35 U.S.C. 101 Rejections. The rejections of all pending claims under 35 U.S.C. 101 have been withdrawn. See “Response to Arguments” section.
Response to Arguments
On pages 9-12 of Applicant’s 12/11/2025 Amendment and remarks, with respect to the rejections under 35 U.S.C. 101 (Step 2A, Prong Two), Applicant argues that the claims as amended pertain to improvements. In particular, claim 1 is directed to improvements to “computerized distributed machine learning” with respect to cloud computing, and claims 19-20 are directed to improvements to “computerized distributed machine learning” with respect to autonomous driving.
The examiner finds Applicant’s arguments persuasive. In particular, as noted by Applicant, paras. 0017-0018 describe improvements with respect to decentralized machine learning, including a “linear time performance increase resulting from the increase in computational resources.” These improvements are reflected in the newly added claim limitations, which explain that each agent resides on a computing node (claim 1) or vehicle (claims 19-20) and cooperates with the other agents to perform tasks, in order to achieve the recited “team reward” subject to the recited constraints. Therefore, the independent claims are found to be subject-matter eligible, and all rejections under 35 U.S.C. 101 are withdrawn.
On page 12 of Applicant’s 12/11/2025 Amendment and remarks, Applicant asserts that the claim amendments are supported by original claims 17 and 18 and paras. 00106-00107 of the instant specification.
The examiner agrees that the portions of the disclosure identified by Applicant provide sufficient written description support for the claim amendments.
On page 12 of Applicant’s 12/11/2025 Amendment and remarks, with respect to the rejections of claims 1-16 and 19-20 under 35 U.S.C. 102 as anticipated by the LU reference, Applicant argues that the subject rejections are overcome by the claim amendments.
The examiner agrees. All rejections under 35 U.S.C. 102 are hereby withdrawn. However, new grounds of rejection under 35 U.S.C. 103 are provided herein.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-16 are rejected under 35 U.S.C. 103 as being unpatentable over Lu, Songtao, et al. "Decentralized Policy Gradient Descent Ascent for Safe Multi-Agent Reinforcement Learning." (2021), pp. 8767-8775, hereinafter referenced as LU, in view of US 20210007023 A1, hereinafter referenced as UMAPATHY, and further in view of US 20210405727 A1, hereinafter referenced as SINGH.
Regarding Claim 1
LU teaches:
A reinforcement learning system, comprising: (LU, p. 8768, “Main Contributions” section: “we propose the first safe decentralized policy gradient (PG) descent and ascent algorithm, i.e., Safe Dec-PG, which is able to deal with a class of multi-agent safe RL problems over a graph.”; Examiner’s Note (EN): page 8767, Introduction, explains that “RL” is an acronym for reinforcement learning)
a plurality of agents, each agent implemented in software, having an individual reward function and one or more safety constraints that involve joint actions of the agents, (LU, p. 8769, “Multi-Agent Constrained Markov Decision Process (M-CMDP)” section: “Consider a team of n agents operating in a common environment, denoted by N = [n]. No central controller exists to either make the decisions or collect any information for the agents. ... each agent has some safety constraints, in the forms of bounds on some long term cost, that involve the joint policy of all agents. ... Each agent i only has access to its own reward”;
LU, p. 8772: “Numerical Results” section: “To show the performance of safe decentralized RL, we test our algorithm .... on the popular OpenAI Gym paradigm. ... The experiments were run on the NVIDIA Tesla V100 GPU with 32GB memory.”;
(EN): Definition 1 on p. 8769 further discloses that each agent has a local reward function; the examiner notes that one of ordinary skill would understand that operating the experiments on a GPU requires software code for the GPU to execute, corresponding to recited “implemented in software” limitation)
wherein each agent maximizes a team-average long-term return in performing the joint actions, subject to the safety constraints, (LU, p. 8769, “Multi-Agent Constrained Markov Decision Process (M-CMDP)” section: “As a team, the objective of all agents is to collaboratively maximize the globally average return over the network ... with only its local observations of the rewards, subject to some safety constraints”)
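(EN): As a shorthand for the formulation quoted above, LU’s team objective may be sketched as follows (an illustrative paraphrase of LU, p. 8769, not the claim language; b denotes a bound on a long-term cost, and the inequality direction depends on LU’s sign convention for costs):

    max over θ of (1/n) Σ_{i=1}^{n} J_r^i(θ), subject to J_{c,j}^i(θ) ≥ b_j^i for all agents i ∈ [n] and constraints j ∈ [m],

where J_r^i is the long-term return under agent i’s local reward and J_{c,j}^i is the corresponding long-term cost.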
and participates in decentralized machine learning and in operating the physical system. (LU, p. 8768, “Main Contributions” section: “In this work, by leveraging the min-max saddle-point formulation, we propose the first safe decentralized policy gradient (PG) descent and ascent algorithm, i.e., Safe Dec-PG, which is able to deal with a class of multi-agent safe RL problems over a graph.”
LU, p. 8772: “Numerical Results” section: “The experiments were run on the NVIDIA Tesla V100 GPU with 32GB memory.”;
(EN): Fig. 1a discloses the network structure of a 5-agent system, where such system is implemented on the NVIDIA Tesla V100 GPU, corresponding to a “physical system”; decentralized reinforcement learning (RL) (see definition on page 8767) corresponds to recited “decentralized machine learning” limitation)
a peer-to-peer communication network configured to connect the plurality of agents and configured to implement the decentralized machine learning, each agent running on a corresponding node of the peer-to-peer communication network; and (LU, p. 8767, Abstract: “the agents are connected by a peer-to-peer communication network to share information with their neighbors”;
LU, p. 8769, “Multi-Agent Constrained Markov Decision Process (M-CMDP)” section: “Consider a team of n agents operating in a common environment,
denoted by N = [n]. No central controller exists to either make the decisions or collect any information for the agents. Agents are instead allowed to communicate with each other over a communication network G = (N; E), with E being the set of communication links that connect the agents. Such a decentralized model with networked agents finds broad applications in distributed cooperative control problems ... and has been advocated as one of the most popular paradigms in decentralized MARL. ... At each node, there would be multiple safety constraints.”;
LU, pp. 8772-73, “Numerical Results” section: “all agents are connected by a well-connected graph as shown in Figure 1(a), where every agent can only exchange their parameters θi with its neighbors through the communication channel (denoted by the green lines).”;
(EN): LU explains that communication network G has nodes for each agent, connected by edges (or communication links) E, corresponding to the recited “each agent running on a corresponding node of the peer-to-peer communication network” limitation; the decentralized MARL disclosure corresponds to the recited “configured to implement the decentralized machine learning” limitation.)
a distributed constrained Markov decision process (D-CMDP) model implemented over the peer-to-peer communication network (LU, p. 8769, “Multi-Agent Constrained Markov Decision Process (M-CMDP)” section:
[equation image: media_image1.png]
(EN): The M-CMDP of LU is a “distributed” CMDP because it is distributed across the different agents in the communication network)
and configured to perform policy optimization using a decentralized policy gradient (PG) method, (LU, p. 8767, Abstract: “In this work, we first formulate this problem as a distributed constrained Markov decision process (D-CMDP) with networked agents. Then, we propose a decentralized policy gradient (PG) method, Safe Dec-PG, to perform policy optimization based on this D-CMDP model over a network.”; LU, p. 8770, “Safe Policy Gradient” section: “The search for an optimal policy can thus be performed by applying the gradient descent-type iterative methods to the parametrized optimization problem.”; LU, p. 8771, “Safe Dec-PG: Safe Decentralized Policy Gradient” section, Algorithm 1)
wherein the participation of each agent in operating the physical system is based on the D-CMDP model; (LU, p. 8769, “Multi-Agent Constrained Markov Decision Process (M-CMDP)” section:
[equation image: media_image1.png]
(EN): The M-CMDP of LU is a “distributed” CMDP because it is distributed across the different agents in the communication network)
wherein: each agent resides on one of the computing nodes and each computing node is configured to operate in accordance with the corresponding distributed constrained Markov decision process (D-CMDP) model (LU, p. 8769, “Multi-Agent Constrained Markov Decision Process (M-CMDP)” section:
[equation image: media_image1.png]
(EN): The M-CMDP of LU is a “distributed” CMDP because it is distributed across the different agents in the communication network, and each agent pertains to a “node” as explained on p. 8769)
each computing node performs computing tasks subject to constraints including..., and data storage capacity; (LU, p. 8769, “Multi-Agent Constrained Markov Decision Process (M-CMDP)” section: “At each node, there would be multiple safety constraints. These rewards describe different objectives that the agent is required to achieve, such as remaining with a region of the state space, or not running out of memory/battery”;
(EN): LU discloses that each computing node is subject to constraints, including not running out of memory, corresponding to the recited “data storage capacity” constraint)
However, LU fails to explicitly teach:
a physical system comprising a plurality of computing nodes configured as a cloud computing environment
local computing capacity, inter-node communication capacity
a team reward aims to see that all computing tasks are performed in the shortest amount of time subject to the constraints.
However, in a related field of endeavor (optimizing communications networks using reinforcement learning, see para. 0063), UMAPATHY discloses:
a physical system comprising a plurality of computing nodes configured as a cloud computing environment (UMAPATHY, para. 0043: “The cloud 144 may represent a cloud computing architecture/platform that provides one or more cloud computing services. Cloud computing refers to a paradigm for enabling network access to a scalable and elastic pool of shareable computing resources with self-service provisioning and administration on-demand and without active management by users.”;
(EN): in combination with LU, the different agents of LU (see Fig. 1(a)) are each implemented on a separate node in a cloud computing environment as in UMAPATHY)
each computing node performs computing tasks subject to constraints including local computing capacity, inter-node communication capacity, and data storage capacity; (UMAPATHY, para. 0083: “This may involve, for example, evaluating both computation and communication resources needed for different data transfers, threshold criteria or a desired level of reliability, amount or type of compute node capabilities (e.g., a certain processor speed, memory/storage capacity, RAT circuitries implemented by the computing system 201, etc.), type of operational constraints under which the computing system 201 is operating (e.g., radio conditions and/or link quality, a surrounding environmental conditions (e.g., temperature, humidity, altitude, etc.), component temperature, etc.), and/or the like.”;
(EN): in combination with LU, each computing node is subject to constraints including compute node capabilities, communication resources needed, and storage capacity as taught by UMAPATHY; the examiner notes that one of ordinary skill would understand that node resources are not unlimited, and that any computing node would naturally have limitations related to processing power (e.g., limited by the number of processors and their processing speeds), communication capacity (e.g., limited by input/output ports and devices), and storage capacity (e.g., limited by actual storage constraints))
Before the effective filing date of the present application, it would have been obvious to combine the Safe Dec-PG system of LU with the teachings of UMAPATHY as explained above. One of ordinary skill would have been motivated to do so because UMAPATHY teaches that an “advantage” of using a reinforcement learning system is to help “tailor” “recommendations to suit the user's behavior and preferences, and/or suit the capabilities and/or operational constraints of the compute node.” (para. 0109). UMAPATHY further discloses the benefit of extending cloud services to fog services for IoT devices. (para. 0024).
However, LU and UMAPATHY fail to explicitly teach:
a team reward aims to see that all computing tasks are performed in the shortest amount of time subject to the constraints.
However, in a related field of endeavor (optimizing a system using reinforcement learning techniques, see para. 0014), SINGH teaches and makes obvious:
a team reward aims to see that all computing tasks are performed in the shortest amount of time subject to the constraints. (SINGH, para. 0072: “Based on the environment variables reported by the environment aggregator 515, the reinforcement learning logic center 519 takes optimal actions in order to maximize the aggregated value of the joint-reward function discussed in further detail below.”;
SINGH, para. 0075: “The joint reward function utilized by the performance monitor 517 will now be described. The joint-reward is calculated by the performance monitor 517 using components of the data center 500 environment that represent quantitatively the attributes that need to be increased or decrease (e.g., maximized or minimized) in order to achieve improved performance (e.g., optimality) of the data center 500.”;
(EN): SINGH discloses a joint-reward (corresponding to the recited “team reward”) used to achieve optimal data center performance; the LU-UMAPATHY-SINGH combination modifies the Safe Dec-PG system of LU to apply to the cloud computing environment of UMAPATHY, where there are constraints with respect to computing power, communication resources, and storage capacity as taught by UMAPATHY, and then uses a joint reward as in SINGH to optimize against the benchmark of the shortest amount of time, where such benchmark optimization is disclosed by SINGH).
Before the effective filing date of the present application, it would have been obvious to combine the Safe Dec-PG system of LU with the cloud computing teachings of UMAPATHY and SINGH as explained above. As disclosed by SINGH, one of ordinary skill would have been motivated to do so in order to use the techniques of SINGH to “maintain[] specified performance benchmarks utilizing a reinforcement learning framework.” (para. 0014). One of ordinary skill would further understand that the benefit of optimizing performance benchmarks can be applied to benchmarks other than power consumption, including task execution speed, and would have been motivated to seek faster task processing speeds in the data center environment of SINGH, which discloses that compute resources need to be managed and distributed. (para. 0002).
Regarding Claim 2
LU, UMAPATHY, and SINGH teach the system of claim 1. LU further teaches:
wherein the individual reward function and the safety constraints are known to the corresponding agent and unknown to the remaining agents. (LU, p. 8767, Abstract: “both the rewards and constraints are only known to each agent locally/privately.”; LU, p. 8769, “Multi-Agent Constrained Markov Decision Process (M-CMDP)” section: “The states s and actions a are globally observable, while the rewards and costs are observed locally/privately at each agent.”)
Regarding Claim 3
LU, UMAPATHY, and SINGH teach the system of claim 1. LU further teaches:
wherein each agent is configured to explore interactions with an environment to maximize a cumulative reward through a reinforcement learning process. (LU, p. 8767, Introduction, “where an agent explores the interactions with an environment so that it is able to maximize a cumulative reward through this learning process.”)
Regarding Claim 4
LU, UMAPATHY, and SINGH teach the system of claim 1. LU further teaches:
wherein the safety constraints are in a form of bounds on a long term cost associated with a joint policy of the agents. (LU, p. 8769, “Multi-Agent Constrained Markov Decision Process (M-CMDP)” section:
[equation image: media_image2.png]
(EN): the lower-bounds of the negative team-average discounted long-term return denotes the “costs” of an agent, and is related to the joint policy of the agents as explained above)
Regarding Claim 5
LU, UMAPATHY, and SINGH teach the system of claim 1. LU further teaches:
wherein each agent is described by a tuple
[equation image: media_image3.png]
where S comprises a state space shared by the agents, a graph G represents the peer-to-peer communication network,
[equation image: media_image4.png]
comprises a joint action space of the agents,
[equation image: media_image5.png]
comprise local rewards and cost functions of corresponding agent i, P : S × A × S → [0,1] comprises a state transition probability of the Markov decision process, and
γ ∈ (0,1) denotes a discount factor. (LU, p. 8769, “Multi-Agent Constrained Markov Decision Process (M-CMDP)” section:
[equation image: media_image1.png]
)
Regarding Claim 6
LU, UMAPATHY, and SINGH teach the system of claim 5. LU further teaches:
an objective of the agents is to collaboratively maximize a globally average return over the peer-to-peer communication network, dictated by
[equation image: media_image7.png]
, with only local observations of the local rewards, subject to the corresponding safety constraints dictated by
[equation image: media_image8.png]
, and wherein each agent is associated with m of the cost functions and
[equation image: media_image8.png]
comprises a mapping S x A to
ℝ^m [equation image: media_image9.png]
(LU, p. 8769, “Multi-Agent Constrained Markov Decision Process (M-CMDP)” section: “As a team, the objective of all agents is to collaboratively maximize the globally average return over the network (equivalently to minimize the opposite of it), dictated by
[equation image: media_image7.png]
, with only its local observations of the rewards, subject to some safety constraints dictated by
[equation image: media_image8.png]
. ... Here, we assume that each agent is associated with m cost functions, so
[equation image: media_image8.png]
is a mapping
from S x A to
ℝ^m [equation image: media_image9.png]
)
Regarding Claim 7
LU, UMAPATHY, and SINGH teach the system of claim 5. LU further teaches:
wherein the local rewards describe different objectives that the corresponding agent is to achieve. (LU, p. 8769, “Multi-Agent Constrained Markov Decision Process (M-CMDP)” section: “These rewards describe different objectives that the agent is required to achieve”)
Regarding Claim 8
LU, UMAPATHY, and SINGH teach the system of claim 5. LU further teaches:
wherein, at time t, each agent i selects an action
[equation image: media_image10.png]
given a state
[equation image: media_image11.png]
according to a corresponding local policy,
[equation image: media_image12.png]
, which is parametrized as
[equation image: media_image13.png]
[equation image: media_image14.png]
(LU, p. 8769, “Multi-Agent Constrained Markov Decision Process (M-CMDP)” section:
[equation image: media_image15.png]
)
Regarding Claim 9
LU, UMAPATHY, and SINGH teach the system of claim 5. LU further teaches:
wherein the joint policy πθ is:
[equation image: media_image16.png]
(LU, p. 8769, “Multi-Agent Constrained Markov Decision Process (M-CMDP)” section:
[equation image: media_image17.png]
)
Regarding Claim 10
LU, UMAPATHY, and SINGH teach the system of claim 6. LU further teaches:
[equation image: media_image18.png]
[equation image: media_image19.png]
[equation image: media_image20.png]
(LU, p. 8770, “Safe Policy Gradient” section:
[equation image: media_image21.png]
)
Regarding Claim 11
LU, UMAPATHY, and SINGH teach the system of claim 1. LU further teaches:
wherein parameters of the individual reward function for each agent are updated by:
[equation image: media_image22.png]
(LU, p. 8771, “Safe Dec-PG: Safe Decentralized Policy Gradient” section:
[equation image: media_image23.png]
)
Regarding Claim 12
LU, UMAPATHY, and SINGH teach the system of claim 11. LU further teaches:
wherein the variable
[equation image: media_image24.png]
is updated locally as:
[equation image: media_image25.png]
[equation image: media_image26.png]
(LU, p. 8771, “Safe Dec-PG: Safe Decentralized Policy Gradient” section:
[equation image: media_image27.png]
)
Regarding Claim 13
LU, UMAPATHY, and SINGH teach the system of claim 12. LU further teaches:
wherein the update is based on:
[equation image: media_image28.png]
[equation image: media_image29.png]
(LU, p. 8771, “Safe Dec-PG: Safe Decentralized Policy Gradient” section:
[equation image: media_image30.png]
)
Regarding Claim 14
LU, UMAPATHY, and SINGH teach the system of claim 1. LU further teaches:
wherein rewards considered in an individual reward function of the agents are based on a distance between a location of an agent and its desired landmark and a minimum distance between two agents of the plurality of agents. (LU, p. 8773, “Environment” section: “The rewards considered in the objective function include two parts: i) the first one is based on the distance between the location of the node to its desired landmark, which is a monotonically decreasing function of the distance, (i.e., the smaller the distance, the higher the reward will be); ii) the second one is determined by the minimum distance between two agents.”)
Regarding Claim 15
LU, UMAPATHY, and SINGH teach the system of claim 1. LU further teaches:
wherein two of the plurality of agents are penalized by a negative reward value in response to a distance between the two agents being lower than a defined threshold. (LU, p. 8773, “Environment” section: “If the distance between two agents is lower than a threshold, then we consider that a collision happens, and both of the agents will be penalized by a large negative reward value, i.e., -1.”)
Regarding Claim 16
LU, UMAPATHY, and SINGH teach the system of claim 15. LU further teaches:
wherein a reward at one of the two agents is scaled by positive coefficients representing a heterogeneity of different agents. (LU, p. 8773, “Environment” section: “the reward at each agent is further scaled by different positive coefficients, representing the heterogeneity, e.g., priority levels, of different agents.”)
Claims 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over LU in view of US 20210065547 A1, hereinafter referenced as RAN, and further in view of SINGH.
Regarding Claim 19
LU teaches:
generating a distributed constrained Markov decision process (D-CMDP) model (LU, p. 8769, “Multi-Agent Constrained Markov Decision Process (M-CMDP)” section:
[equation image: media_image1.png]
(EN): The M-CMDP of LU is a “distributed” CMDP because it is distributed across the different agents in the communication network)
configured to implement decentralized machine learning over a peer-to-peer communication network; (LU, p. 8767, Abstract: “the agents are connected by a peer-to-peer communication network to share information with their neighbors”;
LU, p. 8769, “Multi-Agent Constrained Markov Decision Process (M-CMDP)” section: “Consider a team of n agents operating in a common environment,
denoted by N = [n]. No central controller exists to either make the decisions or collect any information for the agents. Agents are instead allowed to communicate with each other over a communication network G = (N; E), with E being the set of communication links that connect the agents. Such a decentralized model with networked agents finds broad applications in distributed cooperative control problems ... and has been advocated as one of the most popular paradigms in decentralized MARL. ... At each node, there would be multiple safety constraints.”;
LU, pp. 8772-73, “Numerical Results” section: “all agents are connected by a well-connected graph as shown in Figure 1(a), where every agent can only exchange their parameters θi with its neighbors through the communication channel (denoted by the green lines).”;
(EN): LU explains that communication network G has nodes for each agent, connected by edges (or communication links) E, corresponding to the recited “each agent running on a corresponding node of the peer-to-peer communication network” limitation; the decentralized MARL disclosure corresponds to the recited “configured to implement the decentralized machine learning” limitation.)
configured to perform policy optimization using a decentralized policy gradient (PG) method; (LU, p. 8767, Abstract: “In this work, we first formulate this problem as a distributed constrained Markov decision process (D-CMDP) with networked agents. Then, we propose a decentralized policy gradient (PG) method, Safe Dec-PG, to perform policy optimization based on this D-CMDP model over a network.”; LU, p. 8770, “Safe Policy Gradient” section: “The search for an optimal policy can thus be performed by applying the gradient descent-type iterative methods to the parametrized optimization problem.”;
LU, p. 8771, “Safe Dec-PG: Safe Decentralized Policy Gradient” section, Algorithm 1)
the peer-to-peer communication network configured to connect a plurality of software-based agents that are configured to participate in the decentralized machine learning, each agent implemented in software and running on a corresponding node of the peer-to-peer communication network (LU, p. 8767, Abstract: “the agents are connected by a peer-to-peer communication network to share information with their neighbors”;
LU, p. 8769, “Multi-Agent Constrained Markov Decision Process (M-CMDP)” section: “Consider a team of n agents operating in a common environment,
denoted by N = [n]. No central controller exists to either make the decisions or collect any information for the agents. Agents are instead allowed to communicate with each other over a communication network G = (N; E), with E being the set of communication links that connect the agents. Such a decentralized model with networked agents finds broad applications in distributed cooperative control problems ... and has been advocated as one of the most popular paradigms in decentralized MARL. ... At each node, there would be multiple safety constraints.”;
LU, p. 8772: “Numerical Results” section: “To show the performance of safe decentralized RL, we test our algorithm .... on the popular OpenAI Gym paradigm. ... The experiments were run on the NVIDIA Tesla V100 GPU with 32GB memory.”;
LU, pp. 8772-73, “Numerical Results” section: “all agents are connected by a well-connected graph as shown in Figure 1(a), where every agent can only exchange their parameters θi with its neighbors through the communication channel (denoted by the green lines).”;
(EN): LU explains that communication network G has nodes for each agent, connected by edges (or communication links) E, corresponding to recited “each agent ... running on a corresponding node of the peer-to-peer communication network” limitation and the decentralized MARL disclosure corresponds to the recited “configured to participate in the decentralized machine learning” limitation and the operating the experiments on a GPU requires software code for the GPU to execute, corresponding to recited “implemented in software” and “software-based agents” limitations)
maximizing a team-average long-term return in performing one or more joint actions, subject to one or more safety constraints, based on an individual reward function; and (LU, p. 8769, “Multi-Agent Constrained Markov Decision Process (M-CMDP)” section: “As a team, the objective of all agents is to collaboratively maximize the globally average return over the network ... with only its local observations of the rewards, subject to some safety constraints”)
participating in operating a physical system ... based on the D-CMDP model, ... (LU, p. 8772: “Numerical Results” section: “The experiments were run on the NVIDIA Tesla V100 GPU with 32GB memory.”; (EN): Fig. 1a discloses the network structure of a 5-agent system, where such system is implemented on the NVIDIA Tesla V100 GPU, corresponding to a “physical system”)
wherein: each agent is configured to participate in operating the given one of the vehicles in accordance with the corresponding distributed constrained Markov decision process (D-CMDP) model; (LU, p. 8769, “Multi-Agent Constrained Markov Decision Process (M-CMDP)” section:
[equation image: media_image1.png]
(EN): The M-CMDP of LU is a “distributed” CMDP because it is distributed across the different agents in the communication network, and each agent pertains to a “node” as explained on p. 8769)
However, LU fails to explicitly teach:
comprising a plurality of vehicles in an autonomous vehicle system ... each agent corresponding to a given one of the vehicles;
each vehicle is attempting to reach a destination subject to constraints including gas capacity, minimizing risk of collision, and inter-agent messaging capacity constraints; and
a team reward aims to see that each vehicle reaches its destination in a shortest amount of time subject to the constraints.
However, in a related field of endeavor (distributed driving systems for connected vehicles, see para. 0004), RAN teaches and makes obvious:
comprising a plurality of vehicles in an autonomous vehicle system ... each agent corresponding to a given one of the vehicles; (RAN, para. 0004: “Provided herein is technology related to a distributed driving system (DDS) that provides transportation management and operations and vehicle control for connected and automated vehicles (CAV) and intelligent road infrastructure systems (IRIS) and particularly, but not exclusively, to methods and systems for sending individual vehicles with customized, detailed, and time-sensitive control instructions and traffic information for automated vehicle driving, such as vehicle following, lane changing, route guidance, and other related information.”;
(EN): in combination with LU, the different agents of LU (see Fig. 1(a)) are each implemented on a separate vehicle in a distributed driving system as disclosed by RAN)
each vehicle is attempting to reach a destination subject to constraints including gas capacity, minimizing risk of collision, and inter-agent messaging capacity constraints; and
(RAN, para. 0006: “In some embodiments, the DDS is configured to provide on-demand and dynamic IRT functions to individual CAVs to avoid trajectory conflicts with other vehicles (e.g., collision avoidance) and/or to adjust vehicle route and/or trajectory for abnormal driving environments (e.g., weather events, natural disasters, traffic accidents, etc.) ... In some embodiments, the DUM module is configured to optimize resources by CAVs at various vehicle intelligence levels by optimizing a cost function (e.g., identifying an optimal minimum of the cost function) describing the total cost to implement an automated driving system as a sum of functions (e.g., functions providing positive values) for computation ability cost (C), number of computational units cost (NU), fuel consumption cost (P), climate control and/or driver comfort (e.g., acceleration and/or deceleration) cost (V), and/or IRT cost (I)”;
RAN, para. 0046: “As used herein, the term “connected and autonomous vehicle” or “CAV” refers to an autonomous vehicle that is able to communicate with other vehicles (e.g., by V2V communication)”
RAN, para. 0063: “In some embodiments, the driver input data comprises origin of the trip, destination of the trip, expected travel time, and/or service requests.”
(EN): in combination with LU, each vehicle is subject to constraints including fuel consumption, vehicle-to-vehicle communication and computational unit costs, and collision avoidance, as taught by RAN; the examiner notes that one of ordinary skill would understand that vehicle resources are not unlimited, and that any vehicle would naturally have limitations related to fuel consumption (e.g., limited by the size of a fuel tank), vehicle-to-vehicle communication constraints (e.g., limited by distance and computational power), and collision avoidance (e.g., avoiding accidents at all costs))
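(EN): The cost function quoted from RAN, para. 0006, may be summarized schematically as a sum of positive-valued component functions. The additive form below is the examiner's shorthand for RAN's description, using RAN's own component labels (C, NU, P, V, I); the individual functions f are not specified by RAN:

```latex
\min\;\; J_{\text{total}}
  \;=\; f_{1}(C) \;+\; f_{2}(NU) \;+\; f_{3}(P) \;+\; f_{4}(V) \;+\; f_{5}(I)
```

where C is computation ability cost, NU is number-of-computational-units cost, P is fuel consumption cost, V is climate control and/or driver comfort cost, and I is IRT cost, each as defined in RAN, para. 0006.)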
Before the effective filing date of the claimed invention, it would have been obvious to combine the Safe Dec-PG system of LU with the teachings of RAN as explained above. As disclosed by RAN, one of ordinary skill would have been motivated to do so because RAN teaches techniques for improving the “safety and stability of individual CAVs according to the needs of individual CAVs.” (para. 0007). One of ordinary skill would further understand the benefit of applying the teachings of LU to the autonomous vehicle use case as demonstrated by RAN.
However, LU and RAN fail to explicitly teach:
a team reward aims to see that each vehicle reaches its destination in a shortest amount of time subject to the constraints.
However, in a related field of endeavor (optimizing a system using reinforcement learning techniques, see para. 0014), SINGH teaches and makes obvious:
a team reward aims to see that each vehicle reaches its destination in a shortest amount of time subject to the constraints. (SINGH, para. 0072: “Based on the environment variables reported by the environment aggregator 515, the reinforcement learning logic center 519 takes optimal actions in order to maximize the aggregated value of the joint-reward function discussed in further detail below.”;
SINGH, para. 0075: “The joint reward function utilized by the performance monitor 517 will now be described. The joint-reward is calculated by the performance monitor 517 using components of the data center 500 environment that represent quantitatively the attributes that need to be increased or decrease (e.g., maximized or minimized) in order to achieve improved performance (e.g., optimality) of the data center 500.”;
(EN): SINGH discloses a joint-reward (corresponding to the recited “team reward”) used so that optimal data center performance is achieved; the LU-RAN-SINGH combination thus modifies the Safe Dec-PG system of LU to apply to the autonomous vehicle environment of RAN, in which there are constraints on fuel, vehicle-to-vehicle communication resources, and collision avoidance as taught by RAN, and then uses a joint-reward as in SINGH to optimize against RAN's benchmark of the shortest time to destination, where such benchmark-based optimization is disclosed by SINGH).
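(EN): The joint-reward mapping described above may be summarized schematically as follows. This is the examiner's shorthand for the claimed combination, not an equation appearing in LU, RAN, or SINGH; the symbols (travel time T, fuel budget G, messaging budget B, collision-risk tolerance ε) are the examiner's notation:

```latex
\max\;\; R_{\text{team}}
  \;=\; -\frac{1}{N}\sum_{i=1}^{N} T_{i}
\quad \text{subject to} \quad
  g^{i}_{\text{fuel}} \le G_{i},\;\;
  g^{i}_{\text{msg}} \le B_{i},\;\;
  p^{i}_{\text{collision}} \le \epsilon,
\qquad i = 1,\dots,N
```

where T_i is the travel time of vehicle i to its destination, so that maximizing the team reward corresponds to each vehicle reaching its destination in the shortest amount of time subject to the recited constraints.)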
Before the effective filing date of the claimed invention, it would have been obvious to combine the Safe Dec-PG system of LU and the distributed driving teachings of RAN with the teachings of SINGH as explained above. As disclosed by SINGH, one of ordinary skill would have been motivated to do so in order to use the techniques of SINGH to “maintain[] specified performance benchmarks utilizing a reinforcement learning framework.” (para. 0014). One of ordinary skill would further understand that the benefit of optimizing performance benchmarks can be applied to benchmarks other than power consumption, including vehicle arrival times.
Regarding Claim 20
LU teaches:
A computer program product for federated learning, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computer to cause the computer to perform a method comprising: (LU, p. 8772: “Numerical Results” section: “The experiments were run on the NVIDIA Tesla V100 GPU with 32GB memory.”)
The remaining limitations of claim 20 correspond to the method of claim 19; therefore, claim 20 is rejected for the same reasons explained above with respect to claim 19.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
US 20210342836 A1 (Cella). “Certain considerations to determining an AI solution, and/or configuring operations to perform a selection and/or configuration operation for an AI solution include, without limitation: ... available computing resources such as processing capability, network communication capability, and/or memory storage capability (e.g., to support initial data, training data, input data such as cached, buffered, or stored input data, iterative improvement state data, output data such as cached, buffered, or stored output data, and/or intermediate data storage, such as data to support ongoing calculations, historical data, and/or accumulation data)” (para. 0335).
US 20130041576 A1 (Switkes). “To achieve the foregoing and in accordance with the present invention, systems and methods for a Semi-Autonomous Vehicular Convoying are provided. In particular the systems and methods for 1) A close following distance to save significant fuel, 2) Safety in the event of emergency maneuvers by the leading vehicle, 3) Safety in the event of component failures in the system, 4) An efficient mechanism for vehicles to find a partner vehicle to follow or be followed by 5) An intelligent ordering of the vehicles based on several criteria, 6) Other fuel economy optimizations made possible by the close following, 7) Control algorithms to ensure smooth, comfortable, precise maintenance of the following distance, 8) Robust failsafe mechanical hardware, 9)Robust failsafe electronics and communication, 10) Other communication between the vehicles for the benefit of the driver, 11) Prevention of other types of accidents unrelated to the close following mode, 12) A simpler system to enable a vehicle to serve as a leading vehicle without the full system” (para. 0010).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHAEL C LEE whose telephone number is (571)272-4933. The examiner can normally be reached M-F 12:00 pm - 8:00 pm ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Omar Fernandez Rivas can be reached at 571-272-2589. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/MICHAEL C. LEE/Examiner, Art Unit 2128