Last updated: July 17, 2026
Application No. 17/499,815
DECENTRALIZED POLICY GRADIENT DESCENT AND ASCENT FOR SAFE MULTI-AGENT REINFORCEMENT LEARNING

Non-Final OA §103
Filed
Oct 12, 2021
Examiner
LEE, MICHAEL CHRISTOPHER
Art Unit
2128
Tech Center
2100 — Computer Architecture & Software
Assignee
Rensselaer Polytechnic Institute
OA Round
4 (Non-Final)
Interview Optional

— +25.8% interview lift. Interview already conducted in this application's prosecution history. This examiner has a 62% grant rate with +25.8% interview lift. Since an interview has already been tried, recommend written response with narrowed claims based on precedent claim evolution patterns.
Based on 149 resolved cases, 2023–2026
Examiner Intelligence

LEE, MICHAEL CHRISTOPHER View full profile →
Grants 62% of resolved cases
Career Allowance Rate
93 granted / 149 resolved
+7.4% vs TC avg
Strong +26% interview lift
Without
With
+25.8%
Interview Lift
resolved cases with interview
Typical timeline
3y 3m
Avg Prosecution
47 currently pending
Career history
197
Total Applications
across all art units
Statute-Specific Performance

§101
18.5%
-21.5% vs TC avg
§103
76.3%
+36.3% vs TC avg
§102
0.7%
-39.3% vs TC avg
§112
4.2%
-35.8% vs TC avg
Black line = Tech Center average estimate • Based on career data from 149 resolved cases
Office Action

§103
DETAILED ACTION
Notice of AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .
Response to Amendment
Applicant’s Amendment and remarks dated 4/15/2026 have been considered.  Claims 1-16 and 19-20 are pending.
Non-Final Action.  Applicant’s arguments on pages 8-10 of its 4/15/2026 Amendment and remarks with respect to the rejection of claim 1 under 35 U.S.C. 103 have been reviewed and are persuasive.  While new grounds of rejection are set forth herein, because such rejections were not necessitated by any claim amendments, this action is Non-Final.
Response to Arguments
On page 8 of Applicant’s 4/15/2026 Amendment and remarks, Applicant asserts that the amendments to claims 19 and 20 are supported by at least paras. 0107 and 0109 of the instant specification.
The examiner agrees that the portions of the disclosure identified by Applicant provide sufficient written description disclosure.

On pages 9-10 of Applicant’s 4/15/2026 Amendment and remarks, with respect to the rejection to claims 1, 19, and 20 under 35 U.S.C. 103, Applicant argues that the SINGH reference does not teach the “a team reward aims to see that all computing tasks are performed in the shortest amount of time subject to the constraints.”
Applicant’s arguments have been considered and are persuasive.  While SINGH clearly discloses a joint-reward function for optimizing performance, SINGH does not explicitly teach optimizing with respect to the shortest amount of time.  While such a modification is likely within the capabilities of one of ordinary skill in the art at the time, in view of a better reference found by the examiner (YOUNG), new grounds of rejection are set forth herein with respect to this limitation in each of the independent claims.

On pages 10-12 of Applicant’s 4/15/2026 Amendment and remarks, with respect to the rejection to claims 19 and 20 under 35 U.S.C. 103, as amended Applicant argues that the RAN reference does not teach the “each vehicle is attempting to reach a destination subject to constraints including inter-agent messaging capacity constraints.”
The examiner agrees that as amended, RAN does not specifically teach this limitation.  However, new grounds of rejection in view of the GE reference are provided herein.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1-16 are rejected under 35 U.S.C. 103 as being unpatentable over Lu, Songtao, et al. "Decentralized Policy Gradient Descent Ascent for Safe Multi-Agent Reinforcement Learning." (2021), pp. 8767-8775, hereinafter referenced as LU, in view of US 20210007023 A1, hereinafter referenced as UMAPATHY, and further in view of US 20210325891 A1, hereinafter referenced as YOUNG.

Regarding Claim 1
	LU teaches:
A reinforcement learning system, comprising: (LU, p. 8768, “Main Contributions” section: “we propose the first safe decentralized policy gradient (PG) descent and ascent algorithm, i.e., Safe Dec-PG, which is able to deal with a class of multi-agent safe RL problems over a graph.”; Examiner’s Note (EN): page 8767, Introduction, explains that “RL” is an acronym for reinforcement learning)
a plurality of agents, each agent implemented in software, having an individual reward function and one or more safety constraints that involve joint actions of the agents, (LU, p. 8769, “Multi-Agent Constrained Markov Decision Process (M-CMDP)” section: “Consider a team of n agents operating in a common environment, denoted by N = [n]. No central controller exists to either make the decisions or collect any information for the agents. ... each agent has some safety constraints, in the forms of bounds on some long term cost, that involve the joint policy of all agents. ... Each agent i only has access to its own reward”; 
LU, p. 8772: “Numerical Results” section: “To show the performance of safe decentralized RL, we test our algorithm .... on the popular OpenAI Gym paradigm. ... The experiments were run on the NVIDIA Tesla V100 GPU with 32GB memory.”;
(EN): Definition 1 on p. 8769 further discloses that each agent has a local reward function; the examiner notes that one of ordinary skill would understand that operating the experiments on a GPU requires software code for the GPU to execute, corresponding to recited “implemented in software” limitation) 
wherein each agent maximizes a team-average long-term return in performing the joint actions, subject to the safety constraints, (LU, p. 8769, “Multi-Agent Constrained Markov Decision Process (M-CMDP)” section: “As a team, the objective of all agents is to collaboratively maximize the globally average return over the network ... with only its local observations of the rewards, subject to some safety constraints”) 
and participates in decentralized machine learning and in operating the physical system. (LU, p. 8768, “Main Contributions” section: “In this work, by leveraging the min-max saddle-point formulation, we propose the first safe decentralized policy gradient (PG) descent and ascent algorithm, i.e., Safe Dec-PG, which is able to deal with a class of multi-agent safe RL problems over a graph.”
LU, p. 8772: “Numerical Results” section: “The experiments were run on the NVIDIA Tesla V100 GPU with 32GB memory.”; 
(EN): Fig. 1a discloses the network structure of a 5-agent system, where such system is implemented on the NVIDIA Tesla V100 GPU, corresponding to a “physical system”; decentralized reinforcement learning (RL) (see definition on page 8767) corresponds to recited “decentralized machine learning” limitation)
a peer-to-peer communication network configured to connect the plurality of agents and configured to implement the decentralized machine learning, each agent running on a corresponding node of the peer-to-peer communication network; and (LU, p. 8767, Abstract: “the agents are connected by a peer-to-peer communication network to share information with their neighbors”; 
LU, p. 8769, “Multi-Agent Constrained Markov Decision Process (M-CMDP) section: “Consider a team of n agents operating in a common environment,
denoted by N = [n]. No central controller exists to either make the decisions or collect any information for the agents. Agents are instead allowed to communicate with each other over a communication network G = (N; E), with E being the set of communication links that connect the agents. Such a decentralized model with networked agents finds broad applications in distributed cooperative control problems ... and has been advocated as one of the most popular paradigms in decentralized MARL. ... At each node, there would be multiple safety constraints.”;
LU, pp. 8772-73, “Numerical Results” section: “all agents are connected by a well-connected graph as shown in Figure 1(a), where every agent can only exchange their parameters θi with its neighbors through the communication channel (denoted by the green lines).”;
(EN): LU explains that communication network G has nodes for each agent, connected by edges (or communication links) E, corresponding to recited “each agent running on a corresponding node of the peer-to-peer communication network” limitation and the decentralized MARL disclosure corresponds to the recited “configured to implement the decentralized machine learning” limitation.
a distributed constrained Markov decision process (D-CMDP) model implemented over the peer-to-peer communication network (LU, p. 8769, “Multi-Agent Constrained Markov Decision Process (M-CMDP)” section: 
    PNG
    media_image1.png
    258
    472
    media_image1.png
    Greyscale

(EN): The M-CMDP of LU is a “distributed” CMDP because its distributed across the different agents in the communications network) 
and configured to perform policy optimization using a decentralized policy gradient (PG) method, (LU, p. 8767, Abstract: “In this work, we first formulate this problem as a distributed constrained Markov decision process (D-CMDP) with networked agents. Then, we propose a decentralized policy gradient (PG) method, Safe Dec-PG, to perform policy optimization based on this D-CMDP model over a network.”; LU, p. 8770, “Safe Policy Gradient” section: “The search for an optimal policy can thus be performed by applying the gradient descent-type iterative methods to the parametrized optimization problem.”; LU, p. 8771, “Safe Dec-PG: Safe Decentralized Policy Gradient” section, Algorithm 1) 
wherein the participation of each agent in operating the physical system is based on the D-CMDP model; (LU, p. 8769, “Multi-Agent Constrained Markov Decision Process (M-CMDP)” section: 
    PNG
    media_image1.png
    258
    472
    media_image1.png
    Greyscale

(EN): The M-CMDP of LU is a “distributed” CMDP because its distributed across the different agents in the communications network)
wherein: each agent resides on one of the computing nodes and each computing node is configured to operate in accordance with the corresponding distributed constrained Markov decision process (D-CMDP) model (LU, p. 8769, “Multi-Agent Constrained Markov Decision Process (M-CMDP)” section:
 
    PNG
    media_image1.png
    258
    472
    media_image1.png
    Greyscale

(EN): The M-CMDP of LU is a “distributed” CMDP because its distributed across the different agents in the communications network, and each agent pertains to a “node” as explained on p. 8769)
each computing node performs computing tasks subject to constraints including..., and data storage capacity; (LU, p. 8769, “Multi-Agent Constrained Markov Decision Process (M-CMDP)” section: “At each node, there would be multiple safety constraints. These rewards describe different objectives that the agent is required to achieve, such as remaining with a region of the state space, or not running out of memory/battery”; 
(EN): LU discloses that each computing node has constraints, including not running out of memory (corresponding to recited “data storage capacity” constraints)

	However, LU fails to explicitly teach:
a physical system comprising a plurality of computing nodes configured as a cloud computing environment
local computing capacity, inter-node communication capacity 
a team reward aims to see that all computing tasks are performed in the shortest amount of time subject to the constraints.

However, in a related field of endeavor (optimizing communications networks using reinforcement learning, see para. 0063), UMAPATHY discloses:
a physical system comprising a plurality of computing nodes configured as a cloud computing environment (UMAPATHY, para. 0043: “The cloud 144 may represent a cloud computing architecture/platform that provides one or more cloud computing services. Cloud computing refers to a paradigm for enabling network access to a scalable and elastic pool of shareable computing resources with self-service provisioning and administration on-demand and without active management by users.”;
(EN): in combination with LU, the different agents of LU (see Fig. 1(a)) are each implemented on a separate node in a cloud computing environment as in UMAPATHY)
each computing node performs computing tasks subject to constraints including local computing capacity, inter-node communication capacity, and data storage capacity; (UMAPATHY, para. 0083: “This may involve, for example, evaluating both computation and communication resources needed for different data transfers, threshold criteria or a desired level of reliability, amount or type of compute node capabilities (e.g., a certain processor speed, memory/storage capacity, RAT circuitries implemented by the computing system 201, etc.), type of operational constraints under which the computing system 201 is operating (e.g., radio conditions and/or link quality, a surrounding environmental conditions (e.g., temperature, humidity, altitude, etc.), component temperature, etc.), and/or the like.”; 
(EN): in combination with LU, each computing node is subject to constraints including compute node capabilities, communication resources needed, and storage capacity as taught by UMAPATHY; the examiner notes that one of ordinary skill would understand that node resources are not unlimited, and that any computing node would naturally have limitations related to processing power (e.g., limited by the number of processors and their processing speeds), communication capacity (e.g., limited by input/output ports and devices), and storage capacity (e.g., limited by actual storage constraints))

	Before the effective filing date of the present application, it would have been obvious to combine the Safe Dec-PG system of LU with the teachings of UMAPATHY as explained above.  As disclosed by UMAPATHY, one of ordinary skill would have been motivated to do so because UMAPATHY teaches that an “advantage” of using a reinforcement learning system is to help “tailor” “recommendations to suit the user's behavior and preferences, and/or suit the capabilities and/or operational constraints of the compute node.” (para. 0109).  UMAPATHY further discloses the benefit of extending cloud services to fog services for IoT devices. (para. 0024).

However, LU and UMAPATHY fail to explicitly teach:
a team reward aims to see that all computing tasks are performed in the shortest amount of time subject to the constraints.

However, in a related field of endeavor (reinforcement learning technique for multi-agent systems, see paras. 0003, 0005), YOUNG teaches and makes obvious:

a team reward aims to see that all computing tasks are performed in the shortest amount of time subject to the constraints. (YOUNG, para. 0020: “Embodiments can harness and apply advancements in MARL for spatial-temporal domains air, land, sea, space, and cyber. Embodiments can include training of an ML technique for path generation, multi-agent training that learns cooperative behaviors using shared rewards and learned or pre-defined communication protocols, execution of learned policies in new environments, and responding to dynamic (never before seen) changes in an environment.”;
YOUNG, para. 0068: “Modifying the shared reward function can allow modeling of important factors, such as limited amount of available fuel, and multi-objectives, such as minimizing the time to reach high priority targets while simultaneously minimizing risk.”
YOUNG, para. 0101: “ In a cyber setting, MARL can enable cooperative behavior of robot teams for many different applications, such as self-driving cars, and self-driving forklifts in smart factories. These applications can benefit from the asset policies independently learned using shared rewards which encourage meeting multiple objectives while satisfying multiple constraints.”;
(EN): YOUNG discloses a shared (corresponding to recited “team award”) where an objective can be “minimizing the time to reach higher priority targets while simultaneously minimizing risk”; the LU-UMAPATHY-YOUNG combination now modifies the Safe Dec-PG system of LU to apply to the cloud computing environment of UMAPATHY, where there are constraints with respect to computing power, communication resources, and storage capacity as taught by UMAPATHY, and then uses a team reward as in YOUNG to see that all tasks of UMAPATHY are computed by “minimizing the time” as in YOUNG)

	Before the effective filing date of the present application, it would have been obvious to combine the Safe Dec-PG system of LU with the cloud computing teachings of UMAPATHY and YOUNG as explained above.  As disclosed by YOUNG, one of ordinary skill would have been motivated to do so in order to use the techniques of YOUNG with respect to “MARL derived policies, which can optimize multi-objective goals, while satisfying constraints as a replacement for MILP solutions. This allows the trained agents to better respond to dynamic changing conditions.” (para. 0068).  

Regarding Claim 2
	LU, UMAPATHY, and YOUNG teach the system of claim 1.  LU further teaches:
wherein the individual reward function and the safety constraints are known to the corresponding agent and unknown to the remaining agents. (LU, p. 8767, Abstract: “both the rewards and constraints are only known to each agent locally/privately.”; LU, p. 8769, “Multi-Agent Constrained Markov Decision Process (M-CMDP)” section: “The states s and actions a are globally observable, while the rewards and costs are observed locally/privately at each agent.”)

Regarding Claim 3
	LU, UMAPATHY, and YOUNG teach the system of claim 1.  LU further teaches:
wherein each agent is configured to explore interactions with an environment to maximize a cumulative reward through a reinforcement learning process. (LU, p. 8767, Introduction, “where an agent explores the interactions with an environment so that it is able to maximize a cumulative reward through this learning process.”)

Regarding Claim 4
	LU, UMAPATHY, and YOUNG teach the system of claim 1.  LU further teaches:
wherein the safety constraints are in a form of bounds on a long term cost associated with a joint policy of the agents. (LU, p. 8769, “Multi-Agent Constrained Markov Decision Process (M-CMDP)” section:

    PNG
    media_image2.png
    386
    468
    media_image2.png
    Greyscale

(EN): the lower-bounds of the negative team-average discounted long-term return denotes the “costs” of an agent, and is related to the joint policy of the agents as explained above)

Regarding Claim 5
	LU, UMAPATHY, and YOUNG teach the system of claim 1.  LU further teaches:
wherein each agent is described by a tuple 
    PNG
    media_image3.png
    28
    252
    media_image3.png
    Greyscale
where S comprises a state space shared by the agents, a graph G represents the peer-to-peer communication network, 
    PNG
    media_image4.png
    24
    92
    media_image4.png
    Greyscale
 comprises a joint action space of the agents, 
    PNG
    media_image5.png
    28
    288
    media_image5.png
    Greyscale
comprise local rewards and cost functions of corresponding agent i, P : S x A x S [Wingdings font/0xE0] [0,1] comprises a state transition probability of the Markov decision process, and 
    PNG
    media_image6.png
    20
    28
    media_image6.png
    Greyscale
(0,1) denotes a discount factor. (LU, p. 8769, “Multi-Agent Constrained Markov Decision Process (M-CMDP)” section: 
    PNG
    media_image1.png
    258
    472
    media_image1.png
    Greyscale
)

Regarding Claim 6
	LU, UMAPATHY, and YOUNG teach the system of claim 5.  LU further teaches:
an objective of the agents is to collaboratively maximize a globally average return over the peer-to-peer communication network, dictated by 
    PNG
    media_image7.png
    28
    204
    media_image7.png
    Greyscale
, with only local observations of the local rewards, subject to the corresponding safety constraints dictated by 
    PNG
    media_image8.png
    26
    54
    media_image8.png
    Greyscale
, and wherein each agent is associated with m of the cost functions and 
    PNG
    media_image8.png
    26
    54
    media_image8.png
    Greyscale
comprises a mapping S x A to 
    PNG
    media_image9.png
    24
    32
    media_image9.png
    Greyscale
 (LU, p. 8769, “Multi-Agent Constrained Markov Decision Process (M-CMDP)” section: “As a team, the objective of all agents is to collaboratively maximize the globally average return over the network (equivalently to minimize the opposite of it), dictated by
    PNG
    media_image7.png
    28
    204
    media_image7.png
    Greyscale
, with only its local observations of the rewards, subject to some safety constraints dictated by
    PNG
    media_image8.png
    26
    54
    media_image8.png
    Greyscale
. ... Here, we assume that each agent is associated with m cost functions, so
    PNG
    media_image8.png
    26
    54
    media_image8.png
    Greyscale
 is a mapping
from S x A to 
    PNG
    media_image9.png
    24
    32
    media_image9.png
    Greyscale
)

Regarding Claim 7
	LU, UMAPATHY, and YOUNG teach the system of claim 5.  LU further teaches:
wherein the local rewards describe different objectives that the corresponding agent is to achieve.  (LU, p. 8769, “Multi-Agent Constrained Markov Decision Process (M-CMDP)” section: “These rewards describe different objectives that the agent is required to achieve”)

Regarding Claim 8
	LU, UMAPATHY, and YOUNG teach the system of claim 5.  LU further teaches:
wherein, at time t, each agent i selects an action 
    PNG
    media_image10.png
    22
    20
    media_image10.png
    Greyscale
given a state 
    PNG
    media_image11.png
    24
    24
    media_image11.png
    Greyscale
 according to a corresponding local policy, 
    PNG
    media_image12.png
    22
    110
    media_image12.png
    Greyscale
, which is parametrized as 
    PNG
    media_image13.png
    22
    348
    media_image13.png
    Greyscale


    PNG
    media_image14.png
    84
    588
    media_image14.png
    Greyscale

(LU, p. 8769, “Multi-Agent Constrained Markov Decision Process (M-CMDP)” section: 

    PNG
    media_image15.png
    184
    482
    media_image15.png
    Greyscale
)
Regarding Claim 9
	LU, UMAPATHY, and YOUNG teach the system of claim 5.  LU further teaches:
wherein the joint policy πθ is: 

    PNG
    media_image16.png
    234
    600
    media_image16.png
    Greyscale

(LU, p. 8769, “Multi-Agent Constrained Markov Decision Process (M-CMDP)” section: 

    PNG
    media_image17.png
    364
    474
    media_image17.png
    Greyscale
)
Regarding Claim 10
	LU, UMAPATHY, and YOUNG teach the system of claim 6.  LU further teaches:

    PNG
    media_image18.png
    190
    594
    media_image18.png
    Greyscale


    PNG
    media_image19.png
    114
    410
    media_image19.png
    Greyscale


    PNG
    media_image20.png
    158
    404
    media_image20.png
    Greyscale

(LU, p. 8770, “Safe Policy Gradient” section: 

    PNG
    media_image21.png
    320
    472
    media_image21.png
    Greyscale
)

Regarding Claim 11
	LU, UMAPATHY, and YOUNG teach the system of claim 1.  LU further teaches:
wherein parameters of the individual reward function for each agent are updated by:

    PNG
    media_image22.png
    154
    608
    media_image22.png
    Greyscale

(LU, p. 8771, “Safe Dec-PG: Safe Decentralized Policy Gradient” section: 
    PNG
    media_image23.png
    326
    478
    media_image23.png
    Greyscale
)

Regarding Claim 12
	LU, UMAPATHY, and YOUNG teach the system of claim 11.  LU further teaches:
wherein the variable 
    PNG
    media_image24.png
    28
    22
    media_image24.png
    Greyscale
is updated locally as:

    PNG
    media_image25.png
    32
    404
    media_image25.png
    Greyscale


    PNG
    media_image26.png
    32
    118
    media_image26.png
    Greyscale

(LU, p. 8771, “Safe Dec-PG: Safe Decentralized Policy Gradient” section:

    PNG
    media_image27.png
    206
    476
    media_image27.png
    Greyscale
)

Regarding Claim 13
	LU, UMAPATHY, and YOUNG teach the system of claim 12.  LU further teaches:
wherein the update is based on:

    PNG
    media_image28.png
    40
    550
    media_image28.png
    Greyscale


    PNG
    media_image29.png
    44
    592
    media_image29.png
    Greyscale

(LU, p. 8771, “Safe Dec-PG: Safe Decentralized Policy Gradient” section:

    PNG
    media_image30.png
    140
    476
    media_image30.png
    Greyscale
)

Regarding Claim 14
	LU, UMAPATHY, and YOUNG teach the system of claim 1.  LU further teaches:
wherein rewards considered in an individual reward function of the agents are based on a distance between a location of an agent and its desired landmark and a minimum distance between two agents of the plurality of agents. (LU, p. 8773, “Environment” section: “The rewards considered in the objective function include two parts: i) the first one is based on the distance between the location of the node to its desired landmark, which is a monotonically decreasing function of the distance, (i.e., the smaller the distance, the higher the reward will be); ii) the second one is determined by the minimum distance between two agents.”)

Regarding Claim 15
	LU, UMAPATHY, and YOUNG teach the system of claim 1.  LU further teaches:
wherein two of the plurality of agents are penalized by a negative reward value in response to a distance between the two agents being lower than a defined threshold. (LU, p. 8773, “Environment” section: “If the distance between two agents is lower than a threshold, then we consider that a collision happens, and both of the agents will be penalized by a large negative reward value, i.e., -1.”)

Regarding Claim 16
	LU, UMAPATHY, and YOUNG teach the system of claim 15.  LU further teaches:
wherein a reward at one of the two agents is scaled by positive coefficients representing a heterogeneity of different agents. (LU, p. 8773, “Environment” section: “the reward at each agent is further scaled by different positive coefficients, representing the heterogeneity, e.g., priority levels, of different agents.”

Claims 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over LU in view of US 20210065547 A1, hereinafter referenced as RAN, and further in view of Ge, Xiaohua. "Distributed formation control of networked multi-agent systems using a dynamic event-triggered communication mechanism." IEEE Transactions on Industrial Electronics 64.10 (2017): 8118-8127, hereinafter referenced as GE, and further in view of YOUNG.

Regarding Claim 19
	LU teaches:
generating a distributed constrained Markov decision process (D-CMDP) model (LU, p. 8769, “Multi-Agent Constrained Markov Decision Process (M-CMDP)” section: 
    PNG
    media_image1.png
    258
    472
    media_image1.png
    Greyscale

(EN): The M-CMDP of LU is a “distributed” CMDP because its distributed across the different agents in the communications network) 
configured to implement decentralized machine learning over a peer-to-peer communication network; (LU, p. 8767, Abstract: “the agents are connected by a peer-to-peer communication network to share information with their neighbors”; 
LU, p. 8769, “Multi-Agent Constrained Markov Decision Process (M-CMDP) section: “Consider a team of n agents operating in a common environment,
denoted by N = [n]. No central controller exists to either make the decisions or collect any information for the agents. Agents are instead allowed to communicate with each other over a communication network G = (N; E), with E being the set of communication links that connect the agents. Such a decentralized model with networked agents finds broad applications in distributed cooperative control problems ... and has been advocated as one of the most popular paradigms in decentralized MARL. ... At each node, there would be multiple safety constraints.”;
LU, pp. 8772-73, “Numerical Results” section: “all agents are connected by a well-connected graph as shown in Figure 1(a), where every agent can only exchange their parameters θi with its neighbors through the communication channel (denoted by the green lines).”;
(EN): LU explains that communication network G has nodes for each agent, connected by edges (or communication links) E, corresponding to recited “each agent running on a corresponding node of the peer-to-peer communication network” limitation and the decentralized MARL disclosure corresponds to the recited “configured to implement the decentralized machine learning” limitation.
configured to perform policy optimization using a decentralized policy gradient (PG) method; (LU, p. 8767, Abstract: “In this work, we first formulate this problem as a distributed constrained Markov decision process (D-CMDP) with networked agents. Then, we propose a decentralized policy gradient (PG) method, Safe Dec-PG, to perform policy optimization based on this D-CMDP model over a network.”; LU, p. 8770, “Safe Policy Gradient” section: “The search for an optimal policy can thus be performed by applying the gradient descent-type iterative methods to the parametrized optimization problem.”; 
LU, p. 8771, “Safe Dec-PG: Safe Decentralized Policy Gradient” section, Algorithm 1)
the peer-to-peer communication network configured to connect a plurality of software-based agents that are configured to participate in the decentralized machine learning, each agent implemented in software and running on a corresponding node of the peer-to-peer communication network (LU, p. 8767, Abstract: “the agents are connected by a peer-to-peer communication network to share information with their neighbors”; 
LU, p. 8769, “Multi-Agent Constrained Markov Decision Process (M-CMDP) section: “Consider a team of n agents operating in a common environment,
denoted by N = [n]. No central controller exists to either make the decisions or collect any information for the agents. Agents are instead allowed to communicate with each other over a communication network G = (N; E), with E being the set of communication links that connect the agents. Such a decentralized model with networked agents finds broad applications in distributed cooperative control problems ... and has been advocated as one of the most popular paradigms in decentralized MARL. ... At each node, there would be multiple safety constraints.”;
LU, p. 8772: “Numerical Results” section: “To show the performance of safe decentralized RL, we test our algorithm .... on the popular OpenAI Gym paradigm. ... The experiments were run on the NVIDIA Tesla V100 GPU with 32GB memory.”;
LU, pp. 8772-73, “Numerical Results” section: “all agents are connected by a well-connected graph as shown in Figure 1(a), where every agent can only exchange their parameters θi with its neighbors through the communication channel (denoted by the green lines).”;
(EN): LU explains that communication network G has nodes for each agent, connected by edges (or communication links) E, corresponding to recited “each agent ... running on a corresponding node of the peer-to-peer communication network” limitation and the decentralized MARL disclosure corresponds to the recited “configured to participate in the decentralized machine learning” limitation and the operating the experiments on a GPU requires software code for the GPU to execute, corresponding to recited “implemented in software” and “software-based agents” limitations) 
maximizing a team-average long-term return in performing one or more joint actions, subject to one or more safety constraints, based on an individual reward function; and (LU, p. 8769, “Multi-Agent Constrained Markov Decision Process (M-CMDP)” section: “As a team, the objective of all agents is to collaboratively maximize the globally average return over the network ... with only its local observations of the rewards, subject to some safety constraints”)
participating in operating a physical system ... based on the D-CMDP model, ... (LU, p. 8772: “Numerical Results” section: “The experiments were run on the NVIDIA Tesla V100 GPU with 32GB memory.”; (EN): Fig. 1a discloses the network structure of a 5-agent system, where such system is implemented on the NVIDIA Tesla V100 GPU, corresponding to a “physical system”)
wherein: each agent is configured to participate in operating the given one of the vehicles in accordance with the corresponding distributed constrained Markov decision process (D-CMDP) model; (LU, p. 8769, “Multi-Agent Constrained Markov Decision Process (M-CMDP)” section:
 
    PNG
    media_image1.png
    258
    472
    media_image1.png
    Greyscale

(EN): The M-CMDP of LU is a “distributed” CMDP because its distributed across the different agents in the communications network, and each agent pertains to a “node” as explained on p. 8769)

	However, LU fails to explicitly teach:
	comprising a plurality of vehicles in an autonomous vehicle system ... each agent corresponding to a given one of the vehicles; 
each vehicle is attempting to reach a destination subject to constraints including inter-agent messaging capacity constraints; and
a team reward aims to see that each vehicle reaches its destination in a shortest amount of time subject to the constraints.

	However, in a related field of endeavor (distributed driving systems for connected vehicles, see para. 0004), RAN teaches and makes obvious:
comprising a plurality of vehicles in an autonomous vehicle system ... each agent corresponding to a given one of the vehicles; (RAN, para. 0004: “Provided herein is technology related to a distributed driving system (DDS) that provides transportation management and operations and vehicle control for connected and automated vehicles (CAV) and intelligent road infrastructure systems (IRIS) and particularly, but not exclusively, to methods and systems for sending individual vehicles with customized, detailed, and time-sensitive control instructions and traffic information for automated vehicle driving, such as vehicle following, lane changing, route guidance, and other related information.”; 
(EN): in combination with LU, the different agents of LU (see Fig. 1(a)) are each implemented on a separate vehicle in a distributed driving system as disclosed by RAN)
each vehicle is attempting to reach a destination subject to constraints; and 
(RAN, para. 0006: “In some embodiments, the DDS is configured to provide on-demand and dynamic IRT functions to individual CAVs to avoid trajectory conflicts with other vehicles (e.g., collision avoidance) and/or to adjust vehicle route and/or trajectory for abnormal driving environments (e.g., weather events, natural disasters, traffic accidents, etc.) ... In some embodiments, the DUM module is configured to optimize resources by CAVs at various vehicle intelligence levels by optimizing a cost function (e.g., identifying an optimal minimum of the cost function) describing the total cost to implement an automated driving system as a sum of functions (e.g., functions providing positive values) for computation ability cost (C), number of computational units cost (NU), fuel consumption cost (P), climate control and/or driver comfort (e.g., acceleration and/or deceleration) cost (V), and/or IRT cost (I)”;
RAN, para. 0046: “As used herein, the term “connected and autonomous vehicle” or “CAV” refers to an autonomous vehicle that is able to communicate with other vehicles (e.g., by V2V communication)”
RAN, para. 0063: “In some embodiments, the driver input data comprises origin of the trip, destination of the trip, expected travel time, and/or service requests.”
(EN): in combination with LU, each vehicle is subject to constraints including fuel consumption, vehicle-to-vehicle communication and computational unit costs, and collision avoidance, as taught by RAN; the examiner notes that one of ordinary skill would understand that vehicle resources are not unlimited, and that any vehicle would naturally have limitations related to fuel consumption (e.g., limited by the size of a fuel tank), vehicle-to-vehicle communication constraints (e.g., limited by distance and computational power), and collision avoidance (e.g., avoiding accidents at all costs))

	Before the effective filing date of the present application, it would have been obvious to combine the Safe Dec-PG system of LU with the teachings of RAN as explained above.  As disclosed by RAN, one of ordinary skill would have been motivated to do so because RAN teaches techniques for improving the “safety and stability of individual CAVs according to the needs of individual CAVs.” (para. 0007).  One of ordinary skill would further understand the benefit of applying the teachings of LU to the autonomous vehicle use case as demonstrated by RAN.

However, LU and RAN fail to explicitly teach:
including inter-agent messaging capacity constraints
a team reward aims to see that each vehicle reaches its destination in a shortest amount of time subject to the constraints.

	However, in a related field of endeavor (multi-agent systems, see p. 8118, section I), GE teaches and makes obvious:
including inter-agent messaging capacity constraints (GE, p. 8118, section I:” In this regard, inter-agent communication plays a crucial role in formation control. It should be pointed out that most of the existing results on formation control of MASs including the ones aforementioned have been based on an explicit assumption that data communication among agents is performed continually at every instant of time or periodically at equidistant sampling in stants. Nevertheless, this assumption may be unrealistic in some practical applications.”
GE, p. 8119, section I: “A new dynamic event-triggered communication mechanism (DECM) will be developed to schedule inter-agent communication. In contrast to most of the existing ECMs, the threshold parameter in the proposed event triggering condition will not be fixed permanently but vary with time by following a dynamic rule. It will be shown numerically that the DECM can achieve a better tradeoff between reducing data transmissions through a communication network and preserving favorable formation performance”;
Examiner’s Note: with respect to multi-agent systems that require formation control, such as unmanned aerial vehicles or autonomous underwater vehicles, GE teaches that there are constraints about how often inter-agent communications can be made; the LU-RAN-GE combination now makes the inter-agent communication trade-offs for vehicle formation control as taught by GE as part of the constraints used in RAN to reach a destination)

Before the effective filing date of the present application, it would have been obvious to combine the Safe Dec-PG system of LU with the teachings of RAN and GE as explained above.  As disclosed by GE, one of ordinary skill would have been motivated to do so because GE teaches that “inter-agent communications plays a crucial role in formation control” with respect to formations of traveling vehicles. (p. 8118, section I).

However, LU, RAN, and GE fail to explicitly teach:
a team reward aims to see that each vehicle reaches its destination in a shortest amount of time subject to the constraints.

However, in a related field of endeavor (reinforcement learning technique for multi-agent systems, see paras. 0003, 0005), YOUNG teaches and makes obvious:
a team reward aims to see that each vehicle reaches its destination in a shortest amount of time subject to the constraints. (YOUNG, para. 0020: “Embodiments can harness and apply advancements in MARL for spatial-temporal domains air, land, sea, space, and cyber. Embodiments can include training of an ML technique for path generation, multi-agent training that learns cooperative behaviors using shared rewards and learned or pre-defined communication protocols, execution of learned policies in new environments, and responding to dynamic (never before seen) changes in an environment.”;
YOUNG, para. 0068: “Modifying the shared reward function can allow modeling of important factors, such as limited amount of available fuel, and multi-objectives, such as minimizing the time to reach high priority targets while simultaneously minimizing risk.”
YOUNG, para. 0101: “ In a cyber setting, MARL can enable cooperative behavior of robot teams for many different applications, such as self-driving cars, and self-driving forklifts in smart factories. These applications can benefit from the asset policies independently learned using shared rewards which encourage meeting multiple objectives while satisfying multiple constraints.”;
(EN): YOUNG discloses a shared (corresponding to recited “team award”) where an objective can be “minimizing the time to reach higher priority targets while simultaneously minimizing risk”; the LU-RAN-GE-YOUNG combination now modifies the Safe Dec-PG system of LU to apply to the autonomous vehicle environment of RAN, where there are constraints with respect to fuel consumption and collision avoidance, to utilize inter-agent communication constraints for formation control as in GE, and then uses a team reward as in YOUNG to see that constraints of RAN are computed by “minimizing the time” as in YOUNG)

	Before the effective filing date of the present application, it would have been obvious to combine the Safe Dec-PG system of LU with the teachings of RAN, GE and YOUNG as explained above.  As disclosed by YOUNG, one of ordinary skill would have been motivated to do so in order to use the techniques of YOUNG with respect to “MARL derived policies, which can optimize multi-objective goals, while satisfying constraints as a replacement for MILP solutions. This allows the trained agents to better respond to dynamic changing conditions.” (para. 0068).  

Regarding Claim 20
	LU teaches:
A computer program product for federated learning, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computer to cause the computer to perform a method comprising: (LU, p. 8772: “Numerical Results” section: “The experiments were run on the NVIDIA Tesla V100 GPU with 32GB memory.”)
	The remaining limitations in claim 20 correspond to the method of claim 19, and therefore this claim 20 is rejected for the same reasons explained above with respect to claim 19.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
US 8201773 B1 (Durham).  “Presently, wireless networks of unmanned vehicles (UVs) and sensors are currently being used by the military for tactical intelligence, surveillance, and reconnaissance (ISR). UV operations are constrained by several physical and size factors. Specifically, network capabilities are limited by available remote power and communication capacities (i.e. baud and bit rate) between network nodes” (col. 2, lines 15-21).
US 10721315 B1 (Ezra).  “In some embodiments, the data 240 may include, for example, network configurations, requirements of service demands, routes for existing service, capacity of the network devices and each service path, and so on. For example, the data 240 may include network topology data of the service network 160, capacity data of the network devices 120, and capacity data of the communication links between the network devices 120. The data 240 may also include requirements data of service demands and resource allocation data for each service in the service network 160.” (col. 9, line 63 – col. 10, line 5).
US 20170264395 A1 (Wakabayashi).  “D2D communications offer the possibility to address the aforementioned problems of network capacity and the requirement of network coverage for communications between LTE devices. For example, if user data can be communicated directly between UEs only one set of resources is required to communicate the data rather than both uplink and downlink resources. Furthermore, if UEs are capable of communicating directly, UEs within range of each other may communicate even when outside of a coverage area provided an eNodeB. As a result of these potential benefits, the introduction of D2D capabilities into LTE systems has been proposed.” (para. 0030).

Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHAEL C LEE whose telephone number is (571)272-4933. The examiner can normally be reached M-F 12:00 pm - 8:00 pm ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Omar Fernandez Rivas can be reached at 571-272-2589. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/MICHAEL C. LEE/Examiner, Art Unit 2128
Read full office action
Prosecution Timeline

Show 6 earlier events
Oct 09, 2025
Applicant Interview (Telephonic)
Oct 20, 2025
Applicant Interview (Telephonic)
Oct 20, 2025
Examiner Interview Summary
Dec 11, 2025
Request for Continued Examination
Dec 17, 2025
Response after Non-Final Action
Jan 15, 2026
Non-Final Rejection mailed — §103
Apr 15, 2026
Response Filed
May 21, 2026
Non-Final Rejection mailed — §103 (current)
Precedent Cases

Applications granted by this same examiner with similar technology

17/970,943
Patent 12645972
Performing Property Estimation Using Quantum Gradient Operation on Quantum Computing System
3y 7m to grant Granted Jun 02, 2026
17/475,724
Patent 12603081
METHOD AND SERVER FOR A TEXT-TO-SPEECH PROCESSING
4y 7m to grant Granted Apr 14, 2026
17/732,871
Patent 12602605
QUANTUM COMPUTER ARCHITECTURE BASED ON MULTI-QUBIT GATES
3y 11m to grant Granted Apr 14, 2026
17/207,554
Patent 12591915
METHODS AND SYSTEMS FOR DETERMINING RECOMMENDATIONS BASED ON REAL-TIME OPTIMIZATION OF MACHINE LEARNING MODELS
5y 0m to grant Granted Mar 31, 2026
18/885,396
Patent 12585743
INTERFACE ACCESS PROCESSING METHOD, COMPUTER DEVICE AND STORAGE MEDIUM
1y 6m to grant Granted Mar 24, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.
Strategy Recommendation AI-generated — please review before filing

Get a prosecution strategy drawn from examiner precedents, rejection analysis, and claim mapping.
Typically takes 5-10 seconds — AI-generated, attorney review required before filing
Prosecution Projections

4-5
Expected OA Rounds
62%
Grant Probability
88%
With Interview (+25.8%)
3y 3m (~0m remaining)
Median Time to Grant
High
PTA Risk
Based on 149 resolved cases by this examiner. Grant probability derived from career allowance rate.