Last updated: May 29, 2026
Application No. 18/054,009
SYSTEMS AND METHODS FOR SOLVING MULTI-AGENT DECISION PROCESSES WITH NETWORK CONSTRAINTS

Final Rejection §103
Filed: Nov 09, 2022
Examiner: MAIDO, MAGGIE T
Art Unit: 2129
Tech Center: 2100 (Computer Architecture & Software)
Assignee: Salesforce Inc.
OA Round: 2 (Final)
Interview Optional

An interview has already been conducted in this application's prosecution history. This examiner has a 62% grant rate with a +27.6% interview lift. Since an interview has already been tried, a written response with narrowed claims, based on precedent claim evolution patterns, is recommended.
Based on 39 resolved cases, 2023–2026
Examiner Intelligence

MAIDO, MAGGIE T
Career Allowance Rate: 62% of resolved cases (24 granted / 39 resolved), +6.5% vs TC avg
Interview Lift: +27.6% (resolved cases with interview)
Avg Prosecution: 4y 1m
Currently pending: 33
Statute-Specific Performance

§101: 2.4% (-37.6% vs TC avg)
§103: 94.2% (+54.2% vs TC avg)
§102: 0.5% (-39.5% vs TC avg)
§112: 2.9% (-37.1% vs TC avg)
Based on career data from 39 resolved cases
Office Action

§103
DETAILED ACTION

Response to Amendment
The amendment filed on 26 January 2026 has been entered.
Claims 1-20 are pending.
Claims 1, 9, 17 are amended.
Applicant’s amendments to the Claims have overcome each and every rejection under 35 USC 101 previously set forth in the Non-Final Office Action mailed 25 September 2025.

Response to Arguments
Applicant’s remarks, regarding the rejections of claims under 35 USC 103, have been fully considered.

	Applicant has amended independent Claim 1 to recite, among other things: 
	"determining, for each agent of the plurality of agents independently without requiring knowledge of policies or actions of other agent of the plurality of agents, policies for the plurality of agents," 
	Applicant submits that both Lu and Vallam fail to teach at least these elements as amended. The combination of Lu and Vallam fails to teach or suggest determining policies for each agent "independently without requiring knowledge of policies or actions of other agent."
	Applicant submits the combination of Lu and Vallam fails to teach, suggest, or disclose all of the particular features recited in amended independent Claim 1. Therefore, Applicant submits Claim 1 is allowable over the cited references. Independent Claims 9 and 17 are amended similarly and are therefore allowable over the cited references for the same reasons.
	Examiner notes that Applicant’s arguments, as outlined above, are directed to newly amended claim limitations for which Examiner has not yet made a prima facie case, rendering Applicant’s arguments moot.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Objections
Claim 1 and analogous claims 9, 17, are objected to because of the following informalities: “the communication network” in line 19 should be “the communication network resources”. Appropriate correction is required.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 1-3, 6-11, 14-19 are rejected under 35 U.S.C. 103 as being unpatentable over Lu et al. (U.S. Pre-Grant Publication No. 20230113168, hereinafter 'Lu'), in view of Vallam et al. (U.S. Pre-Grant Publication No. 20200074328, hereinafter 'Vallam').

Regarding claim 1 and analogous claims 9, 17, Lu teaches A system for controlling access to a communication network resource, the system comprising ([0035] FIG. 1B is a block diagram of an example decentralized policy gradient (PG) system 150, in accordance with an example embodiment. Each agent 104 of a plurality of agents 104 resides on a computing node 10 of a collaboration network 166 (also referred to as peer-to-peer communication network 166 herein) and interacts with the environment 108 as well as its neighboring agents 104. (The skilled artisan will be familiar with implementing various peer-to-peer communication networks, including optical-based networks, wired networks, wireless networks, or any combination thereof) In this decentralized architecture, the collaboration network 166 provides communication between the agents 104 to share information (parameters) with neighboring agents 104.; [0110] In one example embodiment, a cloud computing system is controlled in accordance with the disclosed decentralized policy gradient (PG) technique. Each agent 104 corresponds to a node in the cloud computing system. Each node is attempting to perform computing tasks subject to constraints, such as local computing capacity, inter-node communication capacity, data storage capacity, and the like. The team reward aims to see that all computing tasks are performed in the shortest amount of time subject to the defined constraints.; [0134] Cloud computing is a model of service delivery for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, network bandwidth, servers, processing, memory, storage, applications, virtual machines, and services) that can be rapidly provisioned and released with minimal management effort or interaction with a provider of the service. This cloud model may include at least five characteristics, at least three service models, and at least four deployment models.):
a memory that stores network information and a plurality of processor-executable instructions ([0176] The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing.);
a communication interface that receives characteristics of a plurality of agents, and
constraints for a plurality of resources of a dynamic system including communication network resources having bandwidth, latency, or quota limits ([0030] Generally, systems and methods for reinforcement learning are disclosed, including distributed reinforcement learning problems with safety constraints. In particular, a team of agents that cooperate in a [constraints for a plurality of resources of a dynamic system] shared environment is considered, where each agent has its individual reward function and safety constraints that involve all agents' joint actions. As such, the agents aim to maximize the team-average long-term return, subject to all the safety constraints. In one or more embodiments, no central controller is needed to coordinate the agents, and both the rewards and constraints are only known to each agent locally/privately. Instead, the [a communication interface that receives characteristics of a plurality of agents] agents are connected by a peer-to-peer communication network to share information with their neighbors. This problem is formulated as a distributed constrained Markov decision process (D-CMDP) with networked agents.; [0133] It is to be understood that although this disclosure includes a detailed description on cloud computing, implementation of the teachings recited herein are not limited to a cloud computing environment.; [0134] Cloud computing is a model of service delivery for enabling convenient, on-demand network access to a shared pool of configurable [including communication network resources having bandwidth, latency, or quota limits] computing resources (e.g., networks, network bandwidth, servers, processing, memory, storage, applications, virtual machines, and services) that can be rapidly provisioned and released with minimal management effort or interaction with a provider of the service. This cloud model may include at least five characteristics, at least three service models, and at least four deployment models.); and
one or more hardware processors that read and execute the plurality of processor-executable instructions from the memory to perform operations including ([0175] The present invention may be a system, a method, and/or a computer program product at any possible technical detail level of integration. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.):
performing, at an iterative step ([0025] FIG. 2C illustrates the long-term cumulative reward of the constraints vs. the number of iterations, in accordance with an example embodiment):
determining, for each agent of the plurality of agents independently without requiring knowledge of policies or actions of other agent of the plurality of agents ([0113] In one example embodiment, the individual reward function and the safety constraints are known to the corresponding agent 104 and unknown to the remaining agents 104.),
policies for the plurality of agents that optimize respective reward values based on the plurality of costs, and the characteristics of the plurality of agents, simulating a multi-agent decision process for the dynamic system that comprises the communication network, using the determined policies, the plurality of costs, and the characteristics of the plurality of agents, thereby generating respective reward values and aggregated resource contribution values for the communication network resources ([0035] FIG. 1B is a block diagram of an example [policies for the plurality of agents that optimize respective reward values] decentralized policy gradient (PG) system 150, in accordance with an example embodiment. Each agent 104 of a plurality of agents 104 resides on a computing node 10 of a [for the dynamic system that comprises the communication network] collaboration network 166 (also referred to as peer-to-peer communication network 166 herein) and interacts with the environment 108 as well as its neighboring agents 104. (The skilled artisan will be familiar with implementing various peer-to-peer communication networks, including optical-based networks, wired networks, wireless networks, or any combination thereof) In this decentralized architecture, the collaboration network 166 provides communication between the agents 104 to share information (parameters) with neighboring agents 104. Each agent 104 conducts an update via a model update unit 154 [based on the plurality of costs, and the characteristics of the plurality of agents] based on constraints provided by a constraint awareness unit 162. A [simulating a multi-agent decision process] gradient computation is performed locally via a gradient computation unit 158 based on a current local model and information from the environment 108, and the gradient computation is shared [using the determined policies, the plurality of costs, and the characteristics of the plurality of agents] via the collaboration network 166 with other agents 104 as [thereby generating respective reward values and aggregated resource contribution values for the communication network resources] global gradient tracking 170 (which approximates global gradient information aimed at optimizing the team reward). In one example embodiment, the model update unit 154, the constraint awareness unit 162, and the gradient computation unit 158 reside on the computing node 10. In one example embodiment, a decentralized policy gradient (PG) method, referred to herein as Safe Dec-PG, is utilized to perform policy optimization based on this D-CMDP model over a network. Convergence guarantees, together with numerical results, showcase the superiority of the disclosed algorithm. The decentralized PG algorithm accounts for the coupled safety constraints with a quantifiable convergence rate in multi-agent reinforcement learning. The algorithm solves a class of decentralized stochastic nonconvex-concave minimax optimization problems, where both the algorithm design and corresponding theoretical analysis are of independent interest, and have applications in decentralized computing under constraints, logical neural networks, games, robotics, recommender systems, and the like.),
incrementing or decrementing the plurality of costs based on the constraints of the communication network resources and the aggregated resource contribution values ([0032] In one example embodiment, the Lagrangian framework is used to formulate the constrained optimization problem as a min-max saddle-point problem, where there are two processes of optimization involved: [incrementing or decrementing the plurality of costs based on the constraints of the communication network resources and the aggregated resource contribution values] minimization and maximization. The minimization step deals with optimizing the loss function while the maximization step enforces the constraints.),
updating a final reward value based on the generated respective reward values ([0114] In one example embodiment, each agent 104 is configured to explore interactions with an environment 108 to maximize a cumulative reward through a reinforcement learning process 150.), and
updating a final plurality of costs based on the plurality of costs ([0115] In one example embodiment, the safety constraints are in a form of bounds on a long term cost associated with a joint policy of the agents 104.);
continuing performing the iterative step for a predetermined number of iterations; outputting the final reward value and the final plurality of costs ([0112] In one aspect, a [continuing performing the iterative step for a predetermined number of iterations] reinforcement learning system 150 includes a plurality of agents 104, each agent 104 having an individual reward function and one or more safety constraints that involve joint actions of the agents 104, wherein each agent 104 maximizes a team-average long-term return in performing the joint actions, subject to the safety constraints 154, 162, 170, and participates in operating a physical system 50; a peer-to-peer communication network 166 configured to connect the plurality of agents 104; and a distributed constrained Markov decision process (D-CMDP) model 154 implemented over the peer-to-peer communication network 166 and configured to perform policy optimization using a decentralized policy gradient (PG) method, wherein the participation of each agent 104 in operating the physical system 50 is based on the D-CMDP model.; [0113] In one example embodiment, the individual reward function and the safety constraints are known to the corresponding agent 104 and unknown to the remaining agents 104.; [0114] In one example embodiment, each agent 104 is configured to explore interactions with an environment 108 to [outputting the final reward value] maximize a cumulative reward through a reinforcement learning process 150.; [0115] In one example embodiment, the safety constraints are in a form of bounds on a [the final plurality of costs] long term cost associated with a joint policy of the agents 104.); and
enforcing access to the communication network resources for a plurality of devices based on the final plurality of costs ([0045] More importantly, each agent has some safety constraints, in the form of bounds on some [based on the final plurality of costs] long term cost, that involve the joint policy(ies) of all agents. The following model of networked multi-agent constrained Markov decision process (M-CMDP) is introduced to characterize this setting.; [0110] In one example embodiment, a cloud computing system is controlled in accordance with the disclosed decentralized policy gradient (PG) technique. Each agent 104 corresponds to a [for a plurality of devices] node in the cloud computing system. Each node is attempting to perform computing tasks [enforcing access to the communication network resources] subject to constraints, such as local computing capacity, inter-node communication capacity, data storage capacity, and the like. The team reward aims to see that all computing tasks are performed in the shortest amount of time subject to the defined constraints.).
Lu fails to teach allocating initial values for a plurality of costs associated with the plurality of resources including the communication network resources;
Vallam teaches allocating initial values for a plurality of costs associated with the plurality of resources including the communication network resources ([0020] In general, the criteria to which an [allocating initial values for a plurality of costs] initial decision is made will [associated with the plurality of resources] pertain to one or more resource allocation requests. For example, as will be discussed below with respect to FIG. 3, a resource allocation request can be a request for a ranking of a set of items such as, for example, ranking of a proposal and in award processing.; [0021] Step 204 includes assigning an equal weight of resources specific to the other agents within an agent's model. Step 206 includes generating, for each of the initial decisions, a revised decision based at least in part on (i) comparing the initial decisions to each other, (ii) modeling aggregate group (i.e., the set of the other agents) behavior represented by the initial decisions of the plurality of agents, and (iii) modeling a behavior of the other agents in view of the initial decisions of the plurality of agents. First, the initial decisions will be compared to the other initial decisions such that the agent making the comparison will provide the same or different decision as the initial decision. When the decision is different, the comparison will be based on a prediction by each respective agent.; [0076] Resource provisioning 781 provides dynamic procurement of [including the communication network resources] computing resources and other resources that are utilized to perform tasks within the cloud computing environment. Metering and Pricing 782 provide cost tracking as resources are utilized within the cloud computing environment, and billing or invoicing for consumption of these resources.);
Lu and Vallam are considered to be analogous to the claimed invention because they are in the same field of machine learning. In view of the teachings of Lu, it would have been obvious for a person of ordinary skill in the art to apply the teachings of Vallam to Lu before the effective filing date of the claimed invention in order to approximate an agent’s behavior by computing the updated decision/strategy for each agent in the process of collaborative information aggregation (cf. Vallam, [0016] Accordingly, embodiments of the present invention present a system and method that performs dynamic allocation of resources while computing the updated decision/strategy for each agent in the process of collaborative information aggregation and this method leads to efficient approximation of an agent's behavior. The term “behavior” as used herein is understood to mean a potential decision of an agent or of another agent.).
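
The iterative step recited in claim 1 (allocate initial costs, let each agent choose a policy using only the current costs, aggregate resource use, then increment or decrement the costs against the constraints) can be illustrated with a minimal sketch. This is not the applicant's implementation, nor the method of Lu or Vallam. The single shared resource, the threshold policy, the running-average outputs, and every name (`run_cost_iteration`, `agent_values`, `agent_demands`, `capacity`, `learning_rate`) are assumptions made for illustration only.

```python
from typing import List, Tuple


def run_cost_iteration(
    agent_values: List[float],   # hypothetical value each agent gains by using the resource
    agent_demands: List[float],  # hypothetical amount of the resource each agent consumes
    capacity: float,             # constraint on the shared resource (e.g. a bandwidth limit)
    learning_rate: float = 0.1,
    num_iterations: int = 200,
    initial_cost: float = 0.0,   # allocated initial value of the cost
) -> Tuple[float, float]:
    """Return (average total reward, average cost) over all iterations."""
    cost = initial_cost
    reward_sum = 0.0
    cost_sum = 0.0
    for _ in range(num_iterations):
        # Each agent decides on its own, seeing only the current cost,
        # never the policies or actions of the other agents.
        uses = [v - cost * d > 0 for v, d in zip(agent_values, agent_demands)]

        # Simulate the joint outcome: total reward and aggregated resource use.
        reward = sum(v for v, u in zip(agent_values, uses) if u)
        usage = sum(d for d, u in zip(agent_demands, uses) if u)

        # Increment the cost when usage exceeds the constraint and
        # decrement it otherwise, keeping the cost non-negative.
        cost = max(0.0, cost + learning_rate * (usage - capacity))

        reward_sum += reward
        cost_sum += cost
    return reward_sum / num_iterations, cost_sum / num_iterations
```

For example, `run_cost_iteration([3.0, 2.0, 1.0], [1.0, 1.0, 1.0], capacity=2.0)` drives the cost upward until the lowest-value agent stops using the resource, at which point aggregate usage matches the capacity.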

Regarding claim 2 and analogous claims 10, 18, Lu, as modified by Vallam, teaches The system, The non-transitory machine-readable medium, The method of claims 1, 9, 17, respectively.
Lu teaches wherein determining policies for the plurality of agents includes: determining policies for the plurality of agents that maximizes the respective reward values subject to the constraints for the plurality of resources of the dynamic system ([0112] In one aspect, a reinforcement learning system 150 includes a plurality of agents 104, each agent 104 having an individual reward function and one or more safety constraints that involve joint actions of the agents 104, wherein each agent 104 [maximizes the respective reward values] maximizes a team-average long-term return in performing the joint actions, [subject to the constraints for the plurality of resources of the dynamic system] subject to the safety constraints 154, 162, 170, and participates in operating a physical system 50; a peer-to-peer communication network 166 configured to connect the plurality of agents 104; and a distributed constrained Markov decision process (D-CMDP) model 154 implemented over the peer-to-peer communication network 166 and configured to perform [determining policies for the plurality of agents] policy optimization using a decentralized policy gradient (PG) method, wherein the participation of each agent 104 in operating the physical system 50 is based on the D-CMDP model.).
Lu and Vallam are combinable for the same rationale as set forth above with respect to claim 1.

Regarding claim 3 and analogous claims 11, 19, Lu, as modified by Vallam, teaches The system, The non-transitory machine-readable medium, The method of claims 2, 10, 18, respectively.
Lu teaches wherein determining policies for the plurality of agents includes: computing a Lagrangian having a mixed deterministic Markov policy for each agent ([0032] In one example embodiment, the [computing a Lagrangian] Lagrangian framework is used to formulate the constrained optimization problem as a min-max saddle-point problem, where there are two processes of optimization involved: minimization and maximization. The minimization step deals with optimizing the loss function while the maximization step enforces the constraints.; [0111] Given the discussion thus far, it will be appreciated that, in general terms, an exemplary method, according to an aspect of the invention, includes the operations of generating a distributed constrained [having a mixed deterministic Markov policy for each agent] Markov decision process (D-CMDP) model 154 configured to perform policy optimization using a decentralized policy gradient (PG) method; maximizing a team-average long-term return in performing one or more joint actions, subject to one or more safety constraints, based on an individual reward function 154, 162, 170; and participating in operating a physical system 50 based on the D-CMDP model.), and
taking an expectation with respect to a probability distribution of the respective reward values induced by the mixed deterministic Markov policy ([0116] In one example embodiment, each agent 104 is described by a tuple $(\mathcal{S}, \{\mathcal{A}_i\}_{i \in \mathcal{N}}, P, \{R_i\}_{i \in \mathcal{N}}, \mathcal{G}, \{C_i\}_{i \in \mathcal{N}}, \gamma)$ where $\mathcal{S}$ comprises a state space shared by the agents 104, a graph $\mathcal{G}$ represents the peer-to-peer communication network 166, $\mathcal{A} = \prod_{i=1}^{n} \mathcal{A}_i$ comprises a joint action space of the agents 104, $R_i: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ and $C_i: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ comprise local rewards and cost functions of corresponding agent i, $P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0,1]$ comprises a state transition probability of the Markov decision process, and $\gamma \in (0,1)$ denotes a discount factor.; [0117] In one example embodiment, an objective of the agents 104 is to collaboratively [taking an expectation with respect to a probability distribution of the respective reward values] maximize a globally average return over the peer-to-peer communication network 166, dictated by $R(s, a) = n^{-1} \sum_{i \in \mathcal{N}} R_i(s, a)$, with only local observations of the local rewards, [induced by the mixed deterministic Markov policy] subject to the corresponding safety constraints dictated by $C_i(s, a)$, and wherein each agent 104 is associated with m of the cost functions and $C_i(s, a)$ comprises a mapping $\mathcal{S} \times \mathcal{A}$ to $\mathbb{R}^m$.).
Lu and Vallam are combinable for the same rationale as set forth above with respect to claim 1.
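
For orientation, the min-max saddle-point formulation that Lu [0032] describes in words is commonly written in the generic form below. This is a textbook-style sketch rather than an equation quoted from Lu or from the application. The symbols $J_R$ (expected long-term return), $J_C$ (vector of expected long-term costs), $c$ (constraint bounds), and $\lambda$ (Lagrange multipliers) are assumed notation.

\[
\min_{\theta} \; \max_{\lambda \ge 0} \; \mathcal{L}(\theta, \lambda) = -J_R(\theta) + \lambda^{\top} \bigl( J_C(\theta) - c \bigr)
\]

Under these assumptions, minimizing over $\theta$ optimizes the loss $-J_R(\theta)$, while maximizing over $\lambda \ge 0$ penalizes any violation of $J_C(\theta) \le c$, which corresponds to the two optimization processes named in the quoted paragraph.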

Regarding claim 6 and analogous claim 14, Lu, as modified by Vallam, teaches The system, The non-transitory machine-readable medium of claims 1, 9, respectively.
Lu teaches wherein: the communication interface further receives a learning rate value, and the incrementing or decrementing is further based on the learning rate value ([0035] In one example embodiment, a decentralized [the communication interface further receives a learning rate value] policy gradient (PG) method, referred to herein as Safe Dec-PG, is utilized to perform policy optimization based on this D-CMDP model over a network. Convergence guarantees, together with numerical results, showcase the superiority of the disclosed algorithm. The decentralized PG algorithm accounts for the coupled safety constraints with a quantifiable convergence rate in multi-agent reinforcement learning. The [the incrementing or decrementing is further based on the learning rate value] algorithm solves a class of decentralized stochastic nonconvex-concave minimax optimization problems, where both the algorithm design and corresponding theoretical analysis are of independent interest, and have applications in decentralized computing under constraints, logical neural networks, games, robotics, recommender systems, and the like.).
Lu and Vallam are combinable for the same rationale as set forth above with respect to claim 1.

Regarding claim 7 and analogous claim 15, Lu, as modified by Vallam, teaches The system, The non-transitory machine-readable medium of claims 1, 9, respectively.
Lu teaches wherein the incrementing or decrementing is based on respective differences between the constraints and the aggregated resource contribution values associated with respective constraints ([0032] In one example embodiment, the Lagrangian framework is used to formulate the constrained optimization problem as a min-max saddle-point problem, where there are two processes of optimization involved: [incrementing or decrementing is based on respective differences between the constraints and the aggregated resource contribution values associated with respective constraints] minimization and maximization. The minimization step deals with optimizing the loss function while the maximization step enforces the constraints.).
Lu and Vallam are combinable for the same rationale as set forth above with respect to claim 1.

Regarding claim 8 and analogous claim 16, Lu, as modified by Vallam, teaches The system, The non-transitory machine-readable medium of claims 1, 9, respectively.
Lu teaches wherein determining policies for the plurality of agents is performed on a subset of the plurality of agents at each time step ([0119] In one example embodiment, [performed on a subset of the plurality of agents at each time step] at time t, each agent i selects an action $a_i^t$ given a state $s^t$ according to a [determining policies for the plurality of agents] corresponding local policy, $\pi_i: \mathcal{S} \to \Delta(\mathcal{A}_i)$, which is parametrized as $\pi_{w_i}$ by a parameter $w_i \in \Theta_i$ with dimension $d_i$; and each agent 104 learns a joint policy $\pi_\theta: \mathcal{S} \to \Delta(\mathcal{A})$ given by $\pi_\theta(s, a) = \prod_{i \in \mathcal{N}} \pi_{w_i}(s, a_i)$ with $\theta = [w_1^T \ldots w_n^T]^T \in \mathbb{R}^d$, and where $d = \sum_{i=1}^{n} d_i$ denotes a whole problem dimension.).
Lu and Vallam are combinable for the same rationale as set forth above with respect to claim 1.
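The quoted passage describes a joint policy that factors into per-agent local policies. As a rough illustration of that factorization, and not a reproduction of Lu's implementation, the probability of a joint action can be computed as the product of each agent's local action probability. The tabular policy representation and the names `LocalPolicy` and `joint_action_probability` are assumptions made for this sketch.

```python
import math
from typing import Dict, List, Sequence

# Hypothetical tabular form: each local policy maps a state label to a
# probability distribution over that agent's own actions.
LocalPolicy = Dict[str, List[float]]


def joint_action_probability(
    local_policies: Sequence[LocalPolicy],
    state: str,
    joint_action: Sequence[int],
) -> float:
    """Probability of a joint action under a product of independent local policies."""
    return math.prod(
        policy[state][action]
        for policy, action in zip(local_policies, joint_action)
    )


# Two agents, one shared state "s0", two actions each.
policies = [
    {"s0": [0.5, 0.5]},
    {"s0": [0.25, 0.75]},
]
# Agent 0 picks action 1, agent 1 picks action 0.
probability = joint_action_probability(policies, "s0", (1, 0))
print(probability)
```

Each agent only needs its own table to sample an action; the product is needed only when the probability of the combined action is evaluated.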

Claims 4-5, 12-13, 20 are rejected under 35 U.S.C. 103 as being unpatentable over Lu et al. (U.S. Pre-Grant Publication No. 20230113168, hereinafter 'Lu'), in view of Vallam et al. (U.S. Pre-Grant Publication No. 20200074328, hereinafter 'Vallam'), and further in view of Sadeghianpourhamami et al. (U.S. Pre-Grant Publication No. 20220041076, hereinafter 'Sadeghianpourhamami').

Regarding claim 4 and analogous claims 12, 20, Lu, as modified by Vallam, teaches the system, the non-transitory machine-readable medium, and the method of claims 1, 9, and 17, respectively.
Lu, as modified by Vallam, fails to teach wherein the final plurality of costs is a weighted average of the costs over multiple iterative steps.
Sadeghianpourhamami teaches wherein the final plurality of costs is a weighted average of the costs over multiple iterative steps ([0053] As will be described below, the agent selector policy may be trained to select the control agent that achieves the best predicted performance score. What is considered to be “best” may vary according to the embodiment, and may include, for example, lowest cost, most efficient, highest usage of renewable energy, complexity of the control policy (e.g. lower complexity may be considered better), response time of the control agent (e.g. a shorter amount of time to solve the problem may be considered better), or any combination thereof. In an embodiment, the score may be based on a [final plurality of costs is a weighted average of the costs over multiple iterative steps] weighted average of cost/reward and control policy complexity. In an embodiment, the control agent(s) with the best score is identified, and if there are multiple control agents having the best score, then one of these control agents may be selected based on some factor(s). The factor may be a lowest complexity of its control policy relative to the other control agent(s) with the best score, and/or a faster performing agent. Accordingly, the best control agent may be the agent that is predicted to achieve the lowest cost function or highest reward function for the given control environment state.).
Lu, Vallam, and Sadeghianpourhamami are considered to be analogous to the claimed invention because they are in the same field of machine learning. In view of the teachings of Lu and Vallam, it would have been obvious for a person of ordinary skill in the art to apply the teachings of Sadeghianpourhamami to Lu before the effective filing date of the claimed invention in order to improve control of methods and systems resulting from models or policies having improved accuracy (cf. Sadeghianpourhamami, [0043] Such methods and systems may provide for improved control of power or energy systems. The improved control may result from control models or policies having improved accuracy, and/or from control models or policies having improved scalability, for example meaning the systems may have a higher number of controllable assets.).

Regarding claim 5 and analogous claim 13, Lu, as modified by Vallam, teaches the system and the non-transitory machine-readable medium of claims 1 and 9, respectively.
Lu, as modified by Vallam, fails to teach wherein the final reward value is a weighted average of the respective reward values over multiple iterative steps.
Sadeghianpourhamami teaches wherein the final reward value is a weighted average of the respective reward values over multiple iterative steps ([0053] As will be described below, the agent selector policy may be trained to select the control agent that achieves the best predicted performance score. What is considered to be “best” may vary according to the embodiment, and may include, for example, lowest cost, most efficient, highest usage of renewable energy, complexity of the control policy (e.g. lower complexity may be considered better), response time of the control agent (e.g. a shorter amount of time to solve the problem may be considered better), or any combination thereof. In an embodiment, the score may be based on a [final reward value is a weighted average of the respective reward values over multiple iterative steps] weighted average of cost/reward and control policy complexity. In an embodiment, the control agent(s) with the best score is identified, and if there are multiple control agents having the best score, then one of these control agents may be selected based on some factor(s). The factor may be a lowest complexity of its control policy relative to the other control agent(s) with the best score, and/or a faster performing agent. Accordingly, the best control agent may be the agent that is predicted to achieve the lowest cost function or highest reward function for the given control environment state.).
Lu, Vallam, and Sadeghianpourhamami are combinable for the same rationale as set forth above with respect to claim 4.
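The limitation at issue in claims 4 and 5 is a final cost or reward value formed as a weighted average over multiple iterative steps. A minimal sketch of that computation follows; it illustrates the claim wording only, not the score described in Sadeghianpourhamami, and the helper name `weighted_average`, the example values, and the weights are assumptions.

```python
from typing import Sequence


def weighted_average(values: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average of per-iteration values (hypothetical helper)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Costs observed at three iterative steps, with later steps weighted more heavily.
costs_per_iteration = [4.0, 2.0, 1.0]
iteration_weights = [1.0, 2.0, 3.0]
final_cost = weighted_average(costs_per_iteration, iteration_weights)
print(final_cost)
```

The same helper applies to reward values; only the input sequence changes.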

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
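The reply periods in the paragraph above reduce to calendar-month arithmetic from the mailing date. The sketch below works through it using the Apr 02, 2026 mailing date shown for the final rejection in the timeline on this page. It is a simplified illustration under stated assumptions: the helper `add_months` is hypothetical, and the sketch ignores weekend and holiday adjustments, extension fees, and the advisory-action rule described above.

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Add calendar months, clamping to the last day of a shorter month."""
    index = start.month - 1 + months
    year = start.year + index // 12
    month = index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


mailing_date = date(2026, 4, 2)  # final rejection mailing date from the timeline

first_reply_cutoff = add_months(mailing_date, 2)   # two-month window for a first reply
shortened_deadline = add_months(mailing_date, 3)   # three-month shortened statutory period
statutory_maximum = add_months(mailing_date, 6)    # six-month outer limit

print(first_reply_cutoff.isoformat())
print(shortened_deadline.isoformat())
print(statutory_maximum.isoformat())
```

Any date computed this way should be checked against the mailing date printed on the office action itself before it is docketed.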
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MAGGIE MAIDO whose telephone number is (703) 756-1953. The examiner can normally be reached M-Th: 6am - 4pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael Huntley can be reached on (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/MM/Examiner, Art Unit 2129
/MICHAEL J HUNTLEY/Supervisory Patent Examiner, Art Unit 2129
Prosecution Timeline

Nov 09, 2022: Application Filed
Sep 25, 2025: Non-Final Rejection mailed (§103)
Jan 26, 2026: Applicant Interview (Telephonic)
Jan 26, 2026: Examiner Interview Summary
Jan 26, 2026: Response Filed
Apr 02, 2026: Final Rejection mailed (§103)
May 21, 2026: Examiner Interview Summary
May 21, 2026: Applicant Interview (Telephonic)
Precedent Cases

Applications granted by this same examiner with similar technology

17/196,689 (Patent 12639595): INFORMATION PROCESSING DEVICE, INFORMATION COMPUTING METHOD, AND NON-TRANSITORY COMPUTER READABLE STORAGE MEDIUM. 5y 2m to grant; granted May 26, 2026.
17/330,099 (Patent 12602603): MULTI-AGENT INFERENCE. 4y 10m to grant; granted Apr 14, 2026.
17/392,319 (Patent 12596933): CONTEXT-AWARE ENTITY LINKING FOR KNOWLEDGE GRAPHS TO SUPPORT DECISION MAKING. 4y 8m to grant; granted Apr 07, 2026.
17/062,058 (Patent 12579463): GENERATIVE REASONING FOR SYMBOLIC DISCOVERY. 5y 5m to grant; granted Mar 17, 2026.
17/659,028 (Patent 12579452): EVALUATION SCORE DETERMINATION MACHINE LEARNING MODELS WITH DIFFERENTIAL PERIODIC TIERS. 3y 11m to grant; granted Mar 17, 2026.
Study what changed to get past this examiner. Based on 5 most recent grants.
Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 62%
With Interview (+27.6%): 89%
Median Time to Grant: 4y 1m (~6m remaining)
PTA Risk: Moderate
Based on 39 resolved cases by this examiner. Grant probability derived from career allowance rate.