Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Specification

The disclosure is objected to because of the following informalities:
On pg. 4, ln 2, "joint control polices" should read "joint control policies".
On pg. 4, ln 16, "The autonomous may be configure to move…" appears to be missing a noun.
On pg. 4, ln 25, "the control data may transmitted to one or more actuators…" is not grammatically correct.
Appropriate correction is required.

Claim Rejections - 35 USC § 102

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1-4, 7, 11, 18-22 and 25 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by "A Unified Game-Theoretic Approach to Multiagent Reinforcement Learning" by Lanctot et al. (hereinafter Lanctot).
Per claim 1, Lanctot discloses A method (Abstract… computer-implemented Multiagent Reinforcement Learning (MARL) method) performed by one or more computers (Abstract and Section 1… uses a "scalable implementation" in which "policies are represented as separate neural networks", intrinsically requiring one or more computers) for learning a respective control policy for each of a plurality of agents interacting with an environment (Abstract… "To achieve general intelligence, agents must learn how to interact with others in a shared environment: this is the challenge of multiagent reinforcement learning (MARL)"), the method comprising, at each of a plurality of iterations (Section 3, Algorithm 1… "Policy-Space Response Oracles" operates over a plurality of iterative epochs, "while epoch e in {1, 2, ...} do"): obtaining data specifying a current joint control policy for the plurality of agents as of the iteration (Section 3, Algorithm 1… initializes and utilizes a joint strategy profile, e.g., "meta-strategies" for the agents at each iteration, "Initialize meta-strategies σ_i = UNIFORM(Π_i)", where the combination of these individual meta-strategies across all players forms the current joint control policy), the current joint control policy specifying a respective current control policy for each of the plurality of agents (Section 3.1, Algorithm 1… during the iterations, the algorithm obtains the opponents' current respective control policies to sample environment interactions, "Sample π_-i ~ σ_-i", σ_-i being the other players' meta-strategy); and updating the current joint control policy by updating each of the respective current control policies for each of the plurality of agents (Section 3, Algorithm 1… updates the joint control policy by computing new meta-strategies (respective current control policies) for the agents, "Compute a meta-strategy σ from U_Π" and "Output current solution strategy σ_i for player i"),
comprising: for each agent (Section 3, Algorithm 1… performs the subsequent oracle computation steps individually for every agent via a loop, "for player i ∈ [[n]] do"): generating a respective reward estimate for each of a plurality of alternate control policies that is an estimate of a reward received by the agent if the agent is controlled using the alternate control policy while the other agents are controlled using the respective current control policies (Section 4 and Fig. 7… the oracles use the Reactor RL architecture, which relies on "Retrace(λ) for off-policy policy evaluation", where off-policy evaluation mathematically generates expected reward/value estimates, such as the Q(x,a) action-values shown in Fig. 7, for alternate target control policies based on trajectory data, while the other agents act using their respective current control policies; Section 3, Algorithm 1… teaches generating expected utilities (reward estimates) for alternate control policies played against the opponents' current control policies in order to find a best response, the PSRO algorithm using deep reinforcement learning to search for a best response, "Train oracle π'_i over ρ ~ (π'_i, π_-i)"); and computing a best response for the agent from the respective reward estimates (Section 3… computing a best response based on the reward estimates derived via RL, "The meta-game is represented as an empirical game, starting with a single policy (uniform random) and growing, each epoch, by adding policies ("oracles") that approximate best responses to the meta-strategy of the other players"; Algorithm 1… computes the best response oracle via the training loop, "Train oracle π'_i"); and updating the respective current control policies for the agents using the best responses for the agents (Section 3, Algorithm 1… updates the respective current control policies (the meta-strategies) by incorporating the newly calculated best responses (the oracles), e.g., the best response oracle π'_i is added to the agent's available policy set, "Π_i = Π_i ∪ {π'_i}", then new expected utilities are computed from this expanded policy set, "Compute missing entries in U_Π from Π", and the respective current control policies are updated using the set containing the best responses, "Compute a meta-strategy σ from U_Π").

Per claim 2, Lanctot discloses claim 1, further disclosing at each iteration t (Section 3, Algorithm 1… operating iteratively in a loop, "while epoch e in {1, 2, ...} do", ending with updating strategies), updating the respective current control policies comprises: for each agent p, updating a set Π_p^t-1 that includes each previous best response for the agent p computed at previous iterations to include the best response for the current iteration, generating an updated set Π_p^t (Section 3, Algorithm 1… maintaining a set of policies for each agent and expanding it by adding the newly trained oracle (approximate best response) at each epoch, where the update rule for each player is shown in Algorithm 1 as "Π_i = Π_i ∪ {π'_i}", π'_i being the oracle trained in the current iteration; Section 3… "Define Π_i = Π_i ∪ π'_i, as the policy space including the currently learning oracles…"); determining a combined set Π^t = ⊗_p Π_p^t, wherein ⊗_p is an outer product across elements of the sets Π_i^t for each agent i (Section 2… forming a joint strategy space constituting the combinations of all individual agents' policies to evaluate empirical game expected utilities, the joint strategy space defined as Π = Π_1, …, Π_n; Section 3.
… tracking expected utilities for the joint combinations inside an empirical tensor U_Π, noting the space required is K^n for K policies and n players, where the tensor containing all policy combinations represents the Cartesian outer product of the individual sets, "the current empirical payoff tensor U_Π (computed via multiple tensor dot products"); and updating the current joint control policy by generating a distribution σ^t across Π^t comprising, for each joint control policy π in Π^t, a likelihood that the plurality of agents execute the joint control policy π (Section 3, Algorithm 1… computes a meta-strategy, which is a joint probability distribution over the combined strategy space, Algorithm 1 updating the distribution, "Compute a meta-strategy σ from U_Π"; Section 2… this is a probability distribution assigning a likelihood of execution: "by sampling from a mixture (distribution) over them σ_i ∈ ∆(Π_i)", where the combined mixture dictates the likelihood that the joint control policies in the combined set Π^t are executed).

Per claim 3, Lanctot discloses claim 2, further disclosing at a first iteration: the current joint control policy is determined to be an initial joint control policy π^0 (Section 3… initializes the empirical game at the first iteration with a single starting joint policy, "The meta-game is represented as an empirical game, starting with a single policy (uniform random)..."); and the combined set is an initial combined set Π^0 = {π^0} (Algorithm 1… initializing the distribution prior to the loop, "Initialize meta-strategies σ_i = UNIFORM(Π_i)"; since Π_i initially consists of just the single starting policy, the initial combined set naturally consists of exactly one joint control policy).

Per claim 4, Lanctot discloses claim 1, further disclosing computing the best response BR_p^t for agent p at iteration t comprises computing or estimating: BR_p^t ∈ argmax_{π'_p ∈ Π_p*} ∑ σ^t(π) R_p(π'_p, π_-p)...
(Section 3, Algorithm 1… finding a best response policy that maximizes expected utility/reward against the mixed probability distribution of the opponents' policies, done via reinforcement learning (RL) by computing an approximate best response to the opponents' joint meta-strategy distribution σ_-i, "Sample π_-i ~ σ_-i" and "Train oracle π'_i over ρ ~ (π'_i, π_-i)"; Section 3.1… the expected utility value function is written as u_i(π_i,k, σ_-i), where using RL to train the oracle to maximize this expected return directly estimates the claimed argmax formula against the opponents' joint distribution).

Per claim 7, Lanctot discloses claim 1, further disclosing updating the current joint control policy comprises updating the current joint control policy using a metasolver that is configured to select a correlated equilibrium or a coarse-correlated equilibrium (Section 3.1… able to update the meta-strategy using meta-solvers configured to calculate a correlated equilibrium, "A meta-strategy solver takes as input the empirical game... We try three different solvers: regret-matching, Hedge, and projected replicator dynamics"; Appendix A.1… "Regret matching (RM) is a simple adaptive procedure that leads to correlated equilibria [31]...").

Per claim 11, Lanctot discloses claim 1, further disclosing executing the control policy generated during the final iteration (Section 1… generating the control policies specifically to be deployed and executed, "we assume centralized training for decentralized execution"; Algorithm 1… outputs the policies after the final iteration, "Output current solution strategy σ_i for player i"; Section 4… experiments executing the generated policies in testing environments to show they function).

Claims 18 and 19 are substantially similar in scope and spirit to claim 1. Therefore, the rejection of claim 1 is applied accordingly.
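For clarity of the record, the overall shape of the PSRO procedure cited above (Lanctot, Algorithm 1) can be sketched as follows. This is an illustrative simplification only: the helper names (`train_oracle`, `expected_utility`, `psro_sketch`) and the uniform placeholder standing in for a meta-solver are the examiner's hypothetical constructions, not Lanctot's implementation.

```python
import itertools

def psro_sketch(n_players, train_oracle, expected_utility, n_epochs=3):
    """Illustrative sketch of the PSRO epoch loop (cf. Lanctot, Algorithm 1).

    train_oracle(p, meta, policy_sets) -> a new approximate-best-response
        policy for player p trained against the opponents' meta-strategies.
    expected_utility(joint_policy) -> tuple of payoffs, one per player.
    """
    # Each player starts with a single (e.g., uniform random) policy.
    policy_sets = [[f"pi_{p}_0"] for p in range(n_players)]
    # Initialize meta-strategies to uniform over each player's policy set.
    meta = [[1.0] for _ in range(n_players)]
    for epoch in range(n_epochs):
        for p in range(n_players):
            # Train an "oracle" best response against the opponents'
            # current meta-strategies, then add it to player p's set.
            oracle = train_oracle(p, meta, policy_sets)
            policy_sets[p].append(oracle)
        # Fill in the empirical payoff tensor U_Pi over all joint policies.
        payoffs = {joint: expected_utility(joint)
                   for joint in itertools.product(*policy_sets)}
        # Compute new meta-strategies from U_Pi (uniform placeholder here,
        # standing in for a meta-solver such as projected replicator dynamics).
        meta = [[1.0 / len(s)] * len(s) for s in policy_sets]
    return policy_sets, meta
```

The sketch makes explicit the claim mapping relied on above: the per-player policy-set expansion, the empirical payoff tensor over joint policies, and the per-epoch recomputation of the meta-strategy.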
Claims 20-22 and 25 are substantially similar in scope and spirit to claims 2-4 and 7, respectively. Therefore, the rejections of claims 2-4 and 7 are applied accordingly.

Claim Rejections - 35 USC § 102

Claims 1-12 and 18-25 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by "Multi-Agent Training beyond Zero-Sum with Correlated Equilibrium Meta-Solvers" by Marris et al. (hereinafter Marris, cited in IDS).

Per claim 1, Marris discloses A method (Section 1… the Joint Policy-Space Response Oracles (JPSRO) algorithm trains sets of policies for n-player games, "we propose a novel training algorithm, Joint Policy-Space Response Oracles (JPSRO), to train policies on n-player, general-sum extensive form games.
JPSRO requires the solution of a meta-game, and we propose using MG(C)CE as a metasolver") performed by one or more computers (Section 6… JPSRO is performed by computers, as indicated by the computational experiments and scaling properties discussed, "Experiments were ran for up to 6 hours, after which they were terminated"; Section 7… "This paper provides a novel approach which is computationally tractable, supports general-support solutions, and has favourable scaling properties when the solution is full-support") for learning a respective control policy for each of a plurality of agents interacting with an environment (Section 1… JPSRO is for training agents, e.g., learning a control policy, in multi-agent games, defining these games as environments, "We mean games (also called environments) in a very general sense"), the method comprising, at each of a plurality of iterations (Section 5, Algorithm 2… the JPSRO algorithm (Algorithm 2) utilizes an iterative loop for time step t, "for t←{1,...} do"): obtaining data specifying a current joint control policy for the plurality of agents as of the iteration, the current joint control policy specifying a respective current control policy for each of the plurality of agents (Section 5… at each iteration, the meta-solver (MS) evaluates and tracks a joint distribution of policies, "The MS is a function taking in the ER and returning a joint distribution, σ_t, over Π_0:t, such that σ_t(π) is the probability to play joint policy π ∈ Π_0:t at iteration t", where the distribution σ_t defines the current joint policy specifying what respective policies each player plays); and updating the current joint control policy by updating each of the respective current control policies for each of the plurality of agents (Section 5, Algorithm 2… JPSRO expands the policy set for each player over the iterations, updating the overall joint policy, where Algorithm 2 specifies iterating over each player, "for p←{1,...,n} do", to update the policy sets, Π_p^0:t ← Π_p^0:t-1 ∪ {π_p^t, ...}, and then updating the joint distribution via the meta-solver, σ_t ← MS(G_0:t)), comprising: for each agent: generating a respective reward estimate for each of a plurality of alternate control policies that is an estimate of a reward received by the agent if the agent is controlled using the alternate control policy while the other agents are controlled using the respective current control policies (Section 5… calculation of the expected return (reward estimate) an agent receives when evaluating alternate policies against the opponents' current policies, "The expected return (ER), an NF game (G_0:t)_p=1..n, is tracked for each joint policy found so far such that G_0:t^p(π) is the expected return to player p when playing joint policy π. We also define G_p* to be the payoff over all possible joint policies."; Section 5.1… the best response (BR) operator calculates this estimate explicitly as ∑ σ_t(π_-p) G*_p(π*_p, π_-p), where this formula estimates the reward G*_p for player p if they play alternate policy π*_p while the other players (-p) are controlled by their respective current policies π_-p drawn from the joint distribution σ_t); and computing a best response for the agent from the respective reward estimates (Section 5.1… JPSRO computes a best response (BR) that maximizes the estimated reward); and updating the respective current control policies for the agents using the best responses for the agents (Section 5.1… expands the agents' current control policies by including the newly computed best responses, "expands the player policy set, Π_p^0:t+1 = Π_p^0:t ∪ Π_p^t+1, where Π_p^t+1 = {BR_p^t+1}"; Section 5, Algorithm 2… shown in line 7, before using these updated sets to generate a new joint control policy distribution in line 9 via the meta-solver, σ_t ← MS(G_0:t)).
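The best-response selection relied on in the claim 1 mapping above (choosing, for each player, the candidate policy maximizing the expected return against opponents drawn from the current joint distribution) can be sketched as follows. The function name, the payoff callback, and the toy policy labels are the examiner's illustrative assumptions, not Marris's code.

```python
def best_response(p, candidate_policies, opponent_dist, payoff):
    """Select the candidate maximizing sum over pi_-p of
    sigma_t(pi_-p) * G*_p(pi*_p, pi_-p) (cf. Marris, Section 5.1).

    opponent_dist: dict mapping an opponent joint policy pi_-p to its
        probability sigma_t(pi_-p) under the current joint distribution.
    payoff(p, own_policy, opp_policy): expected return G*_p to player p.
    """
    def expected_return(own_policy):
        # Reward estimate for one alternate control policy.
        return sum(prob * payoff(p, own_policy, opp)
                   for opp, prob in opponent_dist.items())
    return max(candidate_policies, key=expected_return)
```

For example, with candidates "a" and "b", opponent distribution {"x": 0.2, "y": 0.8}, and payoffs G*_p(a,x)=1, G*_p(b,y)=2 (zero otherwise), the expected returns are 0.2 and 1.6, so "b" is the best response.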
Per claim 2, Marris discloses claim 1, further disclosing at each iteration t, updating the respective current control policies comprises: for each agent p, updating a set Π_p^t-1 that includes each previous best response for the agent p computed at previous iterations to include the best response for the current iteration, generating an updated set Π_p^t (Section 5.1… expanding the policy sets for each agent (player) by including the current iteration's best response, "At iteration t+1 each set, Π_p^0:t, can be expanded... which expands the player policy set, Π_p^0:t+1 = Π_p^0:t ∪ Π_p^t+1, where Π_p^t+1 = {BR_p^t+1}"); determining a combined set Π^t = ⊗_p Π_p^t, wherein ⊗_p is an outer product across elements of the sets Π_i^t for each agent i (Section 5… tracking the sets of policies found for each player p and determining the combined set of joint policies by taking the Cartesian product (outer product) across them, "The set of all policies found so far for player p is denoted Π_p^0:t and the set of joint policies is denoted Π_0:t = ⊗_p Π_p^0:t"); and updating the current joint control policy by generating a distribution σ^t across Π^t comprising, for each joint control policy π in Π^t, a likelihood that the plurality of agents execute the joint control policy π (Section 5… utilizing a meta-solver to output a distribution specifying the likelihood of executing each joint control policy, "The MS is a function taking in the ER and returning a joint distribution, σ_t, over Π_0:t, such that σ_t(π) is the probability to play joint policy π ∈ Π_0:t at iteration t").
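The combined-set and distribution limitations mapped above can be illustrated concretely: the combined set is the Cartesian (outer) product of the per-agent policy sets, and the joint distribution assigns each joint policy a likelihood of execution. The helper names below are the examiner's illustrative assumptions, not the reference's implementation.

```python
import itertools
import random

def joint_policy_space(policy_sets):
    """Combined set Pi^t: the outer (Cartesian) product of per-agent sets."""
    return list(itertools.product(*policy_sets))

def sample_joint_policy(sigma, rng=None):
    """Draw a joint policy pi from sigma^t, given as a dict mapping
    each joint policy in Pi^t to its probability of being executed."""
    rng = rng or random.Random()
    joints = list(sigma)
    weights = [sigma[j] for j in joints]
    return rng.choices(joints, weights=weights, k=1)[0]
```

With per-agent sets {a, b} and {x, y}, the combined set contains the four joint policies (a,x), (a,y), (b,x), (b,y), and σ^t supplies the likelihood that the agents jointly execute each one.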
Per claim 3, Marris discloses claim 2, further disclosing at a first iteration: the current joint control policy is determined to be an initial joint control policy π^0; and the combined set is an initial combined set Π^0 = {π^0} (Section 5… Algorithm 2 (JPSRO) shows initializing the sets with initial control policies and evaluating an initial combined set to define the initial joint control policy, where steps 1-3 disclose "1: Π_1^0, ..., Π_n^0 ← {π_1^0}, ..., {π_n^0}; 2: G_0 ← ER(Π_0); 3: σ_0 ← MS(G_0)").

Per claim 4, Marris discloses claim 1, further disclosing computing the best response BR_p^t for agent p at iteration t comprises computing or estimating: BR_p^t ∈ argmax_{π'_p ∈ Π_p*} ∑ σ^t(π) R_p(π'_p, π_-p)... (Section 5.1… the best response operator formula, "BR_p^t+1 …", where G_p* serves identically as the expected reward estimate R_p).

Per claim 5, Marris discloses claim 1, further disclosing computing a best response for the agent p comprises computing a respective best response for the agent p for each control policy π_p that has a non-zero likelihood under the joint control policy corresponding to the previous iteration (Section 5.1… the JPSRO(CE) framework computes a best response based only on the policies that the agent has a non-zero probability of receiving as a recommendation, "In practice, we only calculate a BR for positive support policies").

Per claim 6, Marris discloses claim 5, further disclosing computing the best response BR_p^t for agent p at iteration t corresponding to control policy v_p comprises computing or estimating: BR_p^t(v_p)… (Section 5.1… computation, using the conditional probability σ(π_-p | π_p), in the JPSRO(CE) best response operator formula, "BR_p^t+1(π_p) …").
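For completeness of the record, the conditional best-response operator quoted in truncated form in the claim 6 mapping above can be written out as follows. This is the examiner's reconstruction from the Section 5.1 notation; the exact typesetting in the reference may differ:

\[
\mathrm{BR}_p^{t+1}(\pi_p) \in \arg\max_{\pi_p^* \in \Pi_p^*} \sum_{\pi_{-p}} \sigma^t(\pi_{-p} \mid \pi_p)\, G_p^*(\pi_p^*, \pi_{-p})
\]

That is, for each recommended policy π_p with positive support, the agent best-responds to the opponents' policies weighted by their conditional probability given that recommendation, corresponding to the claimed per-policy best response BR_p^t(v_p).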
Per claim 7, Marris discloses claim 1, further disclosing updating the current joint control policy comprises updating the current joint control policy using a metasolver that is configured to select a correlated equilibrium or a coarse-correlated equilibrium (Section 5.2… "We propose that (C)CEs are good candidates as meta-solvers (MSs)… we propose three flavours of equilibrium MSs…", using Correlated Equilibrium (CE) and Coarse Correlated Equilibrium (CCE) solvers).

Per claim 8, Marris discloses claim 7, further disclosing the meta-solver is configured to use a Gini impurity measure to select a correlated equilibrium or a coarse-correlated equilibrium (Section 3… the use of Gini impurity to select an equilibrium from the polytope, introducing the "Maximum Gini (Coarse) Correlated Equilibrium (MG(C)CE)" as the novel solution concept functioning as the meta-solver).

Per claim 9, Marris discloses claim 8, further disclosing the meta-solver is configured to compute the current joint control policy x* by maximizing: max x … (Section 3.1… the quadratic program solved by the meta-solver, "Gini objective: max_σ −1/2 σ^T σ … A_p σ ≤ ϵ ∀p … σ ≥ 0, e^T σ = 1").

Per claim 10, Marris discloses claim 9, further disclosing the meta-solver computes the current joint policy x* by computing one of: x* = CA^T α* + Cβ* + b or x* = CA^T α* + b … (Section 4.2… Equations 11 and 12 provide these exact optimal solutions depending on support parameters in "Theorem 2 (Scalable Representation)").

Per claim 11, Marris discloses claim 1, further disclosing executing the control policy generated during the final iteration (Section E.4… evaluating the policies by measuring value (undiscounted return), which requires execution; Section 4.1… playing the game with the final generated distributions: "At action selection time only (C)CEs require further coordination...
[the] correlation device will recommend actions from the equilibrium that was previously agreed upon", demonstrating that the policy is executed during play).

Per claim 12, Marris discloses claim 1, further disclosing the reward estimate for each alternate control policy is based on rewards obtained by controlling the respective agent to perform a task by acting upon a real world environment, the controlling being performed by generating control data for the agent based on the alternate control policy (Section 6… experiments are conducted via game environments that simulate potentially real world environments, i.e., OpenSpiel games such as Kuhn Poker, Trade Comm, and Sheriff).

Claims 18 and 19 are substantially similar in scope and spirit to claim 1. Therefore, the rejection of claim 1 is applied accordingly. Claims 20-25 are substantially similar in scope and spirit to claims 2-7, respectively. Therefore, the rejections of claims 2-7 are applied accordingly.

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Patents and/or related publications are cited in the Notice of References Cited (Form PTO-892) attached to this action to further show the state of the art with respect to multi-agent reinforcement learning systems for jointly updating control policies.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to ALAN CHEN whose telephone number is (571) 272-4143. The examiner can normally be reached M-F 10-7. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Kamran Afshar, can be reached at (571) 272-7796. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/ALAN CHEN/
Primary Examiner, Art Unit 2125