DETAILED ACTION
This non-final office action is responsive to application 17/798,141 with applicant’s amendments and request for reconsideration with RCE submitted on 04 May 2026.
Claims 1-3 and 6 are currently pending. Claims 1 and 6, the independent claims, are amended; claims 4-5 and 7-10 are cancelled.
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 05/04/26 has been entered.
Response to Remarks
Applicant’s responsive remarks filed 05/04/26 are considered together with amendments in light of the outstanding issues which are addressed as follows.
Objection to the specification is withdrawn upon further consideration of prior correction.
Claim interpretation of functional language is withdrawn; the claims are read in plain language under the standard of broadest reasonable interpretation.
The rejection under 35 U.S.C. 101 (abstract idea) is hereby withdrawn. In particular, eligibility is considered satisfied at Step 2A, Prong 2 (practical application) by the amendment requiring a real mobile body for implementing the cooperative action of a reinforcement learning agent. Remarks at [P.10 of 17 Last¶] persuasively support functionality that enables performance of coordinated movements, i.e. actions. The preponderance of the facts favors a finding of eligibility and therefore the rejection is withdrawn.
The rejection under 35 U.S.C. 103 (obviousness) is maintained over the previously applied combination of prior art. Applicant’s remarks point to the amendments and traverse on the basis that Hu does not teach the amended limitation of the third step calculating a difference. However, Hu was not the reference applied to this limitation, and the remarks to date still do not substantively address Moissinac. Moissinac is critical because it teaches the Vickrey-Clarke-Groves (VCG) mechanism in the context of multi-agent reinforcement learning to arrive at the calculated difference. Simply put, the combination as a whole amounts to applying the VCG mechanism of Moissinac to Hu’s multi-agent reinforcement learning (MARL) for cooperative actions. The additional language of “in such a way” conveys a desired effect as opposed to detailing, in a technical manner, how the reward is prevented from increasing. The aspect is interpreted as being satisfied by maximizing reward and/or conditions for negative payments/penalty as detailed below with respect to Moissinac, who indexes agents by subscripts in the technical solution. Concerning Hu, agents are assigned rewards, e.g. [0218,240], for executing actions in cooperative multi-agent reinforcement learning, CM3 algorithm Fig 5. Further, Hu provides real mobile bodies as vehicles for autonomous driving, Figs 1-2, 11, described [0059]. The ordinarily skilled artisan would likely recognize that Hu is assigned to Honda, an automotive manufacturer. Accordingly, the additional language does not provide a clear distinction from the prior art. Therefore, the rejection is maintained as obvious over Hu and Moissinac. An updated rejection is detailed below.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
Claims 1-3 and 6 are rejected under 35 U.S.C. 112(b) as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor regards as the invention. In particular, claims 1 and 6 recite the limitation "the agent" in the amendments. The antecedent basis of ‘the’ lacks clarity because “the” agent could be either, or all, of the “target agent” or the “other agents” recited in a claim drawn to a multi-agent system for reinforcement learning. Thus it is not apparent which agent is being referred to. Additionally, the similar limitation “acquire a state of the agent”, present prior to amendment, should also properly introduce an agent or each agent of the plurality of agents of the independent claims. The remaining claims depend from claims containing the identified language and fail to cure the deficiency. Accordingly, claims 1-3 and 6 are rejected under 35 U.S.C. 112(b).
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-2 and 6 are rejected under 35 U.S.C. 103 as being unpatentable over:
Hu et al., US PG Pub No 2019/0266489A1 (Honda, Tokyo) in view of
Moissinac, Béatrice, “Reinforcement Learning-based Off-Equilibrium Incentives to Approximate the VCG Mechanism” hereinafter Moissinac (Thesis, OSU).
With respect to claim 1, Hu teaches:
A learning system for performing reinforcement learning of a cooperative action by a plurality of agents under a multi-agent system in which the plurality of agents perform the cooperative action, the learning system, that is implemented by a computer {Hu discloses [0013] “a system for cooperative multi-goal, multi-agent, multi-stage (CM3) reinforcement learning”; Figs 1 and 21 show the computer system for the reinforcement learning of Fig 5A, e.g. [0084] “multi-agent reinforcement learning (MARL)”, where the agents comprise vehicles Fig 1:170,180 or shown Figs 2, 9, 14} comprising:
the plurality of agents {Hu [0004] “the N number of agents in the multi-agent environment” e.g. Fig 3:304 “Train number of N agents to learn cooperation between agents in multi-agent environment”}; and
a non-transitory memory configured to store a program {Hu [0045] “non-transitory computer-readable storage medium storing computer-executable instructions” e.g. RAM [0036] or shown Figs 1:104 and 21:2118 “memory” for executing software programs [0253-58]},
a processor configured to execute the program stored on the memory to cause the learning system {Hu [0055] “Fig.1 …CM3 reinforcement learning may include a processor 102, memory 104” [0253-58]} to:
grant a reward to the plurality of agents {Hu Fig 1:132 reward module for agent vehicles, granting is assigning [0231] “each agent may be assigned with different individual rewards” similar at [0240], 16:1608 “assigning negative reward to agents”}, wherein each of the agents includes
a second non-transitory memory configured to store a program {Hu Fig 11:1104 “memory” of each agent vehicle corresponds to Fig 1:170/180 with storage device 174/184, [0036, 45]};
a second processor configured to execute the program stored on the memory to cause the learning system {Fig 11:1102 “processor” of each agent vehicle corresponds to Fig 1:170/180, [0036,45]} to:
acquire a state of the agent {Hu discloses [0223] “set of states S describing possible configurations of all agents” introduced [0069] “agents may be completely specified by a state st ∈ S, …next state st+1” where ∈ denotes set membership};
acquire the reward {Hu [0223] “Each agent may receive a reward” again at [0069], acquire is receive and the system Fig 1 provides communication interfaces 172/182 and 150, [0057-59]};
select an action based on the state and the reward by using a decision-making model for selecting the action {Hu [0062] “select one action from a set of available actions, which results in a new state and a new reward” Fig 2 shown as loop; the decision-making model comprises the Markov decision process (MDP) based policy, e.g. [0224-23] “policy that defines how actions are selected” similarly [0209], where [0240] “policy may be based on a state” and [0187-88] “policy based on the reward function” again [0164-65] “choose the ‘best’ action… return of an action (a) under policy (π) on state (s)”}; and
execute the action {Hu Fig 5A “Execute action(s)” after sampling action i.e. selected [0224-23]. By example, [0183] “actions to include an autonomous vehicle driving maneuver of accelerating” or actions of deceleration, braking and lane shift introduced [0006]}, the processor is further configured to execute the program stored on the memory to cause the learning system to perform {Hu Fig 1:102 and 21:2114, [0055-56] “CM3 reinforcement learning may include a processor 102, a memory 104” [0252-53] “processor-executable instructions”}
a first step of, in the presence of a target agent to which the reward is to be granted, calculating an evaluation value relating to a cooperative action of other agents as a first evaluation value {Hu [0088,087-90] Eq.1 $\sum \gamma^t R(s_t, a_t)$ “value function” again at [0226,225-228]; thereby [0081] “learning to cooperate in the presence of other agents” with evaluation comprising the R-reward [0049] which can be “individual reward” [0100,0231]. Fig 4 illustrates 2-stage curriculum learning that uses local information for single-agent setting (target agent) i.e. Fig 3:302 or [0205] “first agent is the only agent present”, and then a multi-agent setting with global data Fig 3:304. By example, see Fig 6 multi-car agents where speed is a calculated value};
a second step of, in the absence of the target agent, calculating an evaluation value relating to a cooperative action of the other agents as a second evaluation value {Hu [0062] “the absence of other agents may more reliably produce the desired state configurations for learning cooperative behavior” again [0205]. Notably, second calculated value may comprise [0107-13] “joint reward” $R_g(s, a, g) := \sum_{n=1}^{N} R(s, a^n, g^n)$ “may be defined” for a joint or overall policy which is described as “weighted sum” similarly [0231,36]. See also [0090] “address the issue of multi-agent credit assignment… contribution of an agent’s chosen action and versus the average of all possible counterfactual actions a, keeping other agent’s actions fixed” and [0152] “differentiate between evaluations of the Q-function for different agents”. See also [0217,244]}; and
the target agent performs learning of the decision-making model based on the reward granted {Hu [0205] “perform training for a first agent based on a first policy …using a Markov decision process (MDP)” learning is training, the Markov-decision based policy is the model and reward includes return [0223] “Markov games may be defined… total expected return $R^n = \sum_{t=0}^{T} \gamma^t r_t^n$” simulation of agent is described, Figs 2-3 show high level and Figs 4-5 detail technical implementation}.
wherein the agent is a mobile body {Hu [0043] “An ‘agent’, as used herein, may refer to a ‘vehicle’” illustrated e.g. Figs 2, 11:170, 1:170/180}.
the decision-making model of the agent is mounted on a real mobile body after accomplishment of the reinforcement learning, thereby implementing the cooperative action {Hu see Fig 1:140,170 shows policy network i.e. model, for vehicle 170 similar Fig 11, and described e.g. [0059] “CM3 policy network 140 is stored on the storage device of the vehicle, this enables the controller to autonomously drive the vehicle around based on the CM3 policy network 140, and to make autonomous driving decisions according to the CM3 reinforcement learning” detailed Fig 5A cooperative CM3 which executes actions}.
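For illustration only, the total expected return cited from Hu [0223], i.e. a discounted sum of an agent's per-step rewards, can be sketched as below. The function name `discounted_return` and its parameters are hypothetical and do not appear in Hu; this is a minimal sketch of the standard discounted-return formula, not Hu's implementation.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma**t * rewards[t] for t = 0..T (standard discounted return).

    `rewards` holds one agent's reward at each time step and `gamma` is the
    discount factor; both names are illustrative assumptions.
    """
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total
```

With three unit rewards and a discount factor of 0.5, the terms are 1, 0.5 and 0.25.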
Hu further discloses difference error [0227], subtractive functions throughout and penalties for agents as negative reward e.g. [0241], [0005].
However, Hu does not appear to disclose the following limitation particularly when read in light of specification [0021].
Moissinac teaches:
a third step of calculating a difference between the first evaluation value and the second evaluation value as a penalty of the target agent and calculating the reward to be granted to the target agent in such a way as to prevent only the reward granted to the target agent from increasing based on the penalty {Moissinac see [P.18] Eq 3.4 $u_i(\cdot) = \sum v(\cdot) - \sum v(\cdot)$ “difference reward D(i) is implemented in the same way as $u_i$ …Thus, the utility, valuation and payment of the agent are implemented the same way as with the VCG” Vickrey-Clarke-Groves mechanism, the difference reward applied Algorithm 1 [P.16] “Compute D(i) (Equation 3.4)” titled Reinforcement Learning Algorithm using Difference Utilities, the reinforcement learning shown Fig 2.2 and multi-agent Fig 2.3 where a world (environment) with and without agent ‘i’ are a statistical difference (2.7), penalty is [P.7 ¶1] “penalizing an agent’s utilities” in such a way that [P.9 ¶1] “If the net payment is negative (i.e. agent i had a positive effect on the system), it increases agent i’s utility” considers negative (as well as positive) payments for agent subject to the subtractive function. Implementation may be performed by maximizing [P.19 ¶5 - P.20] “maximize their difference reward D(i) …maximization of D(i)”. In D(i), indexing of agents provides for the target agent, e.g. individual agent i from a set membership n∈N plural agents. The approach would be consistent with solutions described by applicant (see footnotes 1 and 2) and when read in light of specification [0017,30] and [0021]}, and
Moissinac is directed to multiagent systems for reinforcement learning thus being analogous. A person having ordinary skill in the art would have considered it obvious prior to the effective filing date to shape reward using the technique of Moissinac in combination with Hu to arrive at the invention as claimed as applying a known technique to a known device ready for improvement to yield predictable results which is accomplished by using “a well-known solution to solve resource allocation problems with multiple items and multiple agents” [P.7 ¶1] and/or stated motivation [cont’d] “motivates a truthful participation from agents by penalizing an agent’s utilities with the cost of her distortion to the system’s value” [P.7 ¶1] and which “produces a social-optimal outcome” [P.2 ¶3]. Technical contributions are noted [P.4 ¶1-2] and real world domains of navigation, robot coordination and air traffic control are disclosed per [P.14 ¶2].
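The difference-based calculation discussed above can be pictured with a small sketch of a generic VCG (Clarke pivot) computation. It is a textbook-style illustration under stated assumptions and is not code from Moissinac or Hu: the names `best_outcome`, `clarke_payment` and `difference_utility`, the tabular valuations, and the sign convention (payment equals the others' value without the target minus the others' value with the target) are all hypothetical choices made for the example.

```python
from typing import Dict, List

# valuations[agent][outcome] -> value that agent assigns to that outcome
Valuations = Dict[str, Dict[str, float]]


def best_outcome(valuations: Valuations, agents: List[str], outcomes: List[str]) -> str:
    """Outcome maximizing the summed valuations of the given agents."""
    return max(outcomes, key=lambda o: sum(valuations[a][o] for a in agents))


def clarke_payment(valuations: Valuations, outcomes: List[str], target: str) -> float:
    """Penalty of the target: others' value without it minus others' value with it."""
    agents = list(valuations)
    others = [a for a in agents if a != target]
    with_target = best_outcome(valuations, agents, outcomes)
    without_target = best_outcome(valuations, others, outcomes)
    # First evaluation value: the other agents, target present.
    first_eval = sum(valuations[a][with_target] for a in others)
    # Second evaluation value: the other agents, target absent.
    second_eval = sum(valuations[a][without_target] for a in others)
    return second_eval - first_eval


def difference_utility(valuations: Valuations, outcomes: List[str], target: str) -> float:
    """Sum over all agents (target present) minus sum over others (target absent)."""
    agents = list(valuations)
    others = [a for a in agents if a != target]
    with_target = best_outcome(valuations, agents, outcomes)
    without_target = best_outcome(valuations, others, outcomes)
    total_with = sum(valuations[a][with_target] for a in agents)
    total_without = sum(valuations[a][without_target] for a in others)
    return total_with - total_without
```

Under these assumptions, the target's own valuation of the chosen outcome minus its Clarke payment equals the value returned by `difference_utility`, which mirrors the "sum minus sum" form of the cited Eq 3.4.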
With respect to claim 2, the combination of Hu and Moissinac teaches the learning system according to claim 1. Moissinac teaches wherein
the first evaluation value corresponds to an amount of increase given by subtracting the sum of evaluation values relating to a cooperative action before the other agents perform an action in the presence of the target agent from the sum of evaluation values relating to a cooperative action after the other agents perform the action in the presence of the target agent {Moissinac [P.18] Eq 3.4 where term on the left, i.e. $\sum_{n \in N} v_n(k_n, \theta_n)$, is first evaluation value subject to subtracting the sum denoted ∑ sigma symbol, read in light of instant specification [0021,23]. Moissinac shows multi-agent RL Fig 2.3 and Eq2.7 establishes world with and without agent ‘i’, presence is described as being “in which i exists” this is reflected in the superscript of Eq3.4 which equates to Eq3.5 being same as Eq2.7. Further, the aspect of before and after is such that sequential learning is performed in multiagent reinforcement learning over time-steps [P.12 Sect2.3.1]}, and
the second evaluation value corresponds to an amount of increase given by subtracting the sum of evaluation values relating to a cooperative action before the other agents perform actions in the absence of the target agent from the sum of evaluation values relating to a cooperative action after the other agents perform the actions in the absence of the target agent {Moissinac [P.18] Eq 3.4 where term on the right, i.e. $\sum_{n \in N_{-i}} v_n(k_n^{-i}, \theta_n^{-i})$, is second evaluation value subject to subtracting the sum denoted ∑ sigma symbol, read in light of instant specification [0021,24]. Moissinac shows multi-agent RL Fig 2.3 and Eq2.7 establishes world with and without agent ‘i’, absence is noted “where i does not exist” this is reflected absence of superscript of Eq3.4 which equates to Eq3.5 being same as Eq2.7. The aspect of before and after is such that sequential learning is performed in multiagent reinforcement learning over time-steps [P.12 Sect2.3.1]}.
A person having ordinary skill in the art would have considered it obvious prior to the effective filing date to specify evaluation values according to the teachings of Moissinac in combination for motivation as already applied in claim 1 and further as “The VCG framework and the DU framework are somewhat distant field, so the notations have been unified” [P.6 ¶2].
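As a reading aid for the claim 2 language, the two evaluation values can be sketched as "amounts of increase", each being a sum of evaluation values after the other agents act minus the sum before they act. The helper names `increase` and `evaluation_values`, the list-based inputs, and the sign of the returned difference (second minus first) are assumptions made for this example and are not taken from the claims, the specification, or the cited references.

```python
from typing import Sequence, Tuple


def increase(before: Sequence[float], after: Sequence[float]) -> float:
    """Amount of increase: sum of values after acting minus sum before acting."""
    return sum(after) - sum(before)


def evaluation_values(
    before_present: Sequence[float],
    after_present: Sequence[float],
    before_absent: Sequence[float],
    after_absent: Sequence[float],
) -> Tuple[float, float, float]:
    """Return (first value, second value, difference between them).

    The first value uses the other agents' evaluations with the target agent
    present; the second uses their evaluations with the target agent absent.
    The difference is taken as second minus first (an assumed sign convention).
    """
    first = increase(before_present, after_present)
    second = increase(before_absent, after_absent)
    return first, second, second - first
```

A positive difference in this sketch means the other agents would have gained more without the target agent.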
Claims 4-5 (Cancelled).
With respect to claim 6, the rejection of claim 1 is incorporated. The difference in scope is that claim 6 is drawn to a method for performing the limitations of system claim 1. Hu discloses [0046] “method for cooperative multi-goal, multi-agent, multi-stage reinforcement learning” methods comprising CM3 algorithm shown Fig 5 and [0252] “techniques presented herein”. The remainder of this claim is rejected for the same rationale as claim 1.
Claims 7-10 (Cancelled).
Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Hu and Moissinac in view of Yu et al., “Collective Learning for the Emergence of Social Norms in Networked Multiagent Systems” hereinafter Yu, and further in view of Wang et al., “Shapley Q-value: A Local Reward Approach to Solve Global Reward Games” hereinafter Wang (arXiv: 1907.05707v4).
With respect to claim 3, the combination of Hu and Moissinac teaches the learning system according to claim 1 wherein the processor is further configured to execute the program stored on the memory to cause the learning system to perform, but does not disclose the following limitations. Yu teaches wherein the reward granting unit performs:
a fourth step of causing the plurality of agents to perform weighted voting relating to whether to perform a cooperative action {Yu [P.2347 Sect.B ¶2-5] “Weighted voting” Eq.6 details performance of weighted voting, the weighting comprises “weight of each agent” and considers “interacting agents” actions of agents are described with reinforcement learning and rewards, introducing [P.2342 ¶1] “cooperation among distributed agents to achieve a consensus in the whole system”}; and
Further Hu discloses discounting reward [0172,0223] and presence/absence of agents [0081,62].
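To make the notion of weighted voting concrete, a generic tally is sketched below. It is not Yu's Eq.6; the function name `weighted_vote`, the yes/no ballot encoding, the `exclude` parameter, and the rule that ties count as rejection are all assumptions introduced for illustration.

```python
from typing import Dict, Optional


def weighted_vote(
    votes: Dict[str, bool],
    weights: Dict[str, float],
    exclude: Optional[str] = None,
) -> bool:
    """Return True if the weighted 'yes' total strictly exceeds the 'no' total.

    `votes[agent]` is True for a vote in favor of the cooperative action.
    `exclude` names an agent whose ballot is ignored, so the same tally can be
    taken with and without a given agent. Ties count as rejection (assumed).
    """
    yes = sum(weights[a] for a, v in votes.items() if v and a != exclude)
    no = sum(weights[a] for a, v in votes.items() if not v and a != exclude)
    return yes > no
```

Calling the function once with all ballots and once with `exclude` set to a particular agent shows whether that agent's ballot changes the result.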
However, the combination of Hu and Yu does not prima facie disclose that reduced reward is subject to a condition of the following limitation.
Wang teaches:
a fifth step of, when a result of voting obtained in the absence of the target agent overturns, in a manner attributable to statistical analysis of voting among agents, a result of voting in the presence of the target agent, reducing a reward to be granted to the target agent by an amount of reward determined based on the result of voting in the absence of the target agent {Wang discloses [P.5 Sect5.1 ¶1] “reward will be reduced by 1” or [P.6 Last2¶] “reward will be reduced by 10” where 1 or 10 are scalar amounts of reduction in reward and describes agents such as cars in traffic or navigation settings. Mathematically, [P.3 Sect4.1] “$\sum \gamma^{t-1} r_t(C)$” where $\gamma$ is the discount factor for the reward $r$. The condition of when voting result in absence of agent overturns result of voting in presence of agent is interpreted as satisfied by Shapley value approximation which considers the marginal contribution of each agent and may “mask the actions of irrelevant agents” considering “every possible combination of coalition C and agent i” [P.4 Last2¶] }.
Wang is directed to multi-agent reinforcement learning thus being analogous. A person having ordinary skill in the art would have considered it obvious prior to the effective filing date to reduce reward using the techniques of Wang in combination to arrive at the invention as claimed for the stated motivation “our motivation of importing the local reward approach (e.g., Shapley value) to solve out the credit assignment problems” [P.7 ¶2] and “ensures a reasonable payoff distribution and inspires our work on credit assignment in MARL” [P.2 Last¶].
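The marginal-contribution idea behind the Shapley value can be illustrated with an exact enumeration over agent orderings. This is the textbook definition computed by brute force, not Wang's approximation method; the function name `shapley_values` and the coalition value function passed to it are hypothetical.

```python
from itertools import permutations
from typing import Callable, Dict, FrozenSet, Sequence


def shapley_values(
    agents: Sequence[str],
    value: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Exact Shapley value of each agent for a coalition value function.

    Averages each agent's marginal contribution value(C + agent) - value(C)
    over every ordering in which agents can join the coalition. The cost grows
    factorially with the number of agents, so this suits small examples only.
    """
    totals = {a: 0.0 for a in agents}
    orders = list(permutations(agents))
    for order in orders:
        coalition: FrozenSet[str] = frozenset()
        for a in order:
            joined = coalition | {a}
            totals[a] += value(joined) - value(coalition)
            coalition = joined
    return {a: total / len(orders) for a, total in totals.items()}
```

In a game where a coalition earns 1 only when it contains both "a" and "b", those two agents split the credit equally and any third agent receives none.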
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Chase P Hinckley whose telephone number is (571)272-7935. The examiner can normally be reached M-F 9:00 - 5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda M. Huang can be reached at 571-270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/CHASE P. HINCKLEY/Examiner, Art Unit 2124
1 Matsunami et al., “Reward Design for Multi-Agent Reinforcement Learning with a Penalty Based on the Payment Mechanism” (Instant Inventor) at [P.3 Sect. 2.3-3] Vickrey-Clarke-Groves (VCG) mechanism for integrating in combination with MARL multi-agent reinforcement learning, and [P.4 Last Par.] maximizes reward.
2 Scheible, Dominik Ph.D. attorney remarks to EPO at [P.3 Par.3-4] VCG highlighted, appears to be repeated as per instant application remarks 01/02/26 at [P.31 of 32, Par.1-2].