Prosecution Insights
Last updated: May 29, 2026
Application No. 17/798,141

LEARNING SYSTEM, LEARNING METHOD, AND LEARNING PROGRAM

Status: Non-Final OA (§103, §112)
Filed: Aug 08, 2022
Priority: Feb 07, 2020 (JP 2020-019844, +1 more)
Examiner: HINCKLEY, CHASE PAUL
Art Unit: 2124
Tech Center: 2100 (Computer Architecture & Software)
Assignee: Nagoya Institute Of Technology
OA Round: 3 (Non-Final)
Grant Probability: 68% (Favorable)
OA Rounds: 3-4
Est. Remaining: 1m
With Interview: 78%

Examiner Intelligence

Grants 68% (above average)
Career Allowance Rate: 68% (137 granted / 201 resolved), +13.2% vs TC avg
Interview Lift: +10.1% (moderate lift, resolved cases with interview)
Typical timeline: 3y 11m avg prosecution, 11 currently pending
Career history: 217 total applications across all art units

Statute-Specific Performance

§101: 1.1% (-38.9% vs TC avg)
§103: 94.4% (+54.4% vs TC avg)
§102: 3.5% (-36.5% vs TC avg)
§112: 0.7% (-39.3% vs TC avg)
Tech Center averages are estimates. Based on career data from 201 resolved cases.

Office Action

§103 §112
DETAILED ACTION

This non-final office action is responsive to application 17/798,141 with applicant’s amendments and request for reconsideration with RCE submitted on 04 May 2026. Claims 1-3 and 6 are currently pending; claims 1 and 6, which are the independent claims, are amended; claims 4-5 and 7-10 are cancelled. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114

A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 05/04/26 has been entered.

Response to Remarks

Applicant’s responsive remarks filed 05/04/26 are considered together with the amendments in light of the outstanding issues, which are addressed as follows.

The objection to the specification is withdrawn upon further consideration of the prior correction. The claim interpretation of functional language is withdrawn; the claims are read as plain language under the standard of broadest reasonable interpretation.

The rejection under 35 U.S.C. 101 (abstract idea) is hereby withdrawn. In particular, eligibility is considered satisfied at Step 2A, Prong 2 (practical application) by the amendment requiring a real mobile body for implementing the cooperative action of a reinforcement learning agent. Remarks at [P.10 of 17 Last¶] persuasively support functionality that enables performance of coordinated movements, i.e. actions. The preponderance of the facts favors a finding of eligibility and therefore the rejection is withdrawn.

The rejection under 35 U.S.C. 103 (obviousness) is maintained over the prior combination of prior art.
Applicant’s remarks point to the amendments and traverse on the basis that Hu does not teach the amended limitation of the third step calculating a difference. However, Hu was not the reference applied to this limitation, and the remarks to date still do not substantively address Moissinac. Moissinac is critical because it teaches the Vickrey-Clarke-Groves (VCG) mechanism in the context of multi-agent reinforcement learning to arrive at the calculated difference. Simply put, the combination as a whole amounts to applying the VCG mechanism of Moissinac to Hu’s multi-agent reinforcement learning (MARL) for cooperative actions. The additional language of “in such a way” conveys a desired effect as opposed to detailing how the reward is prevented from increasing in a technical manner. This aspect is interpreted as being satisfied by maximizing reward and/or conditions for negative payments/penalty, as detailed below with respect to Moissinac, which indexes agents with subscripts in its technical solution. Concerning Hu, agents are assigned rewards, e.g. [0218,240], for executing actions in cooperative multi-agent reinforcement learning (CM3 algorithm, Fig 5). Further, Hu provides real mobile bodies as vehicles for autonomous driving (Figs 1-2, 11, described at [0059]). The ordinarily skilled artisan would likely recognize Hu as assigned to Honda, an automotive manufacturer. Accordingly, the additional language does not provide a clear distinction from the prior art. Therefore, the rejection is maintained as obvious over Hu and Moissinac. An updated rejection is detailed below.

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b): (b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

Claims 1-3 and 6 are rejected under 35 U.S.C. 112(b) as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor regards as the invention. Particularly, claims 1 and 6 recite the limitation "the agent" in the amendments. The antecedent basis of “the” lacks clarity because “the agent” could be either, or all, of the “target agent” or “other agents” recited in a claim drawn to a (multi-)agent system for reinforcement learning. Thus it is not apparent which agent is being referred to. Additionally, the similar pre-amendment limitation “acquire a state of the agent” should also properly introduce an agent, or each agent of the plurality of agents, of the independent claims. The remaining claims depend from the identified language and fail to cure the deficiency. Accordingly, claims 1-3 and 6 are rejected under 35 U.S.C. 112(b).

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows: 1. Determining the scope and contents of the prior art. 2. Ascertaining the differences between the prior art and the claims at issue. 3. Resolving the level of ordinary skill in the pertinent art. 4. Considering objective evidence present in the application indicating obviousness or nonobviousness. Claims 1-2 and 6 are rejected under 35 U.S.C. 103 as being unpatentable over: Hu et al., US PG Pub No 2019/0266489A1 (Honda, Tokyo) in view of Moissinac, Béatrice, “Reinforcement Learning-based Off-Equilibrium Incentives to Approximate the VCG Mechanism” hereinafter Moissinac (Thesis, OSU). With respect to claim 1, Hu teaches: A learning system for performing reinforcement learning of a cooperative action by a plurality of agents under a multi-agent system in which the plurality of agents perform the cooperative action, the learning system, that is implemented by a computer {Hu discloses [0013] “a system for cooperative multi-goal, multi-agent, multi-stage (CM3) reinforcement learning” Figs 1 and 21 is computer system for Fig 5A is reinforcement learning e.g. [0084] “multi-agent reinforcement learning (MARL)” where the agents comprise vehicles Fig 1:170,180 or shown Figs 2, 9, 14} comprising: the plurality of agents {Hu [0004] “the N number of agents in the multi-agent environment” e.g. Fig 3:304 “Train number of N agents to learn cooperation between agents in multi-agent environment”}; and a non-transitory memory configured to store a program {Hu [0045] “non-transitory computer-readable storage medium storing computer-executable instructions” e.g. 
RAM [0036] or shown Figs 1:104 and 21:2118 “memory” for executing software programs [0253-58]}, a processor configured to execute the program stored on the memory to cause the learning system {Hu [0055] “Fig.1 …CM3 reinforcement learning may include a processor 102, memory 104” [0253-58]} to: grant a reward to the plurality of agents {Hu Fig 1:132 reward module for agent vehicles, granting is assigning [0231] “each agent may be assigned with different individual rewards” similar at [0240], 16:1608 “assigning negative reward to agents”}, wherein each of the agents includes a second non-transitory memory configured to store a program {Hu Fig 11:1104 “memory” of each agent vehicle corresponds to Fig 1:170/180 with storage device 174/184, [0036, 45]}; a second processor configured to execute the program stored on the memory to cause the learning system {Fig 11:1102 “processor” of each agent vehicle corresponds Fig 1:170/180, [0036,45]} to: acquire a state of the agent {Hu discloses [0223] “set of states S describing possible configurations of all agents” introduced [0069] “agents may be completely specified by a state st ∈ S, …next state st+1” where ∈ denotes set membership}; acquire the reward {Hu [0223] “Each agent may receive a reward” again at [0069], acquire is receive and the system Fig 1 provides communication interfaces 172/182 and 150, [0057-59]}; select an action based on the state and the reward by using a decision-making model for selecting the action {Hu [0062] “select one action from a set of available actions, which results in a new state and a new reward” Fig 2 shown as loop, decision-making model comprises MDP Markov decision policy, e.g. 
[0224-23] “policy that defines how actions are selected” similarly [0209], where [0240] “policy may be based on a state” and [0187-88] “policy based on the reward function” again [0164-65] “choose the ‘best’ action… return of an action (a) under policy (π) on state (s)”}; and execute the action {Hu Fig 5A “Execute action(s)” after sampling action i.e. selected [0224-23]. By example, [0183] “actions to include an autonomous vehicle driving maneuver of accelerating” or actions of deceleration, braking and lane shift introduced [0006]}, the processor is further configured to execute the program stored on the memory to cause the learning system to perform {Hu Fig 1:102 and 21:2114, [0055-56] “CM3 reinforcement learning may include a processor 102, a memory 104” [0252-53] “processor-executable instructions”} a first step of, in the presence of a target agent to which the reward is to be granted, calculating an evaluation value relating to a cooperative action of other agents as a first evaluation value {Hu [0088,087-90] Eq.1 $\sum \gamma^t R(s_t, a_t)$ “value function” again at [0226,225-228]; thereby [0081] “learning to cooperate in the presence of other agents” with evaluation comprising the R-reward [0049] which can be “individual reward” [0100,0231]. Fig 4 illustrates 2-stage curriculum learning that uses local information for single-agent setting (target agent) i.e. Fig 3:302 or [0205] “first agent is the only agent present”, and then a multi-agent setting with global data Fig 3:304. By example, see Fig 6 multi-car agents where speed is a calculated value}; a second step of, in the absence of the target agent, calculating an evaluation value relating to a cooperative action of the other agents as a second evaluation value {Hu [0062] “the absence of other agents may more reliably produce the desired state configurations for learning cooperative behavior” again [0205]. 
Notably, the second calculated value may comprise [0107-13] “joint reward $R_g(s, a, g) := \sum_{n=1}^{N} R(s, a^n, g^n)$ may be defined” for a joint or overall policy, which is described as a “weighted sum”, similarly [0231,36]. See also [0090] “address the issue of multi-agent credit assignment… contribution of an agent’s chosen action and versus the average of all possible counterfactual actions a, keeping other agent’s actions fixed” and [0152] “differentiate between evaluations of the Q-function for different agents”. See also [0217,244]}; and the target agent performs learning of the decision-making model based on the reward granted {Hu [0205] “perform training for a first agent based on a first policy …using a Markov decision process (MDP)”; learning is training, the Markov-decision-based policy is the model, and reward includes return [0223] “Markov games may be defined… total expected return $R^n = \sum_{t=0}^{T} \gamma^t r_t^n$”; simulation of the agent is described, Figs 2-3 show the high level and Figs 4-5 detail the technical implementation}; wherein the agent is a mobile body {Hu [0043] “An ‘agent’, as used herein, may refer to a ‘vehicle’” illustrated e.g. Figs 2, 11:170, 1:170/180}; the decision-making model of the agent is mounted on a real mobile body after accomplishment of the reinforcement learning, thereby implementing the cooperative action {Hu see Fig 1:140,170 shows policy network i.e. model, for vehicle 170 similar Fig 11, and described e.g. [0059] “CM3 policy network 140 is stored on the storage device of the vehicle, this enables the controller to autonomously drive the vehicle around based on the CM3 policy network 140, and to make autonomous driving decisions according to the CM3 reinforcement learning” detailed Fig 5A cooperative CM3 which executes actions}. Hu further discloses difference error [0227], subtractive functions throughout and penalties for agents as negative reward e.g. [0241], [0005]. 
However, Hu does not appear to disclose the following limitation, particularly when read in light of specification [0021]. Moissinac teaches: a third step of calculating a difference between the first evaluation value and the second evaluation value as a penalty of the target agent and calculating the reward to be granted to the target agent in such a way as to prevent only the reward granted to the target agent from increasing based on the penalty {Moissinac see [P.18] Eq 3.4 $u_i(\cdot) = \sum v(\cdot) - \sum v(\cdot)$ “difference reward D(i) is implemented in the same way as $u_i$ …Thus, the utility, valuation and payment of the agent are implemented the same was as with the VCG” (Vickrey-Clarke-Groves mechanism); the difference reward is applied in Algorithm 1 [P.16] “Compute D(i) (Equation 3.4)”, titled Reinforcement Learning Algorithm using Difference Utilities; the reinforcement learning is shown in Fig 2.2 and the multi-agent case in Fig 2.3, where a world (environment) with and without agent ‘i’ are a statistical difference (2.7); the penalty is [P.7 ¶1] “penalizing an agent’s utilities” in such a way that [P.9 ¶1] “If the net payment is negative (i.e. agent i had a positive effect on the system), it increases agent i’s utility”, which considers negative (as well as positive) payments for the agent subject to the subtractive function. Implementation may be performed by maximizing [P.19 ¶5 - P.20] “maximize their difference reward D(i) …maximization of D(i)”. In D(i), the indexing of agents provides for the target agent, e.g. individual agent i from a set membership n∈N of plural agents. The approach would be consistent with solutions described by applicant (see footnotes 1 and 2) and when read in light of specification [0017,30] and [0021]}, and Moissinac is directed to multiagent systems for reinforcement learning, thus being analogous. 
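As a rough illustration of the subtraction that the rejection maps to Moissinac Eq 3.4, the following Python sketch computes a difference reward D(i) as the summed valuations with agent i present minus the summed valuations of the other agents with agent i absent. The function names, the dictionary layout, and the numeric valuations are hypothetical; they are not taken from Moissinac, Hu, or the application, and the sketch only shows the shape of the arithmetic under those assumptions.

```python
from typing import Dict


def difference_reward(values_with: Dict[str, float],
                      values_without: Dict[str, float],
                      target: str) -> float:
    """D(i): total value with the target present minus the others' total with it absent."""
    total_with = sum(values_with.values())
    others_without = sum(v for a, v in values_without.items() if a != target)
    return total_with - others_without


def externality_penalty(values_with: Dict[str, float],
                        values_without: Dict[str, float],
                        target: str) -> float:
    """Penalty: value the other agents lose because the target is present."""
    others_with = sum(v for a, v in values_with.items() if a != target)
    others_without = sum(v for a, v in values_without.items() if a != target)
    return others_without - others_with


# Hypothetical valuations (assumed numbers, for illustration only).
values_with = {"a": 3.0, "b": 2.0, "c": 1.0}   # world in which agent "a" exists
values_without = {"b": 4.0, "c": 1.5}          # world in which agent "a" does not exist

d_a = difference_reward(values_with, values_without, "a")
penalty_a = externality_penalty(values_with, values_without, "a")
shaped_reward_a = values_with["a"] - penalty_a  # own value reduced by the penalty

print(d_a, penalty_a, shaped_reward_a)
```

In this toy case the target's own value minus the penalty equals D(i), which is the sense in which the difference reward and a VCG-style payment coincide; whether that matches the claimed third step is the point in dispute and is not settled by the sketch.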
A person having ordinary skill in the art would have considered it obvious prior to the effective filing date to shape reward using the technique of Moissinac in combination with Hu to arrive at the invention as claimed as applying a known technique to a known device ready for improvement to yield predictable results which is accomplished by using “a well-known solution to solve resource allocation problems with multiple items and multiple agents” [P.7 ¶1] and/or stated motivation [cont’d] “motivates a truthful participation from agents by penalizing an agent’s utilities with the cost of her distortion to the system’s value” [P.7 ¶1] and which “produces a social-optimal outcome” [P.2 ¶3]. Technical contributions are noted [P.4 ¶1-2] and real world domains of navigation, robot coordination and air traffic control are disclosed per [P.14 ¶2]. With respect to claim 2, the combination of Hu and Moissinac teaches the learning system according to claim 1. Moissinac teaches wherein the first evaluation value corresponds to an amount of increase given by subtracting the sum of evaluation values relating to a cooperative action before the other agents perform an action in the presence of the target agent from the sum of evaluation values relating to a cooperative action after the other agents perform the action in the presence of the target agent {Moissinac [P.18] Eq 3.4 where term on the left, i.e. ∑n∈N vn(kn, θn), is first evaluation value subject to subtracting the sum denoted ∑ sigma symbol, read in light of instant specification [0021,23]. Moissinac shows multi-agent RL Fig 2.3 and Eq2.7 establishes world with and without agent ‘i’, presence is described as being “in which i exists” this is reflected in the superscript of Eq3.4 which equates to Eq3.5 being same as Eq2.7. 
Further, the aspect of before and after is such that sequential learning is performed in multiagent reinforcement learning over time-steps [P.12 Sect2.3.1]}, and the second evaluation value corresponds to an amount of increase given by subtracting the sum of evaluation values relating to a cooperative action before the other agents perform actions in the absence of the target agent from the sum of evaluation values relating to a cooperative action after the other agents perform the actions in the absence of the target agent {Moissinac [P.18] Eq 3.4 where term on the right, i.e. ∑n∈N-i vn(k-in, θ-in), is second evaluation value subject to subtracting the sum denoted ∑ sigma symbol, read in light of instant specification [0021,24]. Moissinac shows multi-agent RL Fig 2.3 and Eq2.7 establishes world with and without agent ‘i’, absence is noted “where i does not exist” this is reflected absence of superscript of Eq3.4 which equates to Eq3.5 being same as Eq2.7. The aspect of before and after is such that sequential learning is performed in multiagent reinforcement learning over time-steps [P.12 Sect2.3.1]}. A person having ordinary skill in the art would have considered it obvious prior to the effective filing date to specify evaluation values according to the teachings of Moissinac in combination for motivation as already applied in claim 1 and further as “The VCG framework and the DU framework are somewhat distant field, so the notations have been unified” [P.6 ¶2]. Claims 4-5 (Cancelled). With respect to claim 6, the rejection of claim 1 is incorporated. The difference in scope being a method for performing limitations of system claim 1. Hu discloses [0046] “method for cooperative multi-goal, multi-agent, multi-stage reinforcement learning” methods comprising CM3 algorithm shown Fig 5 and [0252] “techniques presented herein”. The remainder of this claim is rejected for the same rationale as claim 1. Claims 7-10 (Cancelled). Claim 3 is rejected under 35 U.S.C. 
103 as being unpatentable over Hu and Moissinac in view of Yu et al., “Collective Learning for the Emergence of Social Norms in Networks Multiagent Systems” herein after Yu, and further in view of Wang et al., “Shapley Q-value: A Local Reward Approach to Solve Global Reward Games” hereinafter Wang (arXiv: 1907.05707v4). With respect to claim 3, the combination of Hu and Moissinac teaches the learning system according to claim 1 wherein the processor is further configured to execute the program stored on the memory to cause the learning system to perform, but does not disclose the following limitations. Yu teaches wherein the reward granting unit performs: a fourth step of causing the plurality of agents to perform weighted voting relating to whether to perform a cooperative action {Yu [P.2347 Sect.B ¶2-5] “Weighted voting” Eq.6 details performance of weighted voting, the weighting comprises “weight of each agent” and considers “interacting agents” actions of agents are described with reinforcement learning and rewards, introducing [P.2342 ¶1] “cooperation among distributed agents to achieve a consensus in the whole system”}; and Further Hu discloses discounting reward [0172,0223] and presence/absence of agents [0081,62]. However, the combination Hu and Yu does not prima facie disclose that reduced reward is subject to a condition of the following limitation. 
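For orientation only, weighted voting among agents of the kind the rejection attributes to Yu can be sketched as follows in Python. The function name, the vote labels, the weights, and the tie-breaking rule are assumptions made for this illustration and do not reproduce Yu's Eq. 6.

```python
from collections import defaultdict
from typing import Dict


def weighted_vote(votes: Dict[str, str], weights: Dict[str, float]) -> str:
    """Return the option with the largest summed agent weight.

    Agents missing from `weights` get weight 1.0 (an assumption of this sketch).
    Ties go to the alphabetically first option, an arbitrary but deterministic choice.
    """
    tally = defaultdict(float)
    for agent, choice in votes.items():
        tally[choice] += weights.get(agent, 1.0)
    return max(sorted(tally), key=lambda option: tally[option])


# Hypothetical agents voting on whether to perform a cooperative action.
votes = {"a": "cooperate", "b": "defect", "c": "cooperate"}
weights = {"a": 0.5, "b": 2.0, "c": 1.0}

result = weighted_vote(votes, weights)
print(result)
```

Here two agents vote to cooperate with a combined weight of 1.5, while a single agent with weight 2.0 votes to defect, so the weighted outcome differs from a simple head count.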
Wang teaches: a fifth step of, when a result of voting obtained in the absence of the target agent overturns, in a manner attributable to statistical analysis of voting among agents, a result of voting in the presence of the target agent, reducing a reward to be granted to the target agent by an amount of reward determined based on the result of voting in the absence of the target agent {Wang discloses [P.5 Sect5.1 ¶1] “reward will be reduced by 1” or [P.6 Last2¶] “reward will be reduced by 10” where 1 or 10 are scalar amounts of reduction in reward, and describes agents such as cars in traffic or navigation settings. Mathematically, [P.3 Sect4.1] “$\sum \gamma^{t-1} r_t(C)$” where γ is the discount factor for the r-reward. The condition that the voting result in the absence of the agent overturns the voting result in the presence of the agent is interpreted as satisfied by the Shapley value approximation, which considers the marginal contribution of each agent and may “mask the actions of irrelevant agents” considering “every possible combination of coalition C and agent i” [P.4 Last2¶]}. Wang is directed to multi-agent reinforcement learning, thus being analogous. A person having ordinary skill in the art would have considered it obvious prior to the effective filing date to reduce reward using the techniques of Wang in combination to arrive at the invention as claimed for the stated motivation “our motivation of importing the local reward approach (e.g., Shapley value) to solve out the credit assignment problems” [P.7 ¶2] and “ensures a reasonable payoff distribution and inspires our work on credit assignment in MARL” [P.2 Last¶].

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Chase P Hinckley whose telephone number is (571) 272-7935. The examiner can normally be reached M-F 9:00 - 5:00. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. 
To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda M. Huang, can be reached at 571-270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/CHASE P. HINCKLEY/
Examiner, Art Unit 2124

1. Matsunami et al., “Reward Design for Multi-Agent Reinforcement Learning with a Penalty Based on the Payment Mechanism” (Instant Inventor) at [P.3 Sect. 2.3-3] Vickrey-Clarke-Groves (VCG) mechanism for integrating in combination with MARL multi-agent reinforcement learning, and [P.4 Last Par.] maximizes reward.
2. Scheible, Dominik Ph.D., attorney remarks to EPO at [P.3 Par.3-4] VCG highlighted, appears to be repeated as per instant application remarks 01/02/26 at [P.31 of 32, Par.1-2].

Prosecution Timeline

Aug 08, 2022: Application Filed
Oct 02, 2025: Non-Final Rejection mailed (§103, §112)
Jan 02, 2026: Response Filed
Feb 02, 2026: Final Rejection mailed (§103, §112)
May 04, 2026: Request for Continued Examination
May 05, 2026: Response after Non-Final Action
May 15, 2026: Non-Final Rejection mailed (§103, §112) (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12639499
INFORMATION PROCESSING SYSTEM, COMPUTER SYSTEM, INFORMATION PROCESSING METHOD, AND PROGRAM
4y 8m to grant Granted May 26, 2026
Patent 12639122
PROCESSING COMPUTATIONAL GRAPHS
2y 9m to grant Granted May 26, 2026
Patent 12614081
FRACTAL COGNITIVE COMPUTING NODE, COMPUTER-IMPLEMENTED METHOD FOR LEARNING PROCEDURES, COMPUTATIONAL COGNITION CLUSTER AND COMPUTATIONAL COGNITION ARCHITECTURE
4y 9m to grant Granted Apr 28, 2026
Patent 12608444
AUTOMATED SELECTION OF PRINCIPAL COMPONENT ANALYSIS VARIANTS FOR LARGE-SCALE DATASETS
3y 8m to grant Granted Apr 21, 2026
Patent 12608585
NEURAL NETWORKS FOR SELECTING ACTIONS TO BE PERFORMED BY A ROBOTIC AGENT
3y 9m to grant Granted Apr 21, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 68%
With Interview: 78% (+10.1%)
Median Time to Grant: 3y 11m (~1m remaining)
PTA Risk: High
Based on 201 resolved cases by this examiner. Grant probability derived from career allowance rate.
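The headline figures can be checked with simple arithmetic. The short Python sketch below recomputes the career allowance rate from the 137 granted and 201 resolved cases reported on this page, and then adds the +10.1 interview lift. Treating the lift as an additive number of percentage points is an assumption of this sketch; the page does not state how the with-interview figure is derived, and the function name is hypothetical.

```python
def allowance_rate(granted: int, resolved: int) -> float:
    """Career allowance rate as a percentage of resolved cases."""
    if resolved <= 0:
        raise ValueError("resolved must be positive")
    return 100.0 * granted / resolved


base = allowance_rate(137, 201)   # figures reported on this page
with_interview = base + 10.1      # assumption: lift is additive percentage points

print(round(base), round(with_interview))
```

Under that assumption the rounded values agree with the 68% and 78% shown above.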
