DETAILED ACTION
This final Office action is responsive to application 17/798,141 in view of applicant’s amendments submitted 02 Jan. 2026.
Claims 1-3 and 5-6 are currently pending and under examination, of which claims 1 and 6 are independent. Claims 1, 3, and 6 are amended; claims 4 and 7-10 are canceled.
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Remarks
Applicant’s proposed new title is acceptable, and the objection will be withdrawn upon filing of an amended specification or a new ADS pursuant to 37 CFR 1.72 and 1.76. However, because the proposed change is contained only in the remarks, the objection is maintained until entry of a marked-up specification or an updated application data sheet (ADS).
The objection to claims 4, 7 and 10 is withdrawn in view of the cancellation of the affected claims.
Claim interpretation under 35 U.S.C. 112(f) is withdrawn in part: the amendments overcome the functional interpretation of the system claims but not the method claims. The examiner acknowledges applicant’s stated intent to avoid 112(f). However, the method claims include “step of” language, which invokes functional interpretation in view of MPEP 2181. This “step of” language could be stricken to avoid 112(f).
The rejection of claims 3-4 and 7 under 35 U.S.C. 112(b) is hereby withdrawn. Claims 4 and 7 are cancelled, and claim 3 is amended to incorporate the suggested language, addressing the issue.
The rejection under 35 U.S.C. 101 as being directed to an abstract idea without significantly more is maintained. Applicant’s arguments regarding eligibility have been considered and are not persuasive. Applicant first traverses at prong one, asserting that the claimed calculating cannot be an abstract idea (pointing out the mathematical equation used to calculate) because the human mind is not equipped to add, subtract, and multiply, and further citing Examples 38 and 41, the memorandum, the MPEP, and extensive portions of the specification. However, calculations are explicitly an abstract idea under MPEP 2106.04(a)(2): “A claim that recites a mathematical calculation, when the claim is given its broadest reasonable interpretation in light of the specification, will be considered as falling within the ‘mathematical concepts’ grouping,” with examples including (ii) and (vi), “calculating.” The broadest reasonable interpretation of the claimed calculating, in light of the specification’s equation, is a mathematical calculation. Applicant states that “claim 1 of the instant application do not explicitly recite a mathematical relationship, formula, or calculation.” However, claim 1 recites “calculating” no fewer than four times, once in each of the three principal steps. Example 38 is drawn to analog circuitry for an audio mixer, which bears no resemblance to the claims at hand. Example 41 involves cryptographic communications and affirms that an equation is a mathematical calculation; while the additional elements of Example 41 cure eligibility through cryptographic ciphertext, unlike the instant application, Example 41 does not suggest that the equation is not math. Concerning the memorandum, it is prefaced: “This memorandum is not intended to announce any new USPTO practice or procedure and is meant to be consistent with existing USPTO guidance.” Consistent with the existing guidance of MPEP 2106.04(a)(2), the recited calculating is not merely loosely based on math or something that somehow involves math; it is math, and it is most assuredly a mathematical calculation when read in light of the specification, as can be seen from the application’s PG Pub cover page.
Applicant further traverses at step 2A, prong two, asserting that a practical application is established by the asserted improvements, and citing Ex parte Smith, Ex parte Desjardins, and Recentive v. Fox among numerous other cases. The MPEP is meant to distill case law without discussing every decision. It is clear from the remarks [P.22] that the underlined emphasis is placed on steps 1-3 and the final limitations, which correspond to the abstract idea. Consistent with MPEP 2106.05(a)(II), “it is important to keep in mind that an improvement in the abstract idea itself is not an improvement in technology.” That is to say, an improvement to the idea is merely an embellishment of the idea itself and does not cure eligibility. Practical application concerns the additional elements, which in this case consist of generic computer components for implementing MARL (multi-agent reinforcement learning), a well-studied discipline that would be recognized by those with the relevant background; see, e.g., supplemental evidence Xu et al., “Voting-based Multi-Agent Reinforcement Learning,” arXiv:1907.01385v1, at [P.7 Alg.1]. Not only is MARL not new, neither is a solution based on the proposed VCG mechanism (Moissinac, below). It therefore follows that no inventive concept under step 2B can be distilled to substantiate applicant’s traversal from a technical perspective. Whether or not a further benefit could be crafted or surmised, applicant admits in a separate publication that the implementation employed the simulation tool Unity, which again is not new (Matsunami and Juliani, below). The matter simply shows that the skilled artisan is reasonably equipped with the requisite tools to perform an ordinary level of experimentation for optimizing, e.g., rewards, in a technical capacity.
Review of the cited cases does not reveal similar findings or an overlap of substantive material on which the instant application could turn. For example, Smith involves trading derivatives in financial markets; Desjardins, of the newer memorandum, involves a particular parameterization for training machine learning models (as opposed to decision-making models); and Recentive v. Fox indicates that generic training does not cure eligibility. The examiner notes that had the application further developed the RL agents into concrete embodiments, such as an autonomous vehicle navigation system, this might have served as a basis for overcoming the rejection. However, nothing presented suggests as much; rather, the reader is presented with a computer implementation for virtual simulation of MARL with reward optimization. Accordingly, the rejection is maintained.
The rejection under 35 U.S.C. 103 as being obvious over a combination of prior art is maintained. Applicant’s remarks regarding the prior art have been fully considered and are not persuasive.
Applicant appears to traverse on the grounds that Hu’s teachings differ concerning a definition of the second evaluation value, and concerning learning based on a single agent versus multiple agents among the stages of Hu. However, the examiner respectfully disagrees.
Hu proposes, Fig. 5A, “Cooperative…multi-agent reinforcement learning (CM3)” with equations specified in the disclosure [0088,0107], describes the “presence” and “absence” of other agents [0081,62], as well as rewards that include “individual rewards” and “joint rewards” [0231]. The claim language descriptively characterizing an agent as a target agent can read on a single agent considered within the framework of multi-agent reinforcement learning (CM3). Critically, the claim does not require what applicant refers to as the “definition” of the second evaluation value, which appears to be improperly imported from the specification. Equating a claim term with an exact equation is an overly narrow interpretation that is not commensurate with the metes and bounds afforded by the breadth of the claim. Review of the instant specification reveals no formal definitions or rigid requirement that the evaluation value must take the form of a specific equation. Rather, the claimed evaluation values are recited as relating to actions of agents. Hu discloses that evaluation by reward relates to actions for simulating agents [0049-48, 0170]. The approach addresses multi-agent credit assignment [0228] and assigns a penalty (i.e., a nuisance fee) or negative reward to agents [0217,0244]. Therefore, the balance of evidence favors a finding that Hu fairly teaches or suggests the claim limitations as indicated below.
Regarding further obviousness over Moissinac, applicant notes that the nuisance fee calculating is an application of the VCG (Vickrey-Clarke-Groves) mechanism and that Moissinac fails to cure the deficiencies. The examiner respectfully disagrees, as this is the very title and main contribution of Moissinac, who employs the VCG mechanism for reinforcement learning with multiagent systems. It is specifically equation 3.4 [P.18] which is applied in Algorithm 1 [P.16] and illustrated in Figs. 2.1 and 2.3. Applicant provides no technical reason resolving why the VCG mechanism of Moissinac is insufficient to meet the desired effects. As such, the conclusory statement does not establish patentability over the combination. In consideration of the foregoing, the arguments are not persuasive. The responses presented above support the rejections of independent claims 1 and 6 and the related dependent claims.
Specification
The specification is objected to because the original title of the invention is not descriptive. A new title is required that is clearly indicative of the invention to which the claims are directed. Applicant’s proposed title in the remarks filed 01/02/2026 is acceptable, and the objection will be withdrawn upon filing of an amended specification and/or updated ADS. See MPEP 606.01.
The specification at [0003,02] refers to Japanese application 2019-19040, which appears unrelated, being drawn to the field of crystal growth on substrates. The application number should be checked and corrected as necessary to confirm its relevance.
Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f), because the claim limitations use a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Such claim limitations are in Claim 6:
“a first step of, …, calculating an evaluation value”
“a second step of, …, calculating an evaluation value”
“a third step of calculating a difference”
Because these claim limitations are being interpreted under 35 U.S.C. 112(f), they are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
The corresponding structure, material, or acts are interpreted in light of the specification. The specification, US 2023/0071293 A1, discloses example embodiments at [0019,42] and includes descriptive language similar to that of the claims per [0038,49]. The first and second embodiments are sufficient to perform the first, second, and third steps identified above.
If applicant does not intend to have these limitation(s) interpreted under 35 U.S.C. 112(f), applicant may: (1) amend the claim limitations to avoid them being interpreted under 35 U.S.C. 112(f) (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitations recite sufficient structure to perform the claimed function so as to avoid them being interpreted under 35 U.S.C. 112(f).
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-3 and 5-6 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. In determining whether the claims are subject matter eligible, the examiner applies the guidance set forth under MPEP 2106. The response to remarks above is incorporated herein.
Step 1: Is the claim to a process, machine, manufacture, or composition of matter? Yes—all claims fall within one of the four statutory categories: claims 1-3 and 5 are a system/machine, and claim 6 is a method/process. Thus, the analysis should proceed per MPEP 2106.03.
Step 2A, prong one: Does the claim recite an abstract idea, law of nature, or natural phenomenon? Yes—the claims, under the broadest reasonable interpretation, recite an abstract idea. In this case, the claims fall within the enumerated groupings of abstract ideas, “Mathematical Concepts” and/or “Mental Processes,” but for the recitation of generic computer components. More particularly, the claims recite:
“a first step of, in the presence of a target agent to which the reward is to be granted, calculating an evaluation value relating to a cooperative action of other agents as a first evaluation value” (Mathematical Calculations)
“a second step of, in the absence of the target agent, calculating an evaluation value relating to a cooperative action of the other agents as a second evaluation value” (Mathematical Calculations)
“a third step of calculating a difference between the first evaluation value and the second evaluation value as a penalty of the target agent and calculating the reward to be granted to the target agent based on the penalty” (Mathematical Calculations)
“the target agent performs learning of the decision-making model based on the reward granted from the reward granting unit” (Mental evaluation or judgment based on math)
The focus of the claims concerns calculating and decision-making. Mathematical calculations are specifically enumerated as an abstract idea under MPEP 2106.04(a)(2)(C) and are clearly described by the instant specification at [0021], Eq. 1, Fig. 2. This calculating does not preclude mental performance, as it is a difference of summations that may entail a product. Addition, subtraction, and products are not impractical to perform in the human mind, and there is no rigorous criterion requiring a calculus-based gradient or partial derivative, a nonlinear activation, or the like (the learning is not deep learning via neural networks, deep Q-networks, or multi-layer perceptrons). Rather, the learning is cast as a decision-making model (e.g., Markov decision) using reward. Reward is the key concept, and it is common practice to perform reward shaping (optimization of reward functions by statistical analysis). The claimed calculating steps thus principally capture mathematical calculations and/or mental processes as the abstract idea set forth under MPEP 2106.04(a)(2).
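For illustration only, the difference-of-summations character of the claimed steps can be sketched as simple arithmetic; all names and values below are hypothetical and do not purport to reproduce the applicant’s Eq. 1:

```python
# Illustrative sketch only (hypothetical names/values): the penalty as a
# difference between summed evaluation values with the target agent
# present and with it absent, per the claim language characterized above.

def penalty(evals_with_target, evals_without_target):
    """Third step: difference between the first and second evaluation values."""
    return sum(evals_with_target) - sum(evals_without_target)

def granted_reward(base_reward, evals_with_target, evals_without_target):
    """One possible reading of calculating the reward 'based on the
    penalty': subtract the penalty from a base reward."""
    return base_reward - penalty(evals_with_target, evals_without_target)

# Addition, subtraction, and sums of this kind are practicable mentally:
print(granted_reward(5.0, [1.0, 2.0], [2.0, 2.5]))  # 5.0 - (3.0 - 4.5) = 6.5
```

Nothing in this arithmetic requires more than generic computation.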
Step 2A, prong two: Does the claim recite additional elements that integrate the judicial exception into a practical application? No—the judicial exception is not integrated into a practical application because the additional elements are as follows:
The claims set up MARL (multi-agent reinforcement learning) and recite generic computer elements:
“performing reinforcement learning of a cooperative action by a plurality of agents under a multi-agent system in which the plurality of agents perform the cooperative action” cont’d “comprising: the plurality of agents; and grant a reward to the plurality of agents, wherein each of the agents includes… acquire a state of the agent; acquire the reward; select an action based on the state and the reward by using a decision-making model for selecting the action; and execute the action selected” per MPEP 2106.05(h), generally linking the use of the judicial exception to a particular technological environment or field of use (noted as MARL)
“learning system, that is implemented by a computer, comprising: …non-transitory memory configured to store a program, a processor configured to execute the program stored on the memory” as well as “a second non-transitory memory configured to store a program; a second processor configured to execute the program stored on the memory” and additionally “the processor is further configured to execute the program stored on the memory” per MPEP 2106.05(f), mere use of a computer as a tool to perform an abstract idea
These additional elements constitute the balance of the claims, which is considered to be generally linking the use of the judicial exception to a particular technological environment or field of use under MPEP 2106.05(h), and use of a computer system as a tool to implement the abstract idea under MPEP 2106.05(f). Adding a plurality of non-transitory computer elements does not breathe life into the claim as a whole; it merely suggests that the agents are also computer implemented, which could be a simulation, e.g., using a computer to simulate virtual agents in a video game that employs multi-agent reinforcement learning. The agents do not convey a physical embodiment in a concrete real-world use case, and the actions are recited at a high level of generality that can be a simulation. No replay buffer is described, which is used in reinforcement learning to store prior runs as episodic memory. The additional elements do not meaningfully limit the claim in a manner rising above that which is already established in MARL, where agents receive reward from the environment using state-action pair or tuple data to draw samples from and perform interactions with other agents. As set forth per MPEP 2106.04(a)(2), a claim that requires a computer may still recite an abstract idea. Accordingly, the claims remain drawn to the abstract idea, and the additional elements fail to integrate the judicial exception into a practical application.
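For context only, the conventional MARL interaction loop referenced above (acquire state, acquire reward, select an action via a decision-making model, execute it) can be sketched generically; all classes and dynamics below are hypothetical toys, not the applicant’s implementation:

```python
import random

class ToyEnv:
    """Hypothetical environment: each agent holds a scalar state,
    and reward favors keeping that state near zero."""
    def __init__(self, n):
        self.state = [random.uniform(-1.0, 1.0) for _ in range(n)]
    def observe(self, i):
        return self.state[i]            # acquire a state of the agent
    def execute(self, i, action):
        self.state[i] += action         # execute the selected action
        return -abs(self.state[i])      # grant a reward

class ToyAgent:
    """Hypothetical agent: a trivial decision-making model."""
    def __init__(self):
        self.gain = 0.5
    def select(self, state, reward):
        return -self.gain * state       # select an action from state/reward
    def learn(self, reward):
        self.gain += 0.01 * (1.0 + reward)  # crude reward-driven update

env, agents = ToyEnv(3), [ToyAgent() for _ in range(3)]
for _ in range(20):
    for i, agent in enumerate(agents):
        s = env.observe(i)                        # acquire state
        r = env.execute(i, agent.select(s, 0.0))  # acquire reward
        agent.learn(r)                            # learn from granted reward
```

The loop is generic to the field; nothing in the claimed additional elements requires more.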
Step 2B: Does the claim recite additional elements that amount to significantly more than the judicial exception? No—the claims do not include additional elements that amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional elements are identified with respect to MPEP 2106.05 and do not demonstrate an inventive concept. Particularly, the additional elements are as follows:
The claims set up MARL (multi-agent reinforcement learning) and recite generic computer elements:
“performing reinforcement learning of a cooperative action by a plurality of agents under a multi-agent system in which the plurality of agents perform the cooperative action” cont’d “comprising: the plurality of agents; and grant a reward to the plurality of agents, wherein each of the agents includes… acquire a state of the agent; acquire the reward; select an action based on the state and the reward by using a decision-making model for selecting the action; and execute the action selected” per MPEP 2106.05(h), generally linking the use of the judicial exception to a particular technological environment or field of use (noted as MARL).
“learning system, that is implemented by a computer, comprising: …non-transitory memory configured to store a program, a processor configured to execute the program stored on the memory” as well as “a second non-transitory memory configured to store a program; a second processor configured to execute the program stored on the memory” and additionally “the processor is further configured to execute the program stored on the memory” per MPEP 2106.05(f), mere use of a computer as a tool to perform an abstract idea.
Significantly more is not established by the additional elements, which are considered to be generally linking the use of the judicial exception to a particular technological environment or field of use under MPEP 2106.05(h), and mere use of a computer as a tool to perform an abstract idea under MPEP 2106.05(f). More particularly, the units, as a computer, do not qualify as a particular machine under MPEP 2106.05(b); nor do the agents, which are required to have no physical form beyond generic computer elements. The general linking of the judicial exception to MARL as a particular technological environment or field of use is not new. To be sure, the inventors’ own words state, per Matsunami et al., “Agents that Learn to Vote for a Joint Action Through Multi-Agent Reinforcement Learning” at [P.832 Sect.V ¶1], “We use… the Unity ML-Agents Toolkit,” which is specifically found in Juliani et al., “Unity: A General Platform for Intelligent Agents,” arXiv:1809.02627v1, at [P.8 Sect.4] Fig. 2 and introduced at [P.6 Sect.3] Fig. 1. That is to say, only the work of others is conveyed by the additional elements, which therefore cannot constitute an inventive concept or cure eligibility by amounting to significantly more. Therefore, the claims remain drawn to the abstract idea, and the additional elements are insufficient to amount to significantly more.
Taken alone, the additional elements do not amount to significantly more than the above-identified judicial exception (the abstract idea). Looking at the limitations as an ordered combination adds nothing that is not already present when looking at the elements individually. There is no indication that the combination of elements improves the functioning of a computer or improves any other technology. Their collective functions merely provide conventional computer implementation.
For at least the foregoing reasons, the independent claims are found not patent eligible. This rejection applies to independent claims 1 and 6 as well as to dependent claims 2-3 and 5. The dependent claims are held to be patent ineligible under 35 U.S.C. 101 because the additional recited limitations fail to establish that the claims are not directed to an abstract idea, or that they include additional elements which integrate the judicial exception into a practical application or amount to significantly more.
Dependent claim 2 recites wherein the first evaluation value and the second evaluation value respectively correspond to amounts of increase given by subtracting sums of evaluation values. This is considered part of the abstract idea, being a mathematical calculation of subtracting summations. The specification [0021-24] describes this language in characterizing Equation 1. Further specifying the math to comprise subtracting summations merely embellishes the abstract idea.
Dependent claim 3 recites limitations of fourth and fifth steps that perform weighted voting and reduce reward, respectively. The fourth and fifth steps are considered part of the abstract idea, being mathematical calculation or mental evaluation. For example, [0038] “weight is imposed on the opinions of agree and disagree” is an opinion-based bias, and [0039] “imposes a tax of ‘2’ on the reward… given as ‘r=a-2’” is a subtractive difference. The steps are performed by a processor and memory, which are additional elements falling under MPEP 2106.05(f), mere use of a computer as a tool to perform the abstract idea, and which do not qualify as a particular machine under MPEP 2106.05(b). Therefore, the additional elements do not integrate the judicial exception into a practical application or amount to significantly more.
Dependent claim 5 recites wherein the agent is a mobile body. The limitation is considered an additional element generally linking the use of the judicial exception to a particular technological environment or field of use under MPEP 2106.05(h). A mobile body can exist within a video game simulation, such as Mario Kart or StarCraft, which have been explored as reinforcement learning environments. Example environments comprising mobile bodies are shown by Juliani’s Unity ML-Agents Toolkit at Fig. 3, such as “crawler” four-legged creatures, “walker” humanoid agents, or the “reacher” double-jointed arm. Such mobile bodies are not embedded hardware for autonomous driving. As such, the additional element is not found to meaningfully limit the claim in demonstration of an inventive concept, and it falls short of integrating the judicial exception into a practical application or amounting to significantly more.
Taken alone, the additional elements do not amount to significantly more than the above-identified judicial exception (the abstract idea). Looking at the limitations as an ordered combination adds nothing that is not already present when looking at the elements individually. There is no indication that the combination of elements improves the functioning of a computer or improves any other technology. Their collective functions merely provide conventional computer implementation.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-2 and 5-6 are rejected under 35 U.S.C. 103 as being unpatentable over:
Hu et al., US PG Pub No 2019/0266489A1 (Honda, Tokyo) in view of
Moissinac, Béatrice, “Reinforcement Learning-based Off-Equilibrium Incentives to Approximate the VCG Mechanism” (Thesis, OSU), hereinafter Moissinac.
With respect to claim 1, Hu teaches:
A learning system for performing reinforcement learning of a cooperative action by a plurality of agents under a multi-agent system in which the plurality of agents perform the cooperative action, the learning system, that is implemented by a computer {Hu discloses [0013] “a system for cooperative multi-goal, multi-agent, multi-stage (CM3) reinforcement learning”; Figs. 1 and 21 show the computer system, and Fig. 5A shows the reinforcement learning, e.g., [0084] “multi-agent reinforcement learning (MARL)”, where the agents comprise vehicles, Fig. 1:170,180, as shown in Figs. 2, 9, 14} comprising:
the plurality of agents {Hu [0004] “the N number of agents in the multi-agent environment” e.g. Fig 3:304 “Train number of N agents to learn cooperation between agents in multi-agent environment”}; and
a non-transitory memory configured to store a program {Hu [0045] “non-transitory computer-readable storage medium storing computer-executable instructions” e.g. RAM [0036] or shown Figs 1:104 and 21:2118 “memory” for executing software programs [0253-58]},
a processor configured to execute the program stored on the memory to cause the learning system {Hu [0055] “Fig.1 …CM3 reinforcement learning may include a processor 102, memory 104” [0253-58]} to:
grant a reward to the plurality of agents {Hu Fig. 1:132 reward module for agent vehicles; granting is assigning, [0231] “each agent may be assigned with different individual rewards”, similarly at [0240] and Fig. 16:1608 “assigning negative reward to agents”}, wherein each of the agents includes
a second non-transitory memory configured to store a program {Hu Fig 11:1104 “memory” of each agent vehicle corresponds to Fig 1:170/180 with storage device 174/184, [0036, 45]};
a second processor configured to execute the program stored on the memory to cause the learning system {Hu Fig. 11:1102 “processor” of each agent vehicle corresponds to Fig. 1:170/180, [0036,45]} to:
acquire a state of the agent {Hu discloses [0223] “set of states S describing possible configurations of all agents” introduced [0069] “agents may be completely specified by a state st ∈ S, …next state st+1” where ∈ denotes set membership};
acquire the reward {Hu [0223] “Each agent may receive a reward” again at [0069], acquire is receive and the system Fig 1 provides communication interfaces 172/182 and 150, [0057-59]};
select an action based on the state and the reward by using a decision-making model for selecting the action {Hu [0062] “select one action from a set of available actions, which results in a new state and a new reward”, shown as a loop in Fig. 2; the decision-making model comprises an MDP (Markov decision process) policy, e.g., [0224-23] “policy that defines how actions are selected”, similarly [0209], where [0240] “policy may be based on a state” and [0187-88] “policy based on the reward function”, again [0164-65] “choose the ‘best’ action… return of an action (a) under policy (π) on state (s)”}; and
execute the action {Hu Fig. 5A “Execute action(s)” after an action is sampled, i.e., selected [0224-23]. By example, [0183] “actions to include an autonomous vehicle driving maneuver of accelerating”, or the actions of deceleration, braking, and lane shift introduced at [0006]}, the processor is further configured to execute the program stored on the memory to cause the learning system to perform {Hu Fig. 1:102 and Fig. 21:2114, [0055-56] “CM3 reinforcement learning may include a processor 102, a memory 104”, [0252-53] “processor-executable instructions”}
a first step of, in the presence of a target agent to which the reward is to be granted, calculating an evaluation value relating to a cooperative action of other agents as a first evaluation value {Hu [0088,087-90] Eq. 1, the “value function” ∑_t γ^t R(s_t, a_t), again at [0226,225-28]; thereby [0081] “learning to cooperate in the presence of other agents”, with the evaluation comprising the R-reward [0049], which can be an “individual reward” [0100,0231]. Fig. 4 illustrates two-stage curriculum learning that uses local information for a single-agent setting (target agent), i.e., Fig. 3:302 or [0205] “first agent is the only agent present”, and then a multi-agent setting with global data, Fig. 3:304. By example, see Fig. 6, multi-car agents where speed is a calculated value};
a second step of, in the absence of the target agent, calculating an evaluation value relating to a cooperative action of the other agents as a second evaluation value {Hu [0062] “the absence of other agents may more reliably produce the desired state configurations for learning cooperative behavior”, again [0205]. Notably, the second calculated value may comprise the [0107-13] “joint reward” R_g(s,a,g) := ∑_{n=1}^{N} R(s, a^n, g^n), which “may be defined” for a joint or overall policy and is described as a “weighted sum”, similarly [0231,36]. See also [0090] “address the issue of multi-agent credit assignment… contribution of an agent’s chosen action versus the average of all possible counterfactual actions a, keeping other agent’s actions fixed” and [0152] “differentiate between evaluations of the Q-function for different agents”}; and
the target agent performs learning of the decision-making model based on the reward granted {Hu [0205] “perform training for a first agent based on a first policy …using a Markov decision process (MDP)”; learning is training, the Markov-decision-based policy is the model, and the reward includes return, [0223] “Markov games may be defined… total expected return” R^n = ∑_{t=0}^{T} γ^t r_t^n. Simulation of agents is described; Figs. 2-3 show the high level and Figs. 4-5 detail the technical implementation}.
Hu further discloses difference error [0227], subtractive functions throughout and penalties for agents as negative reward e.g. [0241], [0005].
However, Hu does not prima facie disclose the following limitation, particularly when read in light of the specification at [0021].
Moissinac teaches:
a third step of calculating a difference between the first evaluation value and the second evaluation value as a penalty of the target agent and calculating the reward to be granted to the target agent based on the penalty {Moissinac [P.18] Eq. 3.4, u_i(·) = ∑_{n∈N} v_n(k_n, θ_n) − ∑_{n∈N−i} v_n(k_n^{−i}, θ_n^{−i}), which is applied in Algorithm 1 [P.16] “Compute D(i) (Equation 3.4)”, i.e., reinforcement learning using a difference: [P.18] “difference reward D(i) is implemented in the same way as ui …Thus, the utility, valuation and payment of the agent are implemented the same way as with the VCG”. The VCG (Vickrey-Clarke-Groves) mechanism is equated as the solution per the instant inventors, Matsunami et al., “Reward Design for Multi-Agent Reinforcement Learning with a Penalty Based on the Payment Mechanism”, at [P.3 Sect2.3-3], and/or attorney remarks to the EPO per Scheible at [P.3 ¶3-4]. The instant specification [0021] details such a function for implementation. Additionally, Moissinac illustrates reinforcement learning in Fig. 2.2, with the multiagent case shown in Fig. 2.3, where a world (environment) with and without agent ‘i’ is considered as a difference reward; the penalty is [P.7 ¶1] “penalizing an agent’s utilities” and/or a negative payment for the agent [P.9 ¶1] subject to the subtractive function}, and
Moissinac is directed to multiagent systems for reinforcement learning and is thus analogous art. A person having ordinary skill in the art would have considered it obvious prior to the effective filing date to shape reward using the technique of Moissinac in combination with Hu to arrive at the invention as claimed, as the application of a known technique to a known device ready for improvement to yield predictable results, which is accomplished by using “a well-known solution to solve resource allocation problems with multiple items and multiple agents” [P.7 ¶1], and/or per the stated motivation [cont’d] “motivates a truthful participation from agents by penalizing an agent’s utilities with the cost of her distortion to the system’s value” [P.7 ¶1], and which “produces a social-optimal outcome” [P.2 ¶3]. Technical contributions are noted at [P.4 ¶1-2], and real-world domains of navigation, robot coordination, and air traffic control are disclosed per [P.14 ¶2].
With respect to claim 2, the combination of Hu and Moissinac teaches the learning system according to claim 1, wherein
the first evaluation value corresponds to an amount of increase given by subtracting the sum of evaluation values relating to a cooperative action before the other agents perform an action in the presence of the target agent from the sum of evaluation values relating to a cooperative action after the other agents perform the action in the presence of the target agent {Moissinac [P.18] Eq. 3.4, where the term on the left, i.e., ∑_{n∈N} v_n(k_n, θ_n), is the first evaluation value subject to subtracting the sum denoted by the ∑ sigma symbol, read in light of the instant specification [0021,23]. Moissinac shows multi-agent RL in Fig. 2.3, and Eq. 2.7 establishes a world with and without agent ‘i’; presence is described as being “in which i exists”, which is reflected in the absence of the −i superscript in Eq. 3.4, which equates to Eq. 3.5, being the same as Eq. 2.7. Further, the aspect of before and after is such that sequential learning is performed in multiagent reinforcement learning over time-steps [P.12 Sect2.3.1]}, and
the second evaluation value corresponds to an amount of increase given by subtracting the sum of evaluation values relating to a cooperative action before the other agents perform actions in the absence of the target agent from the sum of evaluation values relating to a cooperative action after the other agents perform the actions in the absence of the target agent {Moissinac [P.18] Eq. 3.4, where the term on the right, i.e., ∑_{n∈N−i} v_n(k_n^{−i}, θ_n^{−i}), is the second evaluation value subject to subtracting the sum denoted by the ∑ sigma symbol, read in light of the instant specification [0021,24]. Moissinac shows multi-agent RL in Fig. 2.3, and Eq. 2.7 establishes a world with and without agent ‘i’; absence is noted “where i does not exist”, which is reflected in the −i superscript of Eq. 3.4, which equates to Eq. 3.5, being the same as Eq. 2.7. The aspect of before and after is such that sequential learning is performed in multiagent reinforcement learning over time-steps [P.12 Sect2.3.1]}.
A person having ordinary skill in the art would have considered it obvious prior to the effective filing date to specify the evaluation values according to the teachings of Moissinac in combination, for the motivation already applied in claim 1, and further because “The VCG framework and the DU framework are somewhat distant field, so the notations have been unified” [P.6 ¶2].
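For readability, the cited difference reward of Moissinac Eq. 3.4, transcribed from the citations above in the notation of record, is:

```latex
u_i(\cdot) \;=\;
\underbrace{\sum_{n \in N} v_n(k_n, \theta_n)}_{\text{first evaluation value ($i$ present)}}
\;-\;
\underbrace{\sum_{n \in N - i} v_n\!\left(k_n^{-i}, \theta_n^{-i}\right)}_{\text{second evaluation value ($i$ absent)}}
```

where the −i superscript denotes the world in which agent i does not exist, consistent with the mappings applied to claims 1 and 2 above.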
Claim 4 (Cancelled).
With respect to claim 5, the combination of Hu and Moissinac teaches the learning system according to claim 1, wherein
the agent is a mobile body {Hu [0043] “An ‘agent’, as used herein, may refer to a ‘vehicle’” illustrated e.g. Fig 2}.
With respect to claim 6, the rejection of claim 1 is incorporated. The difference in scope is that claim 6 is a method for performing the limitations of system claim 1. Hu discloses [0046] a “method for cooperative multi-goal, multi-agent, multi-stage reinforcement learning”, the method comprising the CM3 algorithm shown in Fig. 5 and [0252] “techniques presented herein”. The remainder of this claim is rejected under the same rationale as claim 1.
Claims 7-10 (Cancelled).
Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Hu and Moissinac in view of Yu et al., “Collective Learning for the Emergence of Social Norms in Networked Multiagent Systems”, hereinafter Yu, and further in view of Wang et al., “Shapley Q-value: A Local Reward Approach to Solve Global Reward Games”, hereinafter Wang (arXiv:1907.05707v4).
With respect to claim 3, the combination of Hu and Moissinac teaches the learning system according to claim 1, wherein the processor is further configured to execute the program stored on the memory to cause the learning system to perform the following, but does not disclose these limitations. Yu teaches wherein the reward granting unit performs:
a fourth step of causing the plurality of agents to perform weighted voting relating to whether to perform a cooperative action {Yu [P.2347 Sect.B ¶2-5] “Weighted voting” Eq. 6 details the performance of weighted voting; the weighting comprises the “weight of each agent” and considers “interacting agents”. Actions of agents are described with reinforcement learning and rewards, introducing [P.2342 ¶1] “cooperation among distributed agents to achieve a consensus in the whole system”}; and
Further, Hu discloses discounting reward [0172,0223] and the presence/absence of agents [0081,62].
However, the combination of Hu and Yu does not prima facie disclose that the reduced reward is subject to the condition of the following limitation.
Wang teaches:
a fifth step of, when a result of voting obtained in the absence of the target agent overturns, in a manner attributable to statistical analysis of voting among agents, a result of voting in the presence of the target agent, reducing a reward to be granted to the target agent by an amount of reward determined based on the result of voting in the absence of the target agent {Wang discloses [P.5 Sect5.1 ¶1] “reward will be reduced by 1” or [P.6 Last2¶] “reward will be reduced by 10”, where 1 or 10 are scalar amounts of reduction in reward, and describes agents such as cars in traffic or navigation settings. Mathematically, [P.3 Sect4.1] ∑_t γ^{t-1} r_t(C), where γ is the discount factor for the r-reward. The condition of when a voting result in the absence of the agent overturns the result of voting in the presence of the agent is interpreted as satisfied by the Shapley value approximation, which considers the marginal contribution of each agent and may “mask the actions of irrelevant agents”, considering “every possible combination of coalition C and agent i” [P.4 Last2¶]}.
Wang is directed to multi-agent reinforcement learning and is thus analogous art. A person having ordinary skill in the art would have considered it obvious prior to the effective filing date to reduce reward using the techniques of Wang in combination to arrive at the invention as claimed, for the stated motivation, “our motivation of importing the local reward approach (e.g., Shapley value) to solve out the credit assignment problems” [P.7 ¶2], and “ensures a reasonable payoff distribution and inspires our work on credit assignment in MARL” [P.2 Last¶].
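For illustration only, the fourth and fifth steps as characterized above (weighted voting, and reducing the reward when removing the target agent’s vote overturns the outcome, cf. the specification’s “tax” example r=a-2) can be sketched as follows; all names are hypothetical and do not reproduce Yu’s or Wang’s implementations:

```python
# Illustrative sketch only (hypothetical names): weighted voting and a
# conditional reward reduction, per the claim language mapped above.

def weighted_vote(votes, weights):
    """Fourth step: weighted voting on whether to cooperate.
    votes: +1 (agree) or -1 (disagree) per agent."""
    return sum(v * w for v, w in zip(votes, weights)) > 0

def grant_reward(base, votes, weights, target):
    """Fifth step: reduce the target agent's reward when the result of
    voting without it overturns the result of voting with it present
    (the tax of 2 echoes the specification's example, r = a - 2)."""
    with_target = weighted_vote(votes, weights)
    votes_wo = votes[:target] + votes[target + 1:]
    weights_wo = weights[:target] + weights[target + 1:]
    overturned = weighted_vote(votes_wo, weights_wo) != with_target
    return base - 2 if overturned else base

# Removing agent 1's heavily weighted "disagree" vote flips the outcome,
# so that agent's reward is reduced from 5 to 3:
print(grant_reward(5, [+1, -1, +1], [0.5, 1.0, 0.4], target=1))  # 3
```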
Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Chase P Hinckley whose telephone number is (571)272-7935. The examiner can normally be reached M-F 9:00 - 5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda M. Huang can be reached at 571-270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/CHASE P. HINCKLEY/Examiner, Art Unit 2124
/MIRANDA M HUANG/Supervisory Patent Examiner, Art Unit 2124