Prosecution Insights
Last updated: April 19, 2026
Application No. 17/790,574

POLICY LEARNING METHOD, POLICY LEARNING APPARATUS, AND PROGRAM

Final Rejection — §101, §103, §112

Filed: Jul 01, 2022
Examiner: WU, NICHOLAS S
Art Unit: 2148
Tech Center: 2100 — Computer Architecture & Software
Assignee: NEC Corporation
OA Round: 2 (Final)

Grant Probability: 47% (Moderate)
Expected OA Rounds: 3-4
Median Time to Grant: 3y 9m
Grant Probability with Interview: 90%

Examiner Intelligence

Career Allow Rate: 47% (grants 47% of resolved cases; 18 granted / 38 resolved; -7.6% vs TC avg)
Interview Lift: +43.1% allowance rate for resolved cases with an interview vs. without (strong)
Typical Timeline: 3y 9m average prosecution; 44 applications currently pending
Career History: 82 total applications across all art units
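These headline figures can be sanity-checked from the stated counts. The snippet below assumes the simplest formulas: allow rate as grants over resolved cases, and interview lift as the gap between with-interview and without-interview allowance rates. The page does not publish its exact methodology, so treat this as an approximation rather than the dashboard's actual computation.

```python
# Sanity check of the dashboard's headline examiner statistics. The formulas
# are assumptions; only the inputs (18/38 and 90%) come from the page itself.

granted, resolved = 18, 38
career_allow_rate = granted / resolved          # 0.4737 -> displayed as 47%
print(f"Career allow rate: {career_allow_rate:.1%}")

with_interview = 0.90                           # displayed "With Interview" rate
# The displayed +43.1% lift implies a without-interview baseline near 46.9%,
# slightly below the blended career rate computed above.
implied_baseline = with_interview - 0.431
print(f"Implied without-interview rate: {implied_baseline:.1%}")
```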

Statute-Specific Performance

§101: 26.7% (-13.3% vs TC avg)
§103: 52.6% (+12.6% vs TC avg)
§102: 3.1% (-36.9% vs TC avg)
§112: 17.4% (-22.6% vs TC avg)

Tech Center averages are estimates. Based on career data from 38 resolved cases.

Office Action

Rejections: §101, §103, §112
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments

Applicant's arguments filed 10/02/2025 have been fully considered but they are not persuasive.

Regarding the 101 rejections, on page 11 of "Remarks" applicant contends that amended claim 1 does not recite abstract ideas under Step 2A Prong 1. Specifically, applicant argues that the specific limitations "applying the first action element and further applying each of the second available selections associated with the second action element to obtain the another state for each of the second available selections, listing the another state, calculating a reward for changing to the another state and a value of the another state, and determining the another state based on the reward and the value; generating training data based on information used when determining the another state, and further training the machine learning model by using the training data; and transmitting the trained machine learning model to an electronic device" do not recite an abstract idea. The examiner respectfully disagrees.

Under the broadest reasonable interpretation, the limitation "applying the first action element and further applying each of the second available selections associated with the second action element to obtain the another state for each of the second available selections, listing the another state, calculating a reward for changing to the another state and a value of the another state, and determining the another state based on the reward and the value;" recites determining a state based on calculating a reward and a state value for each action chosen, which is a step of evaluation and judgment that can be performed mentally or with pen and paper. Steps of evaluation and judgment are mental processes (MPEP 2106). Similarly, under the broadest reasonable interpretation, the limitation "generating training data based on information used when determining the another state," recites providing feedback data that provides the biggest improvement to the model based on the rewards and values, which is a step of evaluation and judgment that can be performed mentally or with pen and paper. Steps of evaluation and judgment are mental processes (MPEP 2106). Therefore, the specific limitations argued recite an abstract idea.

Additionally, on pages 11-12 of "Remarks" applicant contends that the limitations mentioned above provide additional elements that incorporate the identified abstract ideas into a practical application under Step 2A Prong 2. Specifically, applicant argues that the limitations "and further training the machine learning model by using the training data; and transmitting the trained machine learning model to an electronic device" incorporate the abstract ideas into a practical application. The examiner respectfully disagrees. Under the broadest reasonable interpretation, the limitation "and further training the machine learning model by using the training data;" merely recites steps that apply generic training of a generic machine learning model, which amounts to merely adding the words "apply it", or an equivalent, and is not indicative of an inventive concept (MPEP 2106.05(f)).

Under the broadest reasonable interpretation, the limitation "and transmitting the trained machine learning model to an electronic device" recites steps of mere data gathering, which has been recognized by the courts as a well-understood, routine, and conventional function. Specifically, the courts have recognized computer functions directed to mere data gathering as well-understood, routine, and conventional functions when they are claimed in a merely generic manner or as insignificant extra-solution activity (MPEP 2106.05(g)).

Additionally, on page 12 of "Remarks" applicant contends that the claimed invention provides a technical solution specific to neural network models: "These specific limitations relate to a practical application that overcomes a technical problem with prior solutions. The specification lists out the problems in the context of attention models, reinforcement learning, and Actor-Critic method. Specification, §§ 2-8. In particular, the specification lays out 'the number of types of actions which the agent can execute varies for each state of the environment, a neural network cannot output a probability distribution with different numbers of elements (corresponding to the types of actions) for each state because the number of units in the output layer of the neural network is fixed' making it difficult to apply these methods in many instances. Id., 8." However, the claims do not have limitations specifying that the machine learning model being used is a neural network, let alone a reinforcement learning model, attention model, or actor-critic model. Therefore, the additional elements do not incorporate the identified abstract ideas into a practical application, and applicant's arguments regarding the 101 rejections are not persuasive.

Regarding the 103 rejections, applicant's arguments filed with respect to the prior art rejections have been fully considered but they are moot. Applicant has amended the claims to recite new combinations of limitations, and applicant's arguments are directed at the amendments. Please see below for new grounds of rejection, necessitated by amendment.

Claim Rejections - 35 USC § 112: Indefiniteness

The following is a quotation of 35 U.S.C. 112(b):

(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 1-14 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.

Regarding claim 1 and analogous claims 7 and 13, the limitation "calculating a selection rate for each first available selection from the first available selections associated with the state of the action element" is indefinite because it is unclear whether the first available selections are associated with a state or not: a prior limitation in claim 1, "wherein first available selections associated with the first action element do not depend upon a state of the action element", contradicts the limitation.
For the purposes of examination, the calculation of the selection rate is interpreted as being applied to any action element.

Regarding claims 2-6, 8-12, and 14, the claims are rejected for at least their dependence upon one of the independent claims.

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows:

Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-14 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.

Regarding claim 1, in Step 1 of the 101 analysis set forth in MPEP 2106, the claim recites "A method for training a machine learning model for policy understanding, the method being executed by at least one processor." The claim recites a method, and a method is one of the four statutory categories of invention.

In Step 2A, Prong 1 of the 101 analysis set forth in MPEP 2106, the examiner has determined that the following limitations recite a process that, under the broadest reasonable interpretation, covers a mental process or mathematical concept but for the recitation of generic computer components:

"the method comprising: based on an action element being selected upon determination that a predetermined state in a predetermined environment has changed to another state, identifying a first action element that is state-independent and identifying a second action element that is state-dependent, wherein first available selections associated with the first action element do not depend upon a state of the action element, and wherein second available selections associated with the second action element depend upon the state of the action element;" (i.e., the broadest reasonable interpretation includes a step of evaluation and observation that could be performed mentally or with pen and paper, like identifying a state-dependent action and a state-independent action when a state changes, which is a mental process of evaluation/observation (MPEP 2106)).

"calculating a selection rate for each first available selection from the first available selections associated with the state of the action element ... and selecting the first action element based on the selection rate;" (i.e., the broadest reasonable interpretation includes a step of evaluation and judgment that could be performed mentally or with pen and paper, like ranking actions/choices based on which ones were selected the most in previous iterations, which is a mental process of evaluation/judgment (MPEP 2106)).

"applying the first action element and further applying each of the second available selections associated with the second action element to obtain the another state for each of the second available selections, listing the another state, calculating a reward for changing to the another state and a value of the another state, and determining the another state based on the reward and the value;" (i.e., the broadest reasonable interpretation includes a step of evaluation and judgment that could be performed mentally or with pen and paper, like determining a state based on the largest reward and value, which is a mental process of evaluation/judgment (MPEP 2106)).

"generating training data based on information used when determining the another state," (i.e., the broadest reasonable interpretation includes a step of evaluation and judgment that could be performed mentally or with pen and paper, like providing feedback data that provides the biggest improvement to the model based on the rewards and values, which is a mental process of evaluation/judgment (MPEP 2106)).

The claim limitations, under their broadest reasonable interpretation, cover activities classified under mental processes: concepts performed in the human mind (including observation, evaluation, judgment, or opinion) (see MPEP 2106.04(a)(2), subsection (III)), or mathematical concepts: mathematical relationships, mathematical formulas or equations, or mathematical calculations (see MPEP 2106.04(a)(2), subsection (I)). Accordingly, the claim recites an abstract idea.

In Step 2A, Prong 2 of the 101 analysis set forth in MPEP 2106, the examiner has determined that the following additional elements do not integrate this judicial exception into a practical application:

"using a machine learning model which is being trained," (i.e., the generic computer components recited in this limitation merely add the words "apply it", or an equivalent, or mere instructions to implement an abstract idea on a computer, or merely use a computer as a tool to perform an abstract idea (MPEP 2106.05(f))).

"and further training the machine learning model by using the training data;" (i.e., the generic computer components recited in this limitation merely add the words "apply it", or an equivalent, or mere instructions to implement an abstract idea on a computer, or merely use a computer as a tool to perform an abstract idea (MPEP 2106.05(f))).

"and transmitting the trained machine learning model to an electronic device." (i.e., the broadest reasonable interpretation of transmitting a model to a device is mere data gathering, which is insignificant extra-solution activity (MPEP 2106.05(g))).

Since the claim does not contain any other additional elements that amount to integration into a practical application, the claim is directed to an abstract idea.

In Step 2B of the 101 analysis set forth in the 2019 PEG, the examiner has determined that the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. Limitation (VII), under the broadest reasonable interpretation, recites steps of mere data gathering, which has been recognized by the courts as a well-understood, routine, and conventional function. Specifically, the courts have recognized computer functions directed to mere data gathering as well-understood, routine, and conventional functions when they are claimed in a merely generic manner or as insignificant extra-solution activity, considering evidence in view of Berkheimer v. HP, Inc., 881 F.3d 1360, 1368, 125 USPQ2d 1649, 1654 (Fed. Cir. 2018); see USPTO Berkheimer Memorandum (April 2018). The examiner uses Berkheimer Option 2, a citation to one or more of the court decisions discussed in MPEP 2106.05(d)(II) as noting the well-understood, routine, and conventional nature of the additional elements: receiving or transmitting data over a network, e.g., using the Internet to gather data, Symantec, 838 F.3d at 1321, 120 USPQ2d at 1362 (utilizing an intermediary computer to forward information); TLI Communications LLC v. AV Auto. LLC, 823 F.3d 607, 610, 118 USPQ2d 1744, 1745 (Fed. Cir. 2016) (using a telephone for image transmission); OIP Techs., Inc. v. Amazon.com, Inc., 788 F.3d 1359, 1363, 115 USPQ2d 1090, 1093 (Fed. Cir. 2015) (sending messages over a network); buySAFE, Inc. v. Google, Inc., 765 F.3d 1350, 1355, 112 USPQ2d 1093, 1096 (Fed. Cir. 2014) (computer receives and sends information over a network). See MPEP 2106.05(d)(II).

Further, limitations (V) and (VI), under the broadest reasonable interpretation, recite mere instructions to apply generic use and training of a generic machine learning model using training data, which amounts to merely adding the words "apply it", or an equivalent, and is not indicative of an inventive concept (MPEP 2106.05(f)). Considering the additional elements individually and in combination, and the claim as a whole, the additional elements do not provide significantly more than the abstract idea. Therefore, the claim is not patent eligible.

Regarding claim 2, it is dependent upon claim 1 and fails to resolve the deficiencies identified above by integrating the judicial exception into a practical application or introducing significantly more than the judicial exception. For example, claim 2 recites "the method further comprising: calculating the value of the another state by using a second model which is being trained; and further training the second model by using the training data." Under the broadest reasonable interpretation, the limitations merely recite steps that apply a machine learning model for generic training and output, which amounts to merely adding the words "apply it", or an equivalent, and is not indicative of an inventive concept (MPEP 2106.05(f)). Therefore, claim 2 does not resolve the deficiencies of claim 1.

Regarding claim 3, it is dependent upon claim 1 and fails to resolve the deficiencies identified above by integrating the judicial exception into a practical application or introducing significantly more than the judicial exception. For example, claim 3 recites "the method further comprising determining the another state based on maximizing a sum of the reward and the value." Under the broadest reasonable interpretation, the limitation recites determining a state based on maximizing a sum of two values, which is a step of evaluation and judgment that can be performed mentally or with pen and paper. Steps of observation, evaluation, and judgment are mental processes. Therefore, claim 3 does not resolve the deficiencies of claim 1.

Regarding claim 4, it is dependent upon claim 1 and fails to resolve the deficiencies identified above by integrating the judicial exception into a practical application or introducing significantly more than the judicial exception. For example, claim 4 recites "the method further comprising generating the training data in which at least the state, the first action element, and a maximum value of a sum of the reward and the value calculated when determining the another state are associated." Under the broadest reasonable interpretation, the limitations merely recite steps that further limit the abstract idea by indicating that the values are all related to each other. Further limiting the abstract idea is still interpreted as an abstract idea. Therefore, claim 4 does not resolve the deficiencies of claim 1.
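To make the lookahead at issue in claims 1 and 3 concrete, here is a minimal Python sketch of one reading of that limitation: enumerate each second available selection, obtain the resulting state, score it by reward plus estimated value, and keep the maximizer. The function names and callback signatures are illustrative assumptions, not language from the claims or the cited art.

```python
# Hedged sketch of the claimed one-step lookahead (claims 1 and 3): list the
# candidate "another state" for every second available selection, then pick
# the candidate maximizing reward + value. All names here are illustrative.

def choose_next_state(state, first_action, second_selections,
                      step, reward_fn, value_fn):
    candidates = []
    for selection in second_selections:
        next_state = step(state, first_action, selection)  # "obtain the another state"
        score = reward_fn(state, next_state) + value_fn(next_state)  # reward + value
        candidates.append((score, selection, next_state))  # "listing the another state"
    # "determining the another state based on maximizing a sum of the reward and the value"
    best_score, best_selection, best_state = max(candidates, key=lambda c: c[0])
    return best_state, best_selection, best_score
```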
Regarding claim 5, it is dependent upon claim 1 and fails to resolve the deficiencies identified above by integrating the judicial exception into a practical application or introducing significantly more than the judicial exception. For example, claim 5 recites "wherein based on the predetermined environment being a graph rewriting system in which a graph serving as the state is rewritten and shifted to another graph serving as the another state, the first action element is a graph rewriting rule representing a rule for rewriting the graph, and the second action element is a rule application location representing a location to apply the graph rewriting rule in the graph." Under the broadest reasonable interpretation, the limitations merely recite steps that amount to indicating a field of use or technological environment in which to apply a judicial exception (MPEP 2106.05(h)). Therefore, claim 5 does not resolve the deficiencies of claim 1.

Regarding claim 6, it is dependent upon claim 5 and fails to resolve the deficiencies identified above by integrating the judicial exception into a practical application or introducing significantly more than the judicial exception. For example, claim 6 recites "the method further comprising: calculating another selection rate of each of choices of the graph rewriting rule in the graph by using the machine learning model, and selecting the graph rewriting rule based on the another selection rate;". Under the broadest reasonable interpretation, the limitations recite assigning a selection rate to each action/choice, which is a step of evaluation and judgment that can be performed mentally or with pen and paper. Steps of observation, evaluation, and judgment are mental processes. Additionally, the limitation recites the use of a generic machine learning model to perform the selection, which merely recites steps that apply a generic machine learning model as a tool to perform a judicial exception, amounting to merely adding the words "apply it", or an equivalent, which is not indicative of an inventive concept (MPEP 2106.05(f)). Claim 6 also recites "and applying the selected graph rewriting rule to each rule application locations in the graph to obtain the another state, calculating the reward and the value for the another state, and determining the another state based on the reward and the value." Under the broadest reasonable interpretation, the limitations recite determining a reward and value for a state given rules, which is a step of evaluation and judgment that can be performed mentally or with pen and paper. Steps of observation, evaluation, and judgment are mental processes. Therefore, claim 6 does not resolve the deficiencies of claim 5.

Regarding claims 7-13, they are similar to claims 1-6 and rejected under the same rationales.

Regarding claim 14, it is dependent upon claim 1 and fails to resolve the deficiencies identified above by integrating the judicial exception into a practical application or introducing significantly more than the judicial exception. For example, claim 14 recites "the method further comprising: from first available selections and the second available selections, listing a respective another state for each of the second available selections to which the first action element is applied." Under the broadest reasonable interpretation, the limitations recite steps of mere data outputting, which has been recognized by the courts as a well-understood, routine, and conventional function.
Specifically, the courts have recognized computer functions directed to mere data outputting as well-understood, routine, and conventional functions when they are claimed in a merely generic manner or as insignificant extra-solution activity (MPEP 2106.05(g)). Therefore, claim 14 does not resolve the deficiencies of claim 1.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103, which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-4, 7-10, and 13-14 are rejected under 35 U.S.C. 103 as being unpatentable over Hermann, et al., US Pre-Grant Publication 2021/0110115A1 ("Hermann") in view of Friske, et al., Non-Patent Literature "Speeding Up Autonomous Learning by Using State-Independent Option Policies and Termination Improvement" ("Friske"), and further in view of Yao, et al., US Pre-Grant Publication 2019/0101917A1 ("Yao").

Regarding claim 1 and analogous claims 7 and 13, Hermann discloses:

"A method for training a machine learning model for policy understanding, the method being executed by at least one processor," (Hermann, ¶6, "a system for selecting actions to be performed by an agent interacting with an environment [A method for training a machine learning model for policy understanding,], the system including one or more computers [the method being executed by at least one processor,] and one or more storage devices storing instructions.").

"the method comprising: based on an action element being selected upon determination that a predetermined state in a predetermined environment has changed to another state," (Hermann, ¶70, "The action selection system 100 selects actions 102 [the method comprising: based on an action element being selected] to be performed by an agent 104 interacting with an environment 106 at each of multiple time steps. In order for the agent 104 to interact with the environment 106, at each time step, the system 100 receives an input and selects an action 102 to be performed by the agent 104 in response to the received input. Generally, the input includes data characterizing the current state of the environment 106"; multiple time steps of the environment is interpreted as the environment shifting to another state (i.e., upon determination that a predetermined state in a predetermined environment has changed to another state)).

"identifying a first action element ... and identifying a second action element that is state-dependent, ... and wherein second available selections associated with the second action element depend upon the state of the action element;" (Hermann, ¶70, "The action selection system 100 selects actions 102 [identifying a first action element ... and identifying a second action element] to be performed by an agent 104 interacting with an environment 106 at each of multiple time steps. In order for the agent 104 to interact with the environment 106, at each time step, the system 100 receives an input and selects an action 102 to be performed by the agent 104 in response to the received input. Generally, the input includes data characterizing the current state [that is state-dependent, ... and wherein second available selections associated with the second action element depend upon the state of the action element;] of the environment 106").

"calculating a selection rate for each first available selection from the first available selections associated with the state of the action element using a machine learning model which is being trained," (Hermann, ¶79, "In some implementations, the action selection output 124 defines a probability distribution over actions in a set of possible actions that can be performed by the agent 104. For example, the action selection output 124 may include a respective numerical likelihood value for each action in the set of possible actions"; a probability distribution over all the possible choices/actions is interpreted as a selection rate (i.e., calculating a selection rate for each first available selection from the first available selections associated with the state of the action element); and Hermann, ¶78, "For example, the system 100 may process the current combined embedding 120 using an action selection neural network 122 [using a machine learning model which is being trained,] in accordance with current values of action selection neural network parameters to generate an action selection output 124.").

"and selecting the first action element based on the selection rate;" (Hermann, ¶79, "In some of these implementations, the system 100 selects the action 102 to be performed by the agent 104 by sampling an action from the set of possible actions based on the probability distribution [and selecting the first action element based on the selection rate;].").

"applying the selected first action element" (Hermann, ¶79, "In some of these implementations, the system selects the action 102 to be performed by the agent 104 [applying the selected first action element] as the action from the set of possible actions having the highest probability according to the probability distribution.").

"further applying each of the second available selections associated with the second action element to obtain the another state for each of the second available selections, ..." (Hermann, ¶42, "the prediction auxiliary neural network module is configured to predict a future state [to obtain the another state for each of the second available selections, ...] of the environment input data from a current state of the environment input data and the action data [further applying each of the second available selections associated with the second action element].").

"... calculating a reward for changing to the another state and a value of the another state, and determining the another state based on the reward and the value;" (Hermann, ¶106-107, "The reward can be based on any event or aspect of the environment. For example, the reward may indicate whether the agent has successfully performed a task specified by the text string. In some implementations, the system receives a current reward as a result of the agent performing a current action in response to a current observation", as described with reference to FIG. 2; the current reward in this example is interpreted as the other reward as the previous reward shifts to the current, or other, state (i.e., calculating a reward for changing to the another state). "In some implementations, the system obtains data including a previous reward, a previous action performed by the agent, and an observation characterizing a previous state of the environment at a previous time step from a replay memory ... When the reinforcement learning technique is an actor-critic technique, the action selection output of the action selection neural network may include an estimate of the value of the current state [and a value of the another state,], i.e., a time-discounted return resulting from the environment being in the current state"; one of ordinary skill knows that the actor-critic method determines the other state based on the reward and value, as the actor selects the next action and the critic provides feedback based on the actor's selection (i.e., and determining the another state based on the reward and the value;)).

"generating training data based on information used when determining the another state, and further training the machine learning model by using the training data;" (Hermann, ¶93, "The training engine 128 determines task-specific parameter updates [generating training data] for the system 100 using an actor-critic reinforcement learning technique [based on information used when determining the another state, and further training the machine learning model by using the training data;] based on the rewards received by the system 100, as will be described further with reference to FIG. 3.").

"and transmitting the trained machine learning model to an electronic device." (Hermann, ¶78, "For example, the system 100 may process the current combined embedding 120 using an action selection neural network 122 [the trained machine learning model] in accordance with current values of action selection neural network parameters to generate an action selection output 124."; and Hermann, ¶69, "FIG. 1A shows an example action selection system 100. The action selection system 100 is an example of a system implemented as computer programs on one or more computers in one or more locations in which the systems, components, and techniques described below are implemented [and transmitting ... to an electronic device.].").

While Hermann teaches a policy learning method using an actor-critic reinforcement learning algorithm, Hermann does not explicitly teach:

"... that is state-independent ... wherein first available selections associated with the first action element do not depend upon a state of the action element, ..."
"... listing the another state, ..."

Friske teaches "... that is state-independent ... wherein first available selections associated with the first action element do not depend upon a state of the action element, ..." (Friske, pg. 1 col. 1, "This paper analyses options OS, which can be defined as fixed sequences of actions, where each one of them is taken independently of the current state [... that is state-independent ... wherein first available selections associated with the first action element do not depend upon a state of the action element, ...].").

Hermann and Friske are in the same field of endeavor (reinforcement learning). It would have been obvious for a person having ordinary skill in the art before the effective filing date of the claimed invention to combine Hermann and Friske to teach the above limitation(s). The motivation for doing so is that using state-independent actions allows for more robustness in applications, as the independent actions aid in environments where the states are unknown (cf. Friske, pg. 1 col. 2, "Thus, these policies can be more appropriate for problems where there is not complete observation of the visited states ... The independency with respect to the states provides flexibility in the use of the option in the state space. In other words, an option Os can be chosen in any state").

While Hermann in view of Friske teaches a policy learning method using a state-dependent and state-independent actor-critic reinforcement learning algorithm, the combination does not explicitly teach:

"... listing the another state, ..."

Yao teaches "... listing the another state, ..." (Yao, ¶5, "the method comprises obtaining a plurality of action sequences each comprising a plurality of actions for the object, wherein the one or more predicted subsequent states of the object in the environment are predicted using the action model, the current state of the object in the environment and the plurality of action sequences"; predicting multiple subsequent states is interpreted as listing the states (i.e., listing the another state)).

Hermann in view of Friske, and Yao, are in the same field of endeavor (reinforcement learning). It would have been obvious for a person having ordinary skill in the art before the effective filing date of the claimed invention to combine Hermann, in view of Friske, with Yao to teach the above limitation(s). The motivation for doing so is that predicting multiple subsequent states improves the stability and reliability of a policy (cf. Yao, ¶8, "a pre-learned/pre-trained action model to predict subsequent states based on a current state and a selected action to provide improved learning samples resulting in faster convergence, requiring fewer samples, and outputting a policy that is more reliable, more stable").
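The "selection rate" mapping above turns on Hermann's action selection output defining a probability distribution over possible actions, from which an action is either sampled or taken greedily. Below is a generic sketch of that pattern, assuming a softmax normalization over model scores; the softmax choice and all names are assumptions of this note, not details from Hermann's quoted text.

```python
import numpy as np

# Generic illustration (not Hermann's actual implementation) of an action
# selection output that "defines a probability distribution over actions":
# raw scores are normalized, then an action is sampled from the distribution
# or taken greedily (argmax), matching the two modes quoted from Hermann ¶79.

def selection_rates(scores: np.ndarray) -> np.ndarray:
    exp = np.exp(scores - scores.max())          # numerically stable softmax
    return exp / exp.sum()

def select_action(scores: np.ndarray, greedy: bool = False, rng=None) -> int:
    rates = selection_rates(scores)
    if greedy:
        return int(rates.argmax())               # highest-probability action
    rng = rng or np.random.default_rng()
    return int(rng.choice(len(rates), p=rates))  # sample per the distribution
```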
Regarding claim 2 and analogous claim 8, Hermann in view of Friske and Yao teaches the method according to claim 1. Hermann further teaches "calculating the value of the another state by using a second model which is being trained; and further training the second model by using the training data." (Hermann, ¶93, "The training engine 128 determines task-specific parameter updates for the system 100 using an actor-critic reinforcement learning technique ... based on the rewards received by the system 100, as will be described further with reference to FIG. 3. The training engine 128 determines auxiliary parameter updates [and further training the second model by using the training data.] for the system 100 based on the next observation prediction 146, the predicted meaningful word 150, the predicted reward 154, and the auxiliary value 156 [calculating the value of the another state], as will be described further with reference to FIG. 4."; using an actor-critic methodology is interpreted as having a second model for the critic (i.e., by using a second model which is being trained)).

Regarding claim 3 and analogous claim 9, Hermann in view of Friske and Yao teaches the method according to claim 1. Hermann further teaches "determining the another state based on maximizing a sum of the reward and the value." (Hermann, ¶107, "For example, the task-specific objective may be to maximize a time-discounted expected return received in response to actions [comprising determining the another state based on maximizing a sum of the reward] performed by the agent during performance of the current task ... The machine learning training technique may be a reinforcement learning technique to optimize the task-specific objective. For example, the reinforcement learning technique may be a policy gradient technique, a one-step or n-step Q learning technique, or an actor-critic technique. When the reinforcement learning technique is an actor-critic technique, the action selection output of the action selection neural network may include an estimate of the value of the current state [and the value.], i.e., a time-discounted return resulting from the environment being in the current state.").

Regarding claim 4 and analogous claim 10, Hermann in view of Friske and Yao teaches the method according to claim 1. Hermann further teaches "generating the training data in which at least the state, the first action element, and a maximum value of a sum of the reward and the value calculated when determining the another state are associated." (Hermann, ¶107, "When the reinforcement learning technique [comprising generating the training data] is an actor-critic technique, the action selection output [the first action element,] of the action selection neural network may include an estimate of the value of the current state [in which at least the state,], i.e., a time-discounted return [and a maximum value of a sum of the reward and the value] resulting from the environment being in the current state"; the state, the first action element, the maximum value of a sum of the reward, and the value are interpreted as associated because these values are related and used in the actor-critic learning process (i.e., when determining the another state are associated)).

Regarding claim 14, Hermann in view of Friske and Yao teaches the method according to claim 1. Yao further teaches "from first available selections and the second available selections, listing a respective another state for each of the second available selections to which the first action element is applied." (Yao, ¶5, "In some examples, the method comprises obtaining a plurality of action sequences each comprising a plurality of actions for the object, wherein the one or more predicted subsequent states of the object in the environment are predicted using the action model, the current state of the object in the environment and the plurality of action sequences [from first available selections and the second available selections, listing a respective another state for each of the second available selections]"; and Yao, ¶6, "A subsequent state of the object in the environment, s′, is predicted from a current training state, s, from sample data set D {(s, a, s′)}, for at least two corresponding training actions, a [to which the first action element is applied.]."). It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Yao with the teachings of Hermann and Friske for the same reasons disclosed for claim 1.
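The claim 2 mapping reads Hermann's critic as the claimed "second model". For orientation, here is a minimal one-step actor-critic update in the spirit the examiner attributes to Hermann ("the actor selects the next action and the critic provides feedback"); the linear models, feature vectors, and learning rates are invented for illustration, whereas Hermann's system uses neural networks.

```python
import numpy as np

# Illustrative one-step actor-critic update (a sketch, not Hermann's method):
# the critic is a second, separately trained model whose TD error provides
# the feedback signal used to train the actor/policy.

def actor_critic_step(theta, w, phi_s, phi_s_next, action_grad, reward,
                      gamma=0.99, lr_actor=1e-2, lr_critic=1e-1):
    v_s = w @ phi_s                              # critic's value of current state
    v_next = w @ phi_s_next                      # critic's value of "the another state"
    td_error = reward + gamma * v_next - v_s     # critic feedback on the actor's choice
    w = w + lr_critic * td_error * phi_s         # train the critic ("second model")
    theta = theta + lr_actor * td_error * action_grad  # train the actor/policy
    return theta, w
```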
Claims 5-6 and 11-12 are rejected under 35 U.S.C. 103 as being unpatentable over Hermann, et al., US Pre-Grant Publication 2021/0110115A1 ("Hermann") in view of Friske, et al., Non-Patent Literature "Speeding Up Autonomous Learning by Using State-Independent Option Policies and Termination Improvement" ("Friske"), further in view of Yao, et al., US Pre-Grant Publication 2019/0101917A1 ("Yao"), and further in view of Segler, Non-Patent Literature "World Programs for Model-Based Learning and Planning in Compositional State and Action Spaces" ("Segler").

Regarding claim 5 and analogous claim 11, Hermann in view of Friske and Yao teaches the method according to claim 1. While Hermann in view of Friske and Yao teaches an actor-critic learning methodology using state-independent actions, the combination does not explicitly teach:

"wherein based on the predetermined environment being a graph rewriting system in which a graph serving as the state is rewritten and shifted to another graph serving as the another state, the first action element is a graph rewriting rule representing a rule for rewriting the graph, and the second action element is a rule application location representing a location to apply the graph rewriting rule in the graph."

Segler teaches "wherein based on the predetermined environment being a graph rewriting system in which a graph serving as the state is rewritten and shifted to another graph serving as the another state," (Segler, pg. 2 col. 1, "Let G be the set of all possible graphs G, which are defined by an unrestricted graph grammar (Rozenberg, 1997). The state space S ⊂ m℘(G), where m℘ denotes all sub-multisets formed by the elements of G. Actions, which transform states into new states, can then be seen as graph rewriting [wherein based on the predetermined environment being a graph rewriting system in which a graph serving as the state is rewritten and shifted to another graph serving as the another state,] rules p : L → R, which means a graph L is matched in the graphs in a state s via (sub)graph isomorphism, cut out, and a different graph R is glued in in this position.").

Segler also teaches "the first action element is a graph rewriting rule representing a rule for rewriting the graph, and the second action element is a rule application location representing a location to apply the graph rewriting rule in the graph." (Segler, pg. 2, Figure 1 [figure reproduced in the original Office action, not shown here]. In Figure 1, the rules or subroutines are interpreted as the first action element, and the individual shapes that make up the rules are interpreted as the second action elements, because the shapes are the locations/elements of where to apply the rules, as the rules control how the shapes interact with each other (i.e., the first action element is a graph rewriting rule representing a rule for rewriting the graph, and the second action element is a rule application location representing a location to apply the graph rewriting rule in the graph.)).

Hermann in view of Friske and Yao, and Segler, are in the same field of endeavor (reinforcement learning). It would have been obvious for a person having ordinary skill in the art before the effective filing date of the claimed invention to combine Hermann, in view of Friske and Yao, with Segler to teach the above limitation(s). The motivation for doing so is that using graph rewriting for action planning increases the robustness of a reinforcement learning environment when actions are not known in advance (cf. Segler, pg. 1 col. 2, "Here, we describe a general formalism to address this issue for learning and planning in compositional state spaces with a priori unknown actions").
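The claim 5 framing mapped onto Segler (state = graph, first action element = rewriting rule, second action element = application location) can be sketched in a few lines. The Rule machinery and the adjacency-dict graph representation below are invented for illustration only; Segler works with graph grammars and (sub)graph isomorphism matching.

```python
# Toy illustration of the claim 5 structure the examiner maps onto Segler:
# the state is a graph, the first action element is a rewriting rule, and the
# second action element is a location at which that rule applies.

from dataclasses import dataclass
from typing import Callable, List

Graph = dict  # hypothetical adjacency-dict representation of a graph state

@dataclass
class RewriteRule:
    name: str
    matches: Callable[[Graph], List[int]]    # candidate application locations
    apply_at: Callable[[Graph, int], Graph]  # rewrite the graph at a location

def successor_states(graph: Graph, rule: RewriteRule) -> List[Graph]:
    """List the "another state" for each location where the rule applies."""
    return [rule.apply_at(graph, loc) for loc in rule.matches(graph)]
```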
Regarding claim 6 and analogous claim 12, Hermann in view of Friske, Yao, and Segler teaches the method according to claim 5. Hermann also teaches calculating the reward and the value for the another state, and determining the another state based on the reward and the value, as seen in claim 1. Segler further teaches:

"comprising: calculating another selection rate of each of choices of the graph rewriting rule in the graph by using the machine learning model, and selecting the graph rewriting rule based on the another selection rate;" (Segler, pg. 3 col. 2, "To learn A(s), we train a neural network to predict the probability over all rules given s, and restrict the available actions to the top-k rules with a cumulative probability of > 99%"; selecting the rules based on a probability over all the rules is interpreted as a probability distribution, and selecting based on a probability distribution is interpreted as a selection rate (i.e., comprising: calculating another selection rate of each of choices of the graph rewriting rule in the graph by using the machine learning model, and selecting the graph rewriting rule based on the another selection rate;)).

"and applying the selected graph rewriting rule to each rule application locations in the graph to obtain the another state," (Segler, pg. 2 col. 1, "Actions, which transform states into new states [to each rule application locations in the graph to obtain the another state,], can then be seen as graph rewriting rules [and applying the selected graph rewriting rule] p : L → R, which means a graph L is matched in the graphs in a state s via (sub)graph isomorphism, cut out, and a different graph R is glued in in this position.").

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Segler with the teachings of Hermann, Friske, and Yao for the same reasons disclosed for claim 5.

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Wang, et al., US20190258918A1 discloses a reinforcement learning methodology using actor-critic learning algorithms and an action selection neural network.

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to NICHOLAS S WU, whose telephone number is (571) 270-0939. The examiner can normally be reached Monday - Friday, 8:00 am - 4:00 pm EST. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Michelle Bechtold, can be reached at 571-431-0762.
The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/N.S.W./
Examiner, Art Unit 2148

/MICHELLE T BECHTOLD/
Supervisory Patent Examiner, Art Unit 2148

Prosecution Timeline

Jul 01, 2022: Application Filed
Jun 30, 2025: Non-Final Rejection — §101, §103, §112
Sep 11, 2025: Applicant Interview (Telephonic)
Sep 11, 2025: Examiner Interview Summary
Oct 02, 2025: Response Filed
Jan 07, 2026: Final Rejection — §101, §103, §112 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12488244: APPARATUS AND METHOD FOR DATA GENERATION FOR USER ENGAGEMENT
Granted Dec 02, 2025 (2y 5m to grant)

Patent 12423576: METHOD AND APPARATUS FOR UPDATING PARAMETER OF MULTI-TASK MODEL, AND STORAGE MEDIUM
Granted Sep 23, 2025 (2y 5m to grant)

Patent 12361280: METHOD AND DEVICE FOR TRAINING A MACHINE LEARNING ROUTINE FOR CONTROLLING A TECHNICAL SYSTEM
Granted Jul 15, 2025 (2y 5m to grant)

Patent 12354017: ALIGNING KNOWLEDGE GRAPHS USING SUBGRAPH TYPING
Granted Jul 08, 2025 (2y 5m to grant)

Patent 12333425: HYBRID GRAPH NEURAL NETWORK
Granted Jun 17, 2025 (2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.

AI Strategy Recommendation

Get an AI-powered prosecution strategy using examiner precedents, rejection analysis, and claim mapping.
Powered by AI — typically takes 5-10 seconds

Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 47%
With Interview: 90% (+43.1%)
Median Time to Grant: 3y 9m
PTA Risk: Moderate

Based on 38 resolved cases by this examiner. Grant probability derived from career allow rate.
