Prosecution Insights
Last updated: April 19, 2026
Application No. 18/073,651

Learning Neuro-Symbolic World Models

Non-Final OA — §101, §103
Filed: Dec 02, 2022
Examiner: MCINTOSH, ANDREW T
Art Unit: 2144
Tech Center: 2100 — Computer Architecture & Software
Assignee: International Business Machines Corporation
OA Round: 1 (Non-Final)
Grant Probability: 77% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 3y 0m
With Interview: 95%

Examiner Intelligence

Career Allow Rate: 77% — above average (393 granted / 511 resolved; +21.9% vs TC avg)
Interview Lift: +18.0% — strong, measured among resolved cases with an interview
Avg Prosecution: 3y 0m (27 applications currently pending)
Total Applications: 538, across all art units
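
As a sanity check on the figures above, here is a minimal Python sketch of the arithmetic. The numbers come from this page; the additive interview adjustment is an assumption about how the tool combines them, not its documented model:

    # Illustrative arithmetic only; figures taken from this page.
    granted = 393
    resolved = 511
    allow_rate = granted / resolved            # 0.769 -> shown as 77%

    interview_lift = 0.18                      # +18.0% lift shown above
    # Assumed simple additive adjustment (not a documented formula):
    with_interview = allow_rate + interview_lift   # 0.949 -> shown as ~95%

    print(f"Career allow rate: {allow_rate:.1%}")          # 76.9%
    print(f"Estimated with interview: {with_interview:.1%}")  # 94.9%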

Statute-Specific Performance

Statute   Examiner Rate   vs TC Avg
§101      14.1%           -25.9%
§103      56.7%           +16.7%
§102      13.5%           -26.5%
§112       7.5%           -32.5%
Tech Center averages are estimates • Based on career data from 511 resolved cases
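
Assuming each "vs TC avg" figure is a simple difference between the examiner's rate and the Tech Center average (an assumption; the tool does not define it here), the implied baseline can be recovered with a quick Python check:

    # Quick consistency check under the simple-difference assumption.
    examiner = {"101": 14.1, "103": 56.7, "102": 13.5, "112": 7.5}
    delta    = {"101": -25.9, "103": 16.7, "102": -26.5, "112": -32.5}
    for statute, rate in examiner.items():
        tc_avg = rate - delta[statute]   # implied TC average
        print(f"§{statute}: implied TC avg = {tc_avg:.1f}%")

Each implied Tech Center average comes out to 40.0%, which suggests the tool measures every statute against a single estimated baseline rather than per-statute averages.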

Office Action — §101, §103

DETAILED ACTION

This action is responsive to communications filed on December 2, 2022. This action is made Non-Final. Claims 1-20 are pending in the case. Claims 1, 11, and 20 are independent claims. Claims 1-20 are rejected.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement

The information disclosure statements (IDSs) submitted on 12/02/2022 and 12/04/2025 are in compliance with the provisions of 37 C.F.R. 1.97. Accordingly, the IDSs are being considered by the examiner.

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows: Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to a judicial exception (i.e., a law of nature, a natural phenomenon, or an abstract idea) without significantly more.

Step 1: Independent claims 1, 11, and 20 are directed towards a method, medium (see Spec Para. 0028), and apparatus, respectively. Therefore, these claims, as well as their dependent claims, are directed towards one of the four statutory categories (process, machine (i.e. apparatus), manufacture, or composition of matter).

With respect to claim 1: 2A Prong 1: Claim 1 recites the following judicial exceptions:

receiving ... a previous state of an environment and a previous action taken by an agent in the environment (mental process – can be performed in the human mind, or by a human using pen and paper, e.g. a person may receive and record state and action data);

estimating ... a current state by using a transition model which receives a pair of state and action and produces a next state (mental process – can be performed in the human mind, or by a human using pen and paper, e.g. a person may receive and record state and action data and use a transition model to produce a next state based on the received data);

modifying ... an estimate of the transition model so that the modified estimate of the transition model prevents a past invalid action from recurring in a corresponding state, wherein the past invalid action taken in the corresponding state did not cause a change in state (mental process – can be performed in the human mind, or by a human using pen and paper, e.g. a person may receive and record state and action data, use a transition model to produce a next state based on the received data, and further adjust an estimate of the transition model to avoid issues previously encountered);

passing ... the current state and the modified estimate of the transition model ... for generation of a next action to take in the environment (mental process – can be performed in the human mind, or by a human using pen and paper, e.g. a person may receive and record state and action data, use a transition model to produce a next state based on the received data, adjust an estimate of the transition model to avoid issues previously encountered, and further transfer the current state data and adjusted estimate for determining a next action to take).

2A Prong 2: The additional elements recited in the claim do not integrate the judicial exception into a practical application. Additional elements: "by a proprioception module ... by the proprioception module ... by the proprioception module ... by the proprioception module" (mere instructions to apply the exception or implement the exception on a computer, e.g. computing by executing a specific module on data; see MPEP §2106.05(f)); "to a model-based RL computer model" (generally linking the use of a judicial exception to a particular technological environment or field of use, e.g. using goal-oriented learning to optimize actions; see MPEP §2106.05(h)). The additional elements do not effectively integrate the abstract idea into a practical application. 2B: Revisiting the additional elements, they do not amount to significantly more than the judicial exception – they are recited at a high level of generality and correspond to storing and retrieving information in memory and performing calculations.

With respect to claim 2: 2A Prong 1: Claim 2 recites the following judicial exceptions: storing ... received past state-action pairs ...; and labeling ... state-action pairs ... as valid if a change in the next state was induced and invalid if no change was observed (mental process – can be performed in the human mind, or by a human using pen and paper, e.g. a person may receive and record state and action data and label the validity based on state change status). 2A Prong 2: The additional elements recited in the claim do not integrate the judicial exception into a practical application. Additional elements: "by a proprioception module ... in a memory ... in the memory" (mere instructions to apply the exception or implement the exception on a computer, e.g. computing by executing a specific module on data and storing data for use; see MPEP §2106.05(f)). The additional elements do not effectively integrate the abstract idea into a practical application. 2B: Revisiting the additional elements, they do not amount to significantly more than the judicial exception – recited at a high level of generality and corresponding to storing and retrieving information in memory and performing calculations.

With respect to claim 3: 2A Prong 1: Claim 3 recites the following judicial exceptions: converting ... each state and each action in natural language form into those in logical form; and converting ... each state and each action in logical form into those in natural language form, wherein the transition model is a logical transition model (mental process – can be performed in the human mind, or by a human using pen and paper, e.g. a person may receive and record state and action data, convert the data between forms, and use logical models for state transitions). 2A Prong 2: The additional elements recited in the claim do not integrate the judicial exception into a practical application. Additional elements: "by a semantic parser ... by a natural language generator" (mere instructions to apply the exception or implement the exception on a computer, e.g. computing by executing a specific module on data and storing data for use; see MPEP §2106.05(f)). The additional elements do not effectively integrate the abstract idea into a practical application. 2B: Revisiting the additional elements, they do not amount to significantly more than the judicial exception – recited at a high level of generality and corresponding to storing and retrieving information in memory and performing calculations.

With respect to claim 4: 2A Prong 1: Claim 4 recites the following judicial exceptions: to generate a probability value for each action in an action space based on patterns in input data, wherein the input data comprises the current state and the modified estimate of the transition model (mental process – can be performed in the human mind, or by a human using pen and paper, e.g. a person may receive and record state and action data and determine probabilities based on patterns using the current state data and adjusted estimate). 2A Prong 2: The additional elements recited in the claim do not integrate the judicial exception into a practical application. Additional elements: "wherein the model-based RL computer model is trained, via a machine learning training operation" (generally linking the use of a judicial exception to a particular technological environment or field of use, e.g. using goal-oriented learning to optimize actions; see MPEP §2106.05(h)). The additional elements do not effectively integrate the abstract idea into a practical application. 2B: Revisiting the additional elements, they do not amount to significantly more than the judicial exception – recited at a high level of generality and corresponding to storing and retrieving information and performing calculations.

With respect to claim 5: 2A Prong 1: Claim 5 recites the following judicial exceptions: executing ... on the current state and the modified estimate of the transition model to generate the next action to take in the environment; and inputting the next action to take into the environment as a command for performing a corresponding action in the environment (mental process – can be performed in the human mind, or by a human using pen and paper, e.g. a person may receive and record state and action data, compute a next action to take based on the current state and the adjusted estimate, and further use the next action as a performance command). 2A Prong 2: The additional elements recited in the claim do not integrate the judicial exception into a practical application. Additional elements: "wherein the model-based RL computer model is trained, via a machine learning training operation" (generally linking the use of a judicial exception to a particular technological environment or field of use, e.g. using goal-oriented learning to optimize actions; see MPEP §2106.05(h)). The additional elements do not effectively integrate the abstract idea into a practical application. 2B: Revisiting the additional elements, they do not amount to significantly more than the judicial exception – recited at a high level of generality and corresponding to storing and retrieving information and performing calculations.

With respect to claim 6: 2A Prong 1: Claim 6 recites the following judicial exceptions: determine weights associated with one or more action operator components of a corresponding action (mental process – can be performed in the human mind, or by a human using pen and paper, e.g. a person may calculate and apply weight values to specific action operators). 2A Prong 2: The additional elements recited in the claim do not integrate the judicial exception into a practical application. Additional elements: "wherein the model-based RL computer model comprises a logical neural network (LNN) for each action in the action space, wherein each LNN is trained through the machine learning operation" (generally linking the use of a judicial exception to a particular technological environment or field of use, e.g. using goal-oriented learning to optimize actions; see MPEP §2106.05(h)). The additional elements do not effectively integrate the abstract idea into a practical application. 2B: Revisiting the additional elements, they do not amount to significantly more than the judicial exception – recited at a high level of generality and corresponding to storing and retrieving information and performing calculations.

With respect to claim 7: 2A Prong 1: Claim 7 recites the following judicial exceptions: executing the next action to take in the environment; receiving updated observations from the environment; executing semantic parsing on the updated observations to generate an updated state of the environment that is set as the state in the pair of state and action, wherein the action in the pair of state and action is the next action to take; and repeating the estimating, modifying, and passing operations based on the updated pair of state and action (mental process – can be performed in the human mind, or by a human using pen and paper, e.g. a person may execute actions, record new observations, semantically process the new observations, compute new states and actions, and repeat the estimating, adjusting, and passing based on the new data and observations).

With respect to claim 8: 2A Prong 2: The additional elements recited in the claim do not integrate the judicial exception into a practical application. Additional elements: "wherein the environment is one of a monitored virtual environment provided by one or more computing systems, or a physical environment monitored by monitoring equipment" (mere instructions to apply the exception or implement the exception on a computer, e.g. receiving data, computing by executing a specific module on data, and storing data for use; see MPEP §2106.05(f)). The additional elements do not effectively integrate the abstract idea into a practical application. 2B: Revisiting the additional elements, they do not amount to significantly more than the judicial exception – recited at a high level of generality and corresponding to storing and retrieving information in memory and performing calculations.

With respect to claim 9: 2A Prong 2: The additional elements recited in the claim do not integrate the judicial exception into a practical application. Additional elements: "executing the next action to take in the environment, wherein the environment is a text based computer game environment or a chatbot environment, and wherein the next action to take is a natural language textual input" (mere instructions to apply the exception or implement the exception on a computer, e.g. receiving data, computing by executing a specific module on data, and storing data for use; see MPEP §2106.05(f)). The additional elements do not effectively integrate the abstract idea into a practical application. 2B: Revisiting the additional elements, they do not amount to significantly more than the judicial exception – recited at a high level of generality and corresponding to storing and retrieving information in memory and performing calculations.
With respect to claim 10: 2A Prong 2: The additional elements recited in the claim do not integrate the judicial exception into a practical application. Additional elements: "executing the next action to take in the environment, wherein the environment is a physical environment, and wherein the next action to take is a command to a robotic device to perform the next action within the physical environment" (mere instructions to apply the exception or implement the exception on a computer, e.g. receiving data, computing by executing a specific module on data, and storing data for use; see MPEP §2106.05(f)). The additional elements do not effectively integrate the abstract idea into a practical application. 2B: Revisiting the additional elements, they do not amount to significantly more than the judicial exception – recited at a high level of generality and corresponding to storing and retrieving information in memory and performing calculations.

With respect to claims 11-19: Claims 11-19 substantially correspond to claims 1-9, respectively, and are rejected under the same rationale. With respect to claim 20: Claim 20 substantially corresponds to claim 1 and is rejected under the same rationale.

2B continued: After considering all claim elements individually and as an ordered combination, it is determined that the claims do not include any additional elements that are sufficient to amount to significantly more than the judicial exception.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 2, 4, 5, 7-12, 14, 15, and 17-20 are rejected under 35 U.S.C. 103 as being unpatentable over Dalli et al., US Publication 2022/0147876 ("Dalli"), and further in view of Gordon et al., US Publication 2023/0311011 ("Gordon").

Claim 1: Dalli teaches or suggests a computer-implemented method comprising:

receiving, by a proprioception module, a previous state of an environment and a previous action taken by an agent in the environment (see para. 0036 - RL agent action histories and the construction of an SCM is rather universal and is adopted in this work for the creation of SEM equations; para. 0039 - SCM derived from an RL agent action history; para. 0104 - derived from the previously interacted environments in a novel environment; the novel environment retains some characteristics of the previously interacted environments; para. 0190 - subsequently attempts to predict the drop-off location 6060 using historical drop-off location data; utilize a suitable interface to set the drop-off location; the drop-off time is estimated via a combination of historical trip duration data 6080, the estimated trip distance (that was calculated via 6060 and 6070), and two causal variables: a consumer index 6100, and the forecasted weather conditions);

estimating, by the proprioception module, a current state by using a transition model which receives a pair of state and action and produces a next state (see para. 0013 - capable of interacting with the environment such that some function Q is approximated from a stream of data points identified by <s, a, r, s'>, where s represents the state, which is an observation from the environment, a represents the action, r represents the reward from the environment, which is a measure of how good the action is, and s' represents the next state when transitioning in the environment; para. 0023 - goal of Q-learning is to approximate the Q function and use it to infer the optimal policy; para. 0086 - an action a for the current state s, which leads to state s'; the process may be typically modelled using a Markov Decision Process (MDP) or a Partially Observable MDP (POMDP); para. 0092 - the relationship of a given state-action pair (s,a) to its successors (s',a') may be described by the Bellman equation; para. 0095 - for the current state s; para. 0179 - child or adult being potentially hit by the driverless car; para. 0190 - estimate the trip distance and provide an estimate of the fare amount 6030, which may change dynamically throughout the trip as traffic conditions change, based on local environment conditions);

passing, by the proprioception module, the current state and the ... estimate of the transition model to a model-based RL computer model for generation of a next action to take in the environment (see para. 0004 - Model-based RL may use algorithms to learn the model of the world, and then plan accordingly using the trained model; the algorithm updates the model and frequently re-plans in a planning-learning integration step; paras. 0013, 0023, 0086, 0092, and 0095, quoted above; para. 0175 - environmental feedback loop 1890 to update the simulator constraint model C 1870 dynamically and adaptively as the RL/XRL agent encounters novelties in the environment; para. 0179 - child or adult being potentially hit by the driverless car; para. 0190 - estimate the trip distance and provide an estimate of the fare amount 6030, which may change dynamically throughout the trip as traffic conditions change).

Gordon more specifically teaches or suggests modifying, by the proprioception module, an estimate of the transition model so that the modified estimate of the transition model prevents a past invalid action from recurring in a corresponding state, wherein the past invalid action taken in the corresponding state did not cause a change in state; and passing, by the proprioception module, ... the modified estimate (see para. 0044 - provide healing of the expert data, generating healed data that corrects mistakes in the expert data, and/or generating healed data that corrects the suboptimal (unsuccessful) actions/decisions made by one or more trained action prediction machine-learning model(s); para. 0048 - prediction machine-learning model that has been trained using expert data and/or with healed data, which corrects mistakes made by the simulation-based quarterback AI, in addition to mistakes due to the predictions for passes generated by the action prediction machine-learning model itself; para. 0049 - passes made by the simulation-based quarterback AI may be determined by the game engine to be suboptimal (e.g. an unsuccessful pass); para. 0050 - action prediction machine-learning to generate predictions for passes (instead of using the simulation-based quarterback AI), in which healing data may again be generated depending on the actions output by the action prediction machine-learning model; in this way, the action prediction machine-learning model may be considered to be self-correcting as it can correct for mistakes made in its own predictions; furthermore, this self-correction mechanism can enable the action prediction machine-learning model to be adaptable to changes; para. 0073 - score generated by the simulation-based AI is adjusted to indicate that the selected action was suboptimal (or unsuccessful); para. 0151 - prediction generated by the action prediction machine-learning model 704 is adjusted to indicate that the selected action was suboptimal (or unsuccessful)).

Accordingly, it would have been obvious to one having ordinary skill before the effective filing date of the claimed invention to modify the system and method taught in Dalli to include modifying, by the proprioception module, an estimate of the transition model so that the modified estimate of the transition model prevents a past invalid action from recurring in a corresponding state, wherein the past invalid action taken in the corresponding state did not cause a change in state, and passing, by the proprioception module, the modified estimate, for the purpose of efficiently adjusting elements associated with a model to correct or prevent mistakes based on past action determinations, improving model performance, as taught by Gordon (paras. 0044 and 0050).

Claims 11 and 20: Claims 11 and 20 correspond to claim 1, and thus Dalli and Gordon teach or suggest the limitations of claims 11 and 20 as well.
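
For reference, the Bellman relation the examiner repeatedly cites from Dalli (paras. 0092, 0094) is, in its standard Q-learning form (the OA does not reproduce Dalli's exact notation, so this is the textbook statement):

    Q(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \max_{a'} Q(s', a') \right]

where p(s', r | s, a) is the conditional probability of transitioning to next state s' with reward r given state s and action a, and gamma is the discount factor.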
Claim 2: Dalli further teaches or suggests storing, by the proprioception module, received past state-action pairs in a memory; and ... state-action pairs in the memory, and labeling, by the proprioception module (see paras. 0004, 0013, 0023, 0086, 0092, 0095, 0175, 0179, and 0190, quoted above with respect to claim 1; para. 0146 - record of states, actions, and rewards stored in a sequential manner; the dataset may be created by the agent during the training of the RL system; para. 0225 - one or more associated labels which may indicate the output value or classification for a specific data point or a continuous or non-continuous interval of data points). Gordon more specifically teaches or suggests labeling state-action pairs as valid if a change in the next state was induced and invalid if no change was observed (see paras. 0044, 0048-0050, 0073, and 0151, quoted above with respect to claim 1). Accordingly, it would have been obvious to one having ordinary skill before the effective filing date of the claimed invention to modify the system and method taught in Dalli to include labeling state-action pairs as valid if a change in the next state was induced and invalid if no change was observed, for the purpose of efficiently adjusting elements associated with a model to correct or prevent mistakes based on past action determinations, improving model performance, as taught by Gordon (paras. 0044 and 0050).

Claim 12: Claim 12 corresponds to claim 2, and thus Dalli and Gordon teach or suggest the limitations of claim 12 as well.

Claim 4: Dalli further teaches or suggests wherein the model-based RL computer model is trained, via a machine learning training operation, to generate a probability value for each action in an action space based on patterns in input data, wherein the input data comprises the current state and the ... estimate of the transition model (see paras. 0004 and 0146, quoted above; para. 0013 - goal of Q-Learning is to train an agent which is capable of interacting with the environment such that some function Q is approximated from a stream of data points identified by <s, a, r, s'>, quoted above; para. 0023 - neural network models the probability for each action where its objective is to maximize the total future expected rewards; para. 0094 - where p(s', r | s, a) defines the conditional probability of a reward given for a state and action; para. 0187 - conditional and non-conditional probability distributions, joint probability distributions). Gordon further teaches or suggests the modified estimate (see paras. 0044, 0048-0050, 0073, and 0151, quoted above with respect to claim 1). Accordingly, it would have been obvious to one having ordinary skill before the effective filing date of the claimed invention to modify the system and method taught in Dalli to include the modified estimate, for the purpose of efficiently adjusting elements associated with a model to correct or prevent mistakes based on past action determinations, improving model performance, as taught by Gordon (paras. 0044 and 0050).

Claim 14: Claim 14 corresponds to claim 4, and thus Dalli and Gordon teach or suggest the limitations of claim 14 as well.

Claim 5: Dalli further teaches or suggests executing the model-based RL computer model on the current state and the ... estimate of the transition model to generate the next action to take in the environment; and inputting the next action to take into the environment as a command for performing a corresponding action in the environment (see paras. 0004, 0013, 0023, 0086, 0092, 0095, and 0175, quoted above; para. 0179 - take an appropriate safety related action, such as to switch the car to a safer driving mode, slow down, or turn on some higher resolution sensor to better resolve the future possibilities prior to them actually occurring; para. 0190 - XRL embedded controller application for an autonomous vehicle that provides paid taxi services; the agent may thus make independent decisions and optimizations based on local environment conditions in combination with centralized decisions and optimizations based on a regional (set of local environments) or global environment). Gordon further teaches or suggests the modified estimate (see paras. 0044, 0048-0050, 0073, and 0151, quoted above with respect to claim 1). Accordingly, it would have been obvious to one having ordinary skill before the effective filing date of the claimed invention to modify the system and method taught in Dalli to include the modified estimate, for the purpose of efficiently adjusting elements associated with a model to correct or prevent mistakes based on past action determinations, improving model performance, as taught by Gordon (paras. 0044 and 0050).

Claim 15: Claim 15 corresponds to claim 5, and thus Dalli and Gordon teach or suggest the limitations of claim 15 as well.

Claim 7: Dalli further teaches or suggests executing the next action to take in the environment; receiving updated observations from the environment; executing ... on the updated observations to generate an updated state of the environment that is set as the state in the pair of state and action, wherein the action in the pair of state and action is the next action to take; and repeating the estimating ... and passing operations based on the updated pair of state and action (see para. 0009 - state 206 of the RL agent may be a numeric representation, which is either discrete or continuous, of what an agent observes at a particular time in the environment; para. 0013, quoted above; para. 0102 - RL agents are capable of representing simulatable models that essentially consist of observed and predicted state pairs; the observed space is the normal RL state space together with the observations arriving from the environment and any other applicable external entities or objects; para. 0175, quoted above; para. 0179 - when the safety control system in the driverless car detects the anticipated references, it may take an appropriate safety related action, such as to switch the car to a safer driving mode, slow down, or turn on some higher resolution sensor to better resolve the future possibilities prior to them actually occurring; para. 0190 - receiving information from the environment via real-time traffic data 6090 received from an Intelligent Transport System (ITS) and from its onboard taxi meter 6140 that is receiving internal sensor data combined with Global Navigational Satellite System (GNSS) data; estimate the trip distance and provide an estimate of the fare amount 6030, which may change dynamically throughout the trip as traffic conditions change; para. 0220 - planner may dynamically re-plan the plan as the XRL agent progresses along in its environmental interactions, using a suitable prediction-error or predictive coding-based monitoring and control system; para. 0225 - XRL may be used in conjunction with AI models that process sequence data; sequence data may include a number of data points which contain feature data in various sequential formats including, but not limited to, a combination of one or more of ... natural language text ... one or more associated labels which may indicate the output value or classification for a specific data point or a continuous or non-continuous interval of data points).

Claim 17: Claim 17 corresponds to claim 7, and thus Dalli and Gordon teach or suggest the limitations of claim 17 as well.

Claim 8: Dalli further teaches or suggests wherein the environment is one of a monitored virtual environment provided by one or more computing systems, or a physical environment monitored by monitoring equipment (see para. 0084 - combination of events and actions with the environment may be considered a simulation; realistic simulation may include collecting the right amount of sensor/event data in order to take the optimal action; para. 0086 - RL agent may have a simulation of the environment used in the action selection process that may be hidden or explicit; para. 0175 - C may implement a combination of pre-defined knowledge priors, physical world models such as physics simulation models; environmental feedback loop 1890 to update the simulator constraint model C 1870 dynamically and adaptively as the RL/XRL agent encounters novelties in the environment; para. 0176 - agent has four main methods of improving on its initial configuration to gain adaptivity: (i.) via self-training against simulation; (ii.) self-training using interventional actions against simulation; (iii.) active learning against the live environment; (iv.) active learning using interventional actions against the live environment; para. 0179 - resolution sensor to better resolve the future possibilities prior to them actually occurring; para. 0190 - receiving information from the environment via real-time traffic data 6090 received from an Intelligent Transport System (ITS) and from its onboard taxi meter 6140 that is receiving internal sensor data combined with Global Navigational Satellite System (GNSS) data; para. 0225 - sensor data, image data; para. 0227 - to generate realistic environmental simulations).

Claim 18: Claim 18 corresponds to claim 8, and thus Dalli and Gordon teach or suggest the limitations of claim 18 as well.
Claim 9: Gordon further teaches or suggests executing the next action to take in the environment, wherein the environment is a text based computer game environment or a chatbot environment, and wherein the next action to take is a natural language textual input (see para. 0180 - applied for use with any type of one or more video game entities in any type of video game as the application demands; the updated and/or trained action prediction machine-learning model may be deployed within a video game for use in controlling a video game entity (e.g. a non-player character, vehicle or any other controllable entity) of a video game during game play; supplemental information may be used to suggest or recommend an action that the video game entity may take by the human player; these predictions may be provided or displayed via the video game entity controlled by each human player for highlighting or recommending, without limitation, the best or optimal action the video game entity may perform and/or provide options such as a ranking of the most successful/optimal actions a human player may select for the video game entity to perform; in other embodiments, these predictions may be provided or displayed via a user interface of the video game to a player controlling the video game entity). Accordingly, it would have been obvious to one having ordinary skill before the effective filing date of the claimed invention to modify the system and method taught in Dalli to include executing the next action to take in the environment, wherein the environment is a text based computer game environment or a chatbot environment, and wherein the next action to take is a natural language textual input, for the purpose of efficiently adjusting elements associated with a model to correct or prevent mistakes based on past action determinations, improving model performance, as taught by Gordon (paras. 0044 and 0050).

Claim 19: Claim 19 corresponds to claim 9, and thus Dalli and Gordon teach or suggest the limitations of claim 19 as well.

Claim 10: Dalli further teaches or suggests executing the next action to take in the environment, wherein the environment is a physical environment, and wherein the next action to take is a command to a robotic device to perform the next action within the physical environment (see paras. 0084, 0086, 0175, 0176, 0179, 0190, 0225, and 0227, quoted above with respect to claim 8).

Claims 3 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Dalli, in view of Gordon, in view of Valencia et al., US Publication 2020/0191896 ("Valencia"), and further in view of Tang et al., US Publication 2021/0319344 ("Tang").

Claim 3: Dalli further teaches or suggests wherein the transition model is a logical transition model (see para. 0149 - XRL agent may implement a combination of workflows, process flows, process descriptions, state-transition charts, Petri networks, electronic circuits, logic gates; para. 0175 - may be a symbolic model that generates simulated or synthetic data via a set of rules, a formal language, neuro-symbolic model, logical model, or other form of symbolic model; C may implement symbolic information as a set of rules, a formal language, neuro-symbolic model, logical model, or other form of symbolic model; environmental feedback loop 1890 to update the simulator constraint model C 1870 dynamically and adaptively as the RL/XRL agent encounters novelties in the environment).

Valencia further teaches or suggests converting, by a natural language generator, each state and each action in logical form into those in natural language form (see para. 0040 - the action includes at least one of: using a tag output device to at least one of: at least partially control equipment; at least partially control the object; and generate a notification; and transferring an action message to a client device, the client device being responsive to the action message to at least one of: generate a notification; perform an action; para. 0131 - parsing using natural language processing to identify logic expressions and object types; an object identifier is then determined for each object type, either by retrieving these based on an object type of the object or by generating these as needed; the logic expressions are then used to generate the object rules by converting the logic expressions into a trigger event and an action). Accordingly, it would have been obvious to one having ordinary skill before the effective filing date of the claimed invention to modify the system and method taught in Dalli to include converting, by a natural language generator, each state and each action in logical form into those in natural language form, for the purpose of efficiently determining triggering events and actions based on logic expressions, improving object operation, as taught by Valencia (paras. 0040 and 0131).

Tang further teaches or suggests converting, by a semantic parser, each state and each action in natural language form into those in logical form (see Abstract and para. 0002 - a question in a natural language conversation is received and converted into a logical representation corresponding to the semantics of the question, the logical representation including a first sequence of actions executable on a knowledge base; understand the semantics of a question in a multi-round conversation, so as to convert the questions into a sequence of actions executable on a large-scale knowledge base; in this way, the solution can effectively improve accuracy and efficiency; para. 0020 - a question in a natural language multi-round conversation is converted, by a trained neural network model, into a logical representation of semantics corresponding to the question, the logical representation including a sequence of actions executable on a large-scale knowledge base; the model executes semantic parsing on questions; para. 0029 - convert the question 270 into a logical representation corresponding to the semantics of the question 270, where the logical representation can include a sequence of actions executable on a large-scale knowledge base). Accordingly, it would have been obvious to one having ordinary skill before the effective filing date of the claimed invention to modify the system and method taught in Dalli to include converting, by a semantic parser, each state and each action in natural language form into those in logical form, for the purpose of efficiently determining a sequence of actions to execute, improving model accuracy and efficiency, as taught by Tang (paras. 0002 and 0029).

Claim 13: Claim 13 corresponds to claim 3, and thus Dalli, Gordon, Valencia, and Tang teach or suggest the limitations of claim 13 as well.

Claims 6 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Dalli, in view of Gordon, and further in view of Jung et al., US Publication 2021/0232857 ("Jung").

Claim 6: Dalli further teaches or suggests wherein the model-based RL computer model comprises a ... neural network ... for each action in the action space, and wherein each ... is trained through the machine learning operation to determine weights associated with one or more action operator components of a corresponding action (see Figs. 1, 6, 12; para. 0009 - state 206 of the RL agent may be a numeric representation, which is either discrete or continuous, of what an agent observes at a particular time in the environment; the set of all possible states is called the state space; para. 0010 - action 208 is the input provided to the environment, calculated by applying a policy to the current state; this may be discrete or continuous; the set of all possible actions is called the action space; para. 0012 - using a Deep Learning Network 214 or other form of deep learning architecture or other suitable form of connectionist architecture, as illustrated in FIG. 1, or using some other suitable method; para. 0013, quoted above; para. 0019 - Deep RL (DRL) is the combination of RL and neural networks; deep learning may be used to learn to approximate a function which compresses the raw input features to high-level features; para. 0022 - during each update, one set of weights is used to determine the greedy policy and the other to determine its value; other variants of Deep QL, known as Double Deep Q-Network (DQN), include using two neural networks to perform the Bellman iteration, one for generating the prediction and another for generating the target; para. 0052 - XNN/INNs may utilize and are fully compatible with all current DL libraries and architectures, allowing them to take advantage of all performance advancements available for DLs; XNN/INN architecture may be combined into a bigger deep learning system; for example, they may be integrated into an existing Q-learning function which is based on deep learning; para. 0101 - current values within the DXQL agent's Q-learning weights).

Jung more specifically teaches or suggests a logical neural network (LNN) ... LNN (see para. 0010 - device configured to apply a logical constraint to a neural network model is provided; para. 0044 - apply a logical constraint to a neural network model, so long as the device is configured to perform each stage of the controlling method as described below; para. 0073 - training of the neural network model 1000 based on the score information refers to the first feature value being updated to include the logically certain information rather than the uncertain information included in the first feature value by updating a parameter (weight value) of the neural network model; para. 0076 - the neural network model 1000 may provide a result matching the logic rule with respect to the problem of whether following the logical inference is appropriate; accordingly, reliability of the neural network model 1000 may be raised, and the use range of the neural network model 1000 may be expanded; para. 0160 - learning algorithm may be a method which uses a plurality of learning data to train a predetermined subject machine (e.g., robot) to determine or predict on its own, or a reinforcement learning). Accordingly, it would have been obvious to one having ordinary skill before the effective filing date of the claimed invention to modify the system and method taught in Dalli to include a logical neural network (LNN), for the purpose of efficiently applying logic to a neural network and providing results matching the logic, expanding the use range and reliability of a neural network, as taught by Jung (paras. 0076 and 0160).

Claim 16: Claim 16 corresponds to claim 6, and thus Dalli, Gordon, and Jung teach or suggest the limitations of claim 16 as well.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Andrew T. McIntosh, whose telephone number is (571) 270-7790. The examiner can normally be reached M-Th 8:00am-5:30pm.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Tamara Kyle, can be reached at 571-272-4241. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/ANDREW T MCINTOSH/
Primary Examiner, Art Unit 2144
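
To make the disputed "modified estimate" limitation concrete, here is a minimal Python sketch of the claim-1 loop as the Office Action characterizes it. All names are hypothetical illustrations of the claim language only, not IBM's implementation and not anything disclosed in Dalli or Gordon:

    # Hypothetical sketch of the claim-1 "proprioception module" step.
    def proprioception_step(transition_model, prev_state, prev_action, invalid_log):
        # Estimate the current state from the (state, action) pair.
        current_state = transition_model(prev_state, prev_action)

        # An action that caused no change in state is recorded as invalid.
        if current_state == prev_state:
            invalid_log.add((prev_state, prev_action))

        # Modified estimate: mask known-invalid actions so a past invalid
        # action cannot recur in the corresponding state.
        def modified_model(state, action):
            if (state, action) in invalid_log:
                return None  # action blocked in this state
            return transition_model(state, action)

        # Pass the current state and modified estimate onward for the
        # model-based RL policy to generate the next action.
        return current_state, modified_model

    # Toy usage with integer states and an additive transition model:
    log = set()
    state, model = proprioception_step(lambda s, a: s + a, 0, 0, log)
    assert state == 0 and model(0, 0) is None  # (0, 0) is now masked
    assert model(1, 1) == 2                    # other transitions unchanged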

Prosecution Timeline

Dec 02, 2022: Application Filed
Dec 11, 2025: Non-Final Rejection (§101, §103)
Mar 19, 2026: Applicant Interview (Telephonic)
Mar 19, 2026: Examiner Interview Summary

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602534
Method and System to Display Content from a PDF Document on a Small Screen
2y 5m to grant • Granted Apr 14, 2026
Patent 12596757
NATIVE INTEGRATION OF ARBITRARY DATA SOURCES
2y 5m to grant • Granted Apr 07, 2026
Patent 12572617
SYSTEM AND METHOD FOR THE GENERATION AND EDITING OF TEXT CONTENT IN WEBSITE BUILDING SYSTEMS
2y 5m to grant • Granted Mar 10, 2026
Patent 12561191
TRAINING METHOD AND APPARATUS FOR FAULT RECOGNITION MODEL, FAULT RECOGNITION METHOD AND APPARATUS, AND ELECTRONIC DEVICE
2y 5m to grant • Granted Feb 24, 2026
Patent 12547874
DEPLOYING PARALLELIZABLE DEEP LEARNING MODELS BY ADAPTING TO THE COMPUTING DEVICES
2y 5m to grant • Granted Feb 10, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 77% (95% with interview, +18.0%)
Median Time to Grant: 3y 0m
PTA Risk: Low
Based on 511 resolved cases by this examiner. Grant probability derived from career allow rate.
