Prosecution Insights
Last updated: April 19, 2026
Application No. 17/523,553

REINFORCEMENT LEARNING WITH INDUCTIVE LOGIC PROGRAMMING

Final Rejection — §102, §103
Filed: Nov 10, 2021
Examiner: MAHARAJ, DEVIKA S
Art Unit: 2123
Tech Center: 2100 — Computer Architecture & Software
Assignee: International Business Machines Corporation
OA Round: 2 (Final)
Grant Probability: 55% (Moderate)
OA Rounds: 3-4
To Grant: 5y 0m
With Interview: 63%

Examiner Intelligence

Career Allow Rate: 55% (43 granted / 78 resolved, at TC average)
Interview Lift: +7.7% (moderate), based on resolved cases with interview
Typical Timeline: 5y 0m avg prosecution; 28 currently pending
Career History: 106 total applications across all art units

Statute-Specific Performance

§101: 27.4% (-12.6% vs TC avg)
§103: 42.8% (+2.8% vs TC avg)
§102: 10.1% (-29.9% vs TC avg)
§112: 16.6% (-23.4% vs TC avg)
Compared to the estimated Tech Center average. Based on career data from 78 resolved cases.

Office Action

§102, §103
DETAILED ACTION

1. This communication is in response to the arguments/remarks filed on December 19, 2025 for Application No. 17/523,553 in which Claims 1-20 are presented for examination.

Notice of Pre-AIA or AIA Status

2. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement

3. The information disclosure statement submitted on 09/22/2025 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Response to Arguments

4. The arguments/remarks filed on December 19, 2025 have been considered. No claims have been amended. Thus, Claims 1-20 are pending and presented for examination.

5. Applicant's arguments filed December 19, 2025 did not include a further response regarding the 35 U.S.C. 112(f) interpretation; therefore, the 35 U.S.C. 112(f) interpretation is maintained.

6. Applicant's arguments filed December 19, 2025 with respect to the 35 U.S.C. 101 rejection have been fully considered and are persuasive. Thus, the 35 U.S.C. 101 rejection has been withdrawn.

7. Applicant's arguments filed December 19, 2025 with respect to the 35 U.S.C. 102/103 rejections have been fully considered but they are not persuasive.

Applicant's Arguments on Pg. 9 of Arguments/Remarks state: [argument reproduced as an image in the original Remarks]

Examiner respectfully disagrees. The term "reward logic neural network", as recited by the Independent claims, does not have a plain meaning. Furthermore, the term "logic neural network" (LNN) is merely recited within Applicant's specification and does not refer to any specific standard definition and/or meaning. To reiterate, Applicant's specification even states in Par. [0046]: "In general, an LNN may be implemented as a form of recurrent neural network with a one-to-one correspondence to logical formulae in a system of weighted, real-valued logic. Evaluation of the LNN performs a logical inference." Hence, according to Applicant's specification, the LNN may be interpreted as a form of a recurrent neural network and the evaluation of the LNN performs a logical inference. Therefore, Examiner's interpretation of the "reward logic neural network" comprising the recurrent neural network of Smolyanskiy (supported by Par. [0074] of Smolyanskiy) is valid, as the recurrent neural network performs the same operations as Applicant's claimed "reward logic neural network" and performs a logical inference, as supported by Applicant's specification. Examiner asserts that if Applicant is instead referring to a particular type of neural network (i.e., logical neural network including particular logic gates), this should be reflected properly within the claim language, as the currently drafted claims recite a "reward logic neural network" and "safety logic neural network" which are merely labels and not equivalent to a logical neural network, contrary to Applicant's arguments above.

Applicant's Arguments on Pg. 10 of Arguments/Remarks state: [argument reproduced as an image in the original Remarks]

Examiner respectfully disagrees. Applicant's claim language states that the plurality of MDPs are learned broadly "using reinforcement learning in respective training environments"; further, Smolyanskiy Par. [0031] states that a model-based reinforcement learning framework is used and formulated as a Markov Decision Process with the DNN as the transition model. This framework is utilized to address rare and/or unsafe events that may not be properly captured in the training data (i.e., vehicle cut-ins, harsh braking leading to collisions, etc.). Therefore, the model-based reinforcement learning framework may be broadly used in different respective training environments (for evaluating different types of rare and/or unsafe events), hence a plurality of MDPs would be learned, with each MDP associated with a different version of the DNN during training to handle these rare/unseen/unsafe events.

Further, Applicant states above that Schmidt does not disclose the limitations of Claim 2, particularly that Schmidt does not disclose "logic rules" at all. This is not true. Schmidt Pg. 1 clearly states "This paper provides a pipeline that builds on this technique to create policies that are both safe and interpretable. The pipeline trains a non-interpretable RL agent for safe behavior, modifying existing reward structures and training techniques. It then extracts a set of rules approximating the policy in the form of a decision tree." This is just one example of such a recitation of extracting rules within the disclosure of Schmidt.

Applicant's Arguments on Pgs. 10-11 of Arguments/Remarks state: [arguments reproduced as two images in the original Remarks]

Examiner respectfully disagrees for substantially the same reasons as stated above. Further, regarding Applicant's argument that Smolyanskiy does not produce a "safety score", Examiner points to Smolyanskiy Par. [0118] which states that the value function MLM may be used to generate predictions of one or more scores of a value function; these scores may correspond to one or more driving policies which relate directly to safety (See Smolyanskiy Par. [0031-0032] which describes how the system determines a proposed action is safe/unsafe). Examiner also asserts that the term "safety score" is recited broadly and not given a clear definition within the context of the claim limitations, simply that the safety LNN "generates a safety score based on the state-action pair" without significantly more. Applicant should consider amending the claims further to clarify the "safety score" and avoid broad interpretation.

Applicant's arguments fail to comply with 37 CFR 1.111(b) because they amount to a general allegation that the claims define a patentable invention without specifically pointing out how the language of the claims patentably distinguishes them from the references. Thus, the 35 U.S.C. 102/103 rejections are maintained.
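Editorial note: the dispute above turns on what the specification means by a "logic neural network," i.e., a recurrent network whose neurons correspond one-to-one to logical formulae in weighted, real-valued logic (Par. [0046] as quoted). The sketch below is a minimal illustration of one such neuron; the clamped Lukasiewicz-style weighted conjunction, the weights, and the proposition names are assumptions for illustration and are not taken from the application or the cited references.

def weighted_and(truth_values, weights, beta=1.0):
    # One LNN-style neuron standing in for a logical AND over fuzzy truth
    # values in [0, 1]; the clamped Lukasiewicz-style weighted sum is an
    # assumed activation, used only to illustrate "weighted, real-valued logic".
    s = beta - sum(w * (1.0 - x) for w, x in zip(weights, truth_values))
    return max(0.0, min(1.0, s))

# Hypothetical propositions about a state-action pair, e.g. "lane is clear"
# and "speed is below limit", each evaluated to a fuzzy truth value.
facts = [0.9, 0.7]
score = weighted_and(facts, weights=[1.0, 1.0])
print(score)  # ~0.6, a truth value that could serve as a reward or safety score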
Claim Interpretation

8. The following is a quotation of 35 U.S.C. 112(f):

(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

9. The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is invoked.

As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph:

(A) the claim limitation uses the term "means" or "step" or a term used as a substitute for "means" that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function;

(B) the term "means" or "step" or the generic placeholder is modified by functional language, typically, but not always linked by the transition word "for" (e.g., "means for") or another linking word or phrase, such as "configured to" or "so that"; and

(C) the term "means" or "step" or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function.

Use of the word "means" (or "step") in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function.

Absence of the word "means" (or "step") in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function.

Claim limitations in this application that use the word "means" (or "step") are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word "means" (or "step") are not being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.

10. This application includes one or more claim limitations that do not use the word "means," but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Such claim limitation(s) is/are: "driving system" in Claims 14-20.

Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof. If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph.
Claim Rejections - 35 USC § 102

11. The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless – (a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

12. Claims 7-20 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Smolyanskiy et al. (hereinafter Smolyanskiy) (US PG-PUB 20220138568).

Regarding Claim 7, Smolyanskiy teaches a computer-implemented method for automated motion (Smolyanskiy, Abstract, "In various examples, reinforcement learning is used to train at least one machine learning model (MLM) to control a vehicle by leveraging a deep neural network (DNN) trained on real-world data by using imitation learning to predict movements of one or more actors to define a world model.", therefore, a method for automated motion of a vehicle is disclosed), comprising:

determining a state of an environment using a sensor on a vehicle (Smolyanskiy, Par. [0025], "In one or more embodiments, a set of training data used to train the DNN may be generated using one or more real-world vehicle sensors indicative of movements of the one or more actors in the environment over a period of time. The set of training data may be generated using any combination of various types of vehicle sensors (such as cameras, LiDAR sensors, radar sensors, etc.) from various different vehicles in numerous different real-world situations. Traffic is difficult to model, so real-world sensor data may be used to train the DNN to learn how the actors will change.", therefore, real-world vehicle sensors are used to determine a state of an environment);

determining a proposed action, based on the state, using a reward logic neural network (LNN) that generates a reward score based on a state-action pair (Smolyanskiy, Par. [0036], "The policy MLM 106 may use data corresponding to the output from the prediction MLM 104 to generate one or more predictions corresponding to one or more actions for a vehicle, such as the vehicle 1100.", therefore, the policy MLM functions as the reward logic neural network (the policy MLM may be implemented as a recurrent neural network per Smolyanskiy Par. [0074]; this is supported by Applicant's dependent claim 13, which states that the logic neural networks may be implemented as recurrent neural networks) which is able to determine a proposed action based on an inputted state. Furthermore, Smolyanskiy Par. [0029] states "In one or more embodiments, reinforcement learning used to train a policy network may apply a value function, which may be evaluated based at least on one or more of the states of the environment predicted by the DNN, to assign a score(s) to the predictions made by the MLM. For example, rewards may be associated with one or more goals of the policy network and penalties may be associated with collisions or other predicted or inferred states of the network."; hence, a reward score is also generated based on the state-action pair);

determining that the proposed action is safe, using a safety LNN that generates a safety score based on the state-action pair (Smolyanskiy, Par. [0036], "The value function MLM 108 may also use data corresponding to the output from the prediction MLM 104 to generate one or more predictions corresponding to one or more scores for the one or more actions predicted using the policy MLM", therefore, the value function MLM functions as the safety LNN (the value function MLM may be implemented as a recurrent neural network per Smolyanskiy Par. [0074]; this is supported by Applicant's dependent claim 13, which states that the logic neural networks may be implemented as recurrent neural networks) which is able to generate a score based on the state-action pair which is outputted by the policy MLM. See Smolyanskiy Par. [0032] which describes how the system also determines that a proposed action is safe/unsafe & Par. [0057] which describes the use of a "trajectory scorer" to score predicted actions); and

automatically performing the proposed action on the vehicle (Smolyanskiy, Par. [0058], "The controller 156 may cause control of the vehicle 1100 in accordance with a select and/or optimized path from the optimizer 166. In some embodiments, the controller 156 may directly control the actions of the vehicle 1100, such as accelerating, braking, turning, etc. For example, the controller 156 may control a brake actuator 1148, a propulsion system 1150 and/or throttle 1152, a steering system 1154 and/or steering actuator 1156, and/or other components of the vehicle 1100 (such as illustrated in FIG. 11A). In other embodiments, the controller 156 may indirectly control the actions of the vehicle 1100, such as by sending a message or instruction to another system of the vehicle 1100.", thus, the proposed action may be automatically performed on the vehicle).

Regarding Claim 8, Smolyanskiy teaches the method of claim 7, wherein determining that the proposed action is safe includes comparing the safety score to a threshold (Smolyanskiy, Par. [0098], "The behavior planner may determine the feasibility of basic behaviors of the vehicle 1100, such as staying in the lane or changing lanes left or right, so that the feasible behaviors may be matched up with the most desired behaviors output from the lane planner. For example, if the desired behavior is determined to not be safe and/or available, a default behavior may be selected instead (e.g., default behavior may be to stay in lane when desired behavior or changing lanes is not safe).", therefore, the safety score may be compared to a desired behavior (threshold) to determine if the proposed action is safe/unsafe).

Regarding Claim 9, Smolyanskiy teaches the method of claim 7, further comprising determining a first action, before determining the proposed action, having a higher reward score than the reward score of the proposed action (Smolyanskiy, Par. [0027], "RL techniques can be used with this simulation to train the policy network according to one or more policies. For example, a policy network may learn how to plan the best action(s) for the vehicle given a goal(s) (e.g., as an input or implicit in training) and traffic information encoded by the DNN. Examples of goals may include reaching a certain destination, following another vehicle, etc. When a motion planner that leverages predictions made by the policy network chooses one or more actions, it may interact with the traffic model and the predicted future trajectories (e.g., future traffic motions for other actors) and may change or update the actions accordingly. This way possibilities in the simulator can be replayed for all possible futures that depend on the actions of the planner, and the policy network can be trained to reach optimal states (and/or avoid bad states) given the goal.", therefore, a first action may have a higher reward score than the reward score of the proposed/predicted action, as the first action may have rewards associated with one or more goals of the policy network, including goals such as reaching a certain destination, following a certain vehicle, etc.).

Regarding Claim 10, Smolyanskiy teaches the method of claim 9, further comprising determining that the first action has a safety score below the threshold before determining the proposed action (Smolyanskiy, Par. [0027], "RL techniques can be used with this simulation to train the policy network according to one or more policies. For example, a policy network may learn how to plan the best action(s) for the vehicle given a goal(s) (e.g., as an input or implicit in training) and traffic information encoded by the DNN. Examples of goals may include reaching a certain destination, following another vehicle, etc. When a motion planner that leverages predictions made by the policy network chooses one or more actions, it may interact with the traffic model and the predicted future trajectories (e.g., future traffic motions for other actors) and may change or update the actions accordingly. This way possibilities in the simulator can be replayed for all possible futures that depend on the actions of the planner, and the policy network can be trained to reach optimal states (and/or avoid bad states) given the goal." & Par. [0037], "In one or more embodiments, the training engine 112 may score the performance of the policy MLM 106 (which in at least one embodiment may be based upon goal data 114 provided from the training engine 112) and/or the value function ML 108 based at least on output data generated by one or more of those MLMs. Based on the scoring, the policy MLM 106 and/or the value function MLM 108 may be updated or revised (e.g., using backpropagation and/or other suitable MLM training techniques).", therefore, the first action may also be evaluated by the system to compare the score to the goals/desired behavior (threshold), and if the score is below the threshold, the corresponding machine learning models are revised and/or updated to provide better accuracy for future predictions/proposed actions).

Regarding Claim 11, Smolyanskiy teaches the method of claim 10, wherein determining that the first action has a safety score below the threshold includes identifying a minimum safety score from a plurality of scenarios and comparing the minimum safety score to the threshold (Smolyanskiy, Par. [0050], "For example, the training engine 112 may determine, from one or more subsequent state of the environment, a collision of the ego-vehicle (and/or other actors) in the environment. The one or more scores of the value function calculated by the training engine 112 may be based upon the one or more events, e.g., based at least on the determining of the collision (e.g., the policy MLM 106 may be penalized based on determining a collision would occur or is more likely to occur). In one or more embodiments, an event may correspond to or represent a goal to be achieved by the policy MLM 106, as described herein (e.g., reach a location, achieve a world state or attribute, etc.).", therefore, the determining of the safety score includes consideration of a plurality of scenarios and comparing said scenarios to a desired behavior/goal (threshold)).

Regarding Claim 12, Smolyanskiy teaches the method of claim 11, wherein the plurality of scenarios each correspond to a distinct environment used in training the reward LNN and the logic LNN (Smolyanskiy, Par. [0050], "The one or more scores of the value function calculated by the training engine 112 may be based upon the one or more events, e.g., based at least on the determining of the collision (e.g., the policy MLM 106 may be penalized based on determining a collision would occur or is more likely to occur).", therefore, the plurality of scenarios may correspond to a distinct one or more events used in training the machine learning models/networks).

Regarding Claim 13, Smolyanskiy teaches the method of claim 7, wherein the reward LNN and the safety LNN are implemented as recurrent neural networks, with neurons representing logical operations and unique propositions (Smolyanskiy, Par. [0074], "In some embodiments, the DNN 416 may include a recurrent neural network (RNN).", therefore, the neural networks may be implemented as recurrent neural networks with neurons representing unique logical operations).

Regarding Claim 14, Smolyanskiy teaches a system for automated motion (Smolyanskiy, Abstract, "In various examples, reinforcement learning is used to train at least one machine learning model (MLM) to control a vehicle by leveraging a deep neural network (DNN) trained on real-world data by using imitation learning to predict movements of one or more actors to define a world model.", therefore, a system for automated motion of a vehicle is disclosed), comprising: a sensor that collects state information about an environment (Smolyanskiy, Par. [0025], "The set of training data may be generated using any combination of various types of vehicle sensors (such as cameras, LiDAR sensors, radar sensors, etc.) from various different vehicles in numerous different real-world situations", therefore, a sensor is disclosed for collecting state information about an environment); a driving system that performs actions in a vehicle (Smolyanskiy, Figure 11c, which depicts an example autonomous vehicle/system that performs actions in a vehicle); a hardware processor; a memory that stores a computer program, which, when executed by the hardware processor (Smolyanskiy, Claim 16, "A system comprising: one or more processing units; one or more memory units storing instructions that, when executed by the one or more processing units, cause the one or more processing units to execute operations comprising: […]", thus, a hardware processor and memory are disclosed), causes the hardware processor to: […] The rest of the claim language in Claim 14 recites substantially the same limitations as Claim 7, in the form of a system, therefore it is rejected under the same rationale.
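Editorial note: the control flow the examiner maps onto Smolyanskiy for Claims 7-8 can be summarized as: propose an action from a reward model, score the state-action pair with a safety model, and keep the proposal only if the score clears a threshold, otherwise use a default behavior. The sketch below is a minimal illustration of that flow; the function names, stand-in models, and the threshold value are hypothetical and are not drawn from the application or from Smolyanskiy.

SAFETY_THRESHOLD = 0.8  # assumed value, for illustration only

def select_action(state, reward_model, safety_model, default_action="stay_in_lane"):
    # Propose the highest-reward action, then gate it on a safety score for
    # the state-action pair (the structure recited in Claims 7-8).
    proposed = reward_model(state)
    safety_score = safety_model(state, proposed)
    if safety_score >= SAFETY_THRESHOLD:
        return proposed
    return default_action  # fall back when the proposal is not safe enough

# Toy stand-ins for the models, purely to show the call pattern:
action = select_action(
    state={"lane_clear": False},
    reward_model=lambda s: "change_lane_left",
    safety_model=lambda s, a: 0.95 if s["lane_clear"] else 0.30,
)
print(action)  # -> stay_in_lane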
Claim 15 recites substantially the same limitations as Claim 8 in the form of a system, therefore it is rejected under the same rationale.

Claim 16 recites substantially the same limitations as Claim 9 in the form of a system, therefore it is rejected under the same rationale.

Claim 17 recites substantially the same limitations as Claim 10 in the form of a system, therefore it is rejected under the same rationale.

Claim 18 recites substantially the same limitations as Claim 11 in the form of a system, therefore it is rejected under the same rationale.

Claim 19 recites substantially the same limitations as Claim 12 in the form of a system, therefore it is rejected under the same rationale.

Claim 20 recites substantially the same limitations as Claim 13 in the form of a system, therefore it is rejected under the same rationale.

Claim Rejections - 35 USC § 103

13. The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

14. Claims 1-6 are rejected under 35 U.S.C. 103 as being unpatentable over Smolyanskiy et al. (hereinafter Smolyanskiy) (US PG-PUB 20220138568), in view of Schmidt et al. (hereinafter Schmidt) ("Can You Trust Your Autonomous Car? Interpretable and Verifiably Safe Reinforcement Learning").

Regarding Claim 1, Smolyanskiy teaches a computer-implemented method for training a model (Smolyanskiy, Abstract, "In various examples, reinforcement learning is used to train at least one machine learning model (MLM) to control a vehicle by leveraging a deep neural network (DNN) trained on real-world data by using imitation learning to predict movements of one or more actors to define a world model", therefore, a method for training a model is disclosed), comprising:

learning a plurality of Markov decision processes using reinforcement learning in respective training environments (Smolyanskiy, Par. [0031], "These rare and/or unsafe events may not be properly captured in the training data, thus inadequately preparing the model for the rare and/or unsafe events. Examples of such rare and/or unsafe events may include vehicle cut-ins and harsh braking that might lead to collisions. To address these issues, one or more embodiments of the disclosure use a model-based reinforcement learning framework and formulate the driving problem as a Markov Decision Process (MDP) with the DNN as the transition model. Embodiments of the present disclosure may train an additional policy network to produce ego actions.", therefore, learning a plurality of Markov decision processes using reinforcement learning in respective training environments is disclosed);

extracting logic rules from the plurality of Markov decision processes (See introduction of Schmidt reference below for teaching of "extracting" logic rules from the plurality of Markov decision processes); and

training a reward logic neural network (LNN) and a safety LNN using the logic rules extracted from the plurality of Markov decision processes, wherein the reward LNN and the safety LNN each take a state-action pair as an input and output a corresponding score for the state-action pair (Smolyanskiy, Par. [0036], "FIG. 1A shows the model training system 100 which may be used to train a policy MLM 106 using a trained prediction MLM 104 to define a world model. […] The policy MLM 106 may use data corresponding to the output from the prediction MLM 104 to generate one or more predictions corresponding to one or more actions for a vehicle, such as the vehicle 1100. The value function MLM 108 may also use data corresponding to the output from the prediction MLM 104 to generate one or more predictions corresponding to one or more scores for the one or more actions predicted using the policy MLM 106.", thus, a reward logic LNN (policy MLM) and a safety LNN (value function MLM) are trained based on the Markov decision processes (See Smolyanskiy Par. [0031] & [0034]) and each neural network takes a state-action pair as input and generates corresponding scores based on the input. Further, the MLMs may be implemented as recurrent neural networks per Smolyanskiy Par. [0074]; this is supported by Applicant's dependent claim 13, which states that the logic neural networks may be implemented as recurrent neural networks).

While Smolyanskiy teaches a model-based reinforcement learning framework utilizing Markov Decision Processes (See Smolyanskiy Par. [0031] & [0034]), Smolyanskiy does not explicitly disclose "extracting" logic rules from the plurality of Markov decision processes.

However, Schmidt teaches extracting logic rules from the plurality of Markov decision processes (Schmidt, Pg. 1, "This paper provides a pipeline that builds on this technique to create policies that are both safe and interpretable. The pipeline trains a non-interpretable RL agent for safe behavior, modifying existing reward structures and training techniques. It then extracts a set of rules approximating the policy in the form of a decision tree" & Pg. 2, "An overview of our pipeline for the training and extraction of safe interpretable policies is illustrated in Fig. 2. First, the safe training step trains an RL agent using PPO [22] towards a safe policy (see Sec. III-A). We formulate the task as a CMDP and use reward modifications for safety distances (similar to Ye et al. [18]) to solve for safe behavior.", therefore, logic rules are extracted from the plurality of Markov decision processes).

It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method for training a model, as disclosed by Smolyanskiy, to include extracting logic rules from the plurality of Markov decision processes, as disclosed by Schmidt. One of ordinary skill in the art would have been motivated to make this modification to enable the extraction of logic rules which may be used to formally prove correctness, hence improving model accuracy and robustness during training (Schmidt, Pg. 1, "While most work on neural network interpretability [13], [14] fails to explain policy networks, previous work by Bastani et al. [4] uses imitation learning on the policy to extract decision trees that are both interpretable and easy to comprehend through manual analysis. Moreover, the decision trees can be formulated as sets of logical clauses of the input because the policy is piecewise linear. Through formulating system clauses the logical rules can be used to formally prove correctness [4].").
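Editorial note: the rule-extraction step the rejection attributes to Schmidt can be pictured as fitting a shallow decision tree to state-action pairs sampled from a policy and reading each root-to-leaf path as an if-then logic rule. The sketch below is an illustration under assumptions: the use of scikit-learn, the toy data, and the feature names are not Schmidt's actual pipeline.

from sklearn.tree import DecisionTreeClassifier, export_text

# Toy state features [distance_to_lead_vehicle_m, ego_speed_mps] paired with
# the action a hypothetical policy chose in each state.
X = [[30.0, 25.0], [8.0, 25.0], [8.0, 10.0], [40.0, 15.0]]
y = ["keep_speed", "brake", "keep_speed", "accelerate"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each root-to-leaf path prints as a nested if-then rule over the features,
# i.e., a propositional approximation of the sampled policy.
print(export_text(tree, feature_names=["distance_to_lead", "ego_speed"]))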
Regarding Claim 2, Smolyanskiy in view of Schmidt teaches the method of claim 1, wherein extracting the logic rules includes identifying state-action pairs in the plurality of Markov decision processes and expressing the state-action pairs as logic propositions (Schmidt, Pg. 2, "In a Markov decision process (MDP) an agent interacts with an environment by taking actions based on the current state of the environment. More formally, a MDP is a tuple (S, A, Pa, Ra), where S is a state space, A is an action space, Pa(s, s′) = P(st+1 = s′ | st = s, at = a) is the transition probability to move from state st to state s′ with an action at, and Ra(s, s′) is a reward the agent receives after transitioning from s to s′ with action a at time step t. A policy π maps states to actions, i.e., it decides which action to take in each state", thus, extracting the logic rules includes identifying state-action pairs in the plurality of Markov decision processes. Furthermore, as mentioned on Pg. 1, the rules are formed as part of a decision tree formulated as sets of logical clauses). The reasons of obviousness have been noted in the rejection of Claim 1 above and are applicable herein.

Regarding Claim 3, Smolyanskiy in view of Schmidt teaches the method of claim 1, wherein training the reward LNN includes an objective function that maximizes a reward value while minimizing logical contradictions (Smolyanskiy, Par. [0029], "In one or more embodiments, reinforcement learning used to train a policy network may apply a value function, which may be evaluated based at least on one or more of the states of the environment predicted by the DNN, to assign a score(s) to the predictions made by the MLM. For example, rewards may be associated with one or more goals of the policy network and penalties may be associated with collisions or other predicted or inferred states of the network. In one or more embodiments, the value function may include one or more state value functions and/or q-functions, and states of the value function may correspond to times and locations in the latent space of the DNN.", thus, training the neural networks includes an objective/value function which maximizes a reward value while minimizing contradictions/penalties).

Regarding Claim 4, Smolyanskiy in view of Schmidt teaches the method of claim 1, wherein training the safety LNN includes an objective function that maximizes a safety value while minimizing logical contradictions (Smolyanskiy, Par. [0031], "In one or more embodiments, the DNN may be trained at least partially using one or more supervised learning techniques, which may include (for example and without limitation) imitation learning. One potential drawback of traditional imitation learning is that conventional implementations often have difficulty handling rare and/or unsafe events. These rare and/or unsafe events may not be properly captured in the training data, thus inadequately preparing the model for the rare and/or unsafe events. Examples of such rare and/or unsafe events may include vehicle cut-ins and harsh braking that might lead to collisions. To address these issues, one or more embodiments of the disclosure use a model-based reinforcement learning framework and formulate the driving problem as a Markov Decision Process (MDP) with the DNN as the transition model. Embodiments of the present disclosure may train an additional policy network to produce ego actions.", thus, training the safety LNN may include an objective/value function to maximize safety while minimizing logical contradictions/rare or unsafe events).

Regarding Claim 5, Smolyanskiy in view of Schmidt teaches the method of claim 1, further comprising combining the plurality of Markov decision processes into a target constrained Markov decision process (Schmidt, Pg. 3, "To include strict safety requirements, we formulate the problem as a constrained Markov decision process (CMDP). In a CMDP the agent learns to optimize its reward subject to a set of constraints (e.g., safety). Our pipeline adds a reward component for safety distances to solve for safe behavior", therefore, the plurality of Markov decision processes may be combined into a constrained Markov decision process). The reasons of obviousness have been noted in the rejection of Claim 1 above and are applicable herein.

Regarding Claim 6, Smolyanskiy in view of Schmidt teaches the method of claim 1, wherein the reward LNN and the safety LNN are implemented as recurrent neural networks, with neurons representing logical operations and unique propositions (Smolyanskiy, Par. [0074], "In some embodiments, the DNN 416 may include a recurrent neural network (RNN).", therefore, the neural networks may be implemented as recurrent neural networks with neurons representing unique logical operations).
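Editorial note: the constrained-MDP setting referenced in the Claim 5 mapping is commonly written as reward maximization subject to a bound on an expected cost; the notation below is a generic textbook formulation with an assumed cost function C and budget d, not taken from Schmidt or the application.

\max_{\pi} \; \mathbb{E}_{\pi}\Big[\sum_{t} \gamma^{t} R(s_t, a_t)\Big]
\quad \text{subject to} \quad
\mathbb{E}_{\pi}\Big[\sum_{t} \gamma^{t} C(s_t, a_t)\Big] \le d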
Conclusion

15. THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

16. Any inquiry concerning this communication or earlier communications from the examiner should be directed to Devika S Maharaj whose telephone number is (571)272-0829. The examiner can normally be reached Monday - Thursday 8:30am - 5:30pm.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Alexey Shmatov, can be reached at (571)270-3428. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/D.S.M./
Examiner, Art Unit 2123

/ALEXEY SHMATOV/
Supervisory Patent Examiner, Art Unit 2123

Prosecution Timeline

Nov 10, 2021
Application Filed
Sep 23, 2025
Non-Final Rejection — §102, §103
Dec 12, 2025
Interview Requested
Dec 18, 2025
Examiner Interview Summary
Dec 18, 2025
Applicant Interview (Telephonic)
Dec 19, 2025
Response Filed
Feb 06, 2026
Final Rejection — §102, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12585948
NEURAL PROCESSING DEVICE AND METHOD FOR PRUNING THEREOF
Granted Mar 24, 2026 (2y 5m to grant)
Patent 12579426
Training a Neural Network having Sparsely-Activated Sub-Networks using Regularization
Granted Mar 17, 2026 (2y 5m to grant)
Patent 12572795
ANSWER SPAN CORRECTION
Granted Mar 10, 2026 (2y 5m to grant)
Patent 12561577
AUTOMATIC FILTER SELECTION IN DECISION TREE FOR MACHINE LEARNING CORE
Granted Feb 24, 2026 (2y 5m to grant)
Patent 12554969
METHOD AND SYSTEM FOR THE AUTOMATIC SEGMENTATION OF WHITE MATTER HYPERINTENSITIES IN MAGNETIC RESONANCE BRAIN IMAGES
Granted Feb 17, 2026 (2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 55%
With Interview: 63% (+7.7%)
Median Time to Grant: 5y 0m
PTA Risk: Moderate
Based on 78 resolved cases by this examiner. Grant probability derived from career allow rate.
