Prosecution Insights
Last updated: April 19, 2026
Application No. 17/453,610

DEVICE AND METHOD FOR TRAINING A CONTROL STRATEGY WITH THE AID OF REINFORCEMENT LEARNING

Non-Final OA: §101, §103, §112

Filed: Nov 04, 2021
Examiner: JONES, CHARLES JEFFREY
Art Unit: 2122
Tech Center: 2100 — Computer Architecture & Software
Assignee: Robert Bosch GmbH
OA Round: 3 (Non-Final)

Grant Probability: 27% (At Risk)
OA Rounds: 3-4
To Grant: 4y 2m
With Interview: 93%

Examiner Intelligence

Career Allow Rate: 27% (4 granted / 15 resolved), -28.3% vs TC avg. Grants only 27% of cases.
Interview Lift: +65.9% (strong) for resolved cases with interview vs without.
Typical Timeline: 4y 2m average prosecution.
Career History: 42 total applications across all art units (27 currently pending, 15 resolved).
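How these figures fit together can be reproduced directly. The additive percentage-point reading of "interview lift" is an assumption inferred from the page's own numbers (27% + 65.9 points ≈ 93% "with interview"); rounding accounts for the small gaps.

```python
# Sanity check on the dashboard's examiner stats. The additive
# percentage-point model for "interview lift" is an assumption inferred
# from the page's own numbers, not a documented definition.
granted, resolved = 4, 15
allow_rate = granted / resolved                       # 0.267 -> shown as 27%
tc_avg = allow_rate + 0.283                           # -28.3% vs TC avg => TC avg ~ 55%
with_interview = allow_rate + 0.659                   # 0.926 -> shown as 93%
print(f"career allow rate: {allow_rate:.1%}")
print(f"implied TC average: {tc_avg:.1%}")
print(f"implied allow rate with interview: {with_interview:.1%}")
```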

Statute-Specific Performance

§101: 34.5% allowance (-5.5% vs TC avg)
§103: 29.1% allowance (-10.9% vs TC avg)
§102: 17.7% allowance (-22.3% vs TC avg)
§112: 17.7% allowance (-22.3% vs TC avg)
Tech Center averages are estimates. Based on career data from 15 resolved cases.
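A quick check of the table, assuming each "vs TC avg" delta is in percentage points relative to the Tech Center average (the page does not define it explicitly): every row implies the same estimated TC average of 40.0%, consistent with a single per-TC baseline.

```python
# Reconstructing the implied Tech Center average from each row above,
# under the assumption that the delta is percentage points vs that average.
rows = {"§101": (34.5, -5.5), "§103": (29.1, -10.9),
        "§102": (17.7, -22.3), "§112": (17.7, -22.3)}
for statute, (rate, delta) in rows.items():
    print(f"{statute}: implied TC average = {rate - delta:.1f}%")
# Every row yields 40.0%, i.e. a single TC-average baseline.
```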

Office Action

§101 §103 §112
DETAILED ACTION

This action is responsive to the Application/amendment filed on 12/08/2025. Claims 10-14 and 16-20 are pending in the case. Claims 10, 16, and 17 are independent claims. Claims 11, 13, and 14 are amended.

Continued Examination Under 37 CFR 1.114

A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 01/08/2026 has been entered.

Information Disclosure Statement

The information disclosure statements (IDS) submitted on 09/15/2025 and 11/25/2025 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.

Priority

Receipt is acknowledged of certified copies of papers required by 37 CFR 1.55.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 10-14 and 16-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or, for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention. Dependent claims inherit the lack of clarity and do not cure any of the deficiencies.

Claim 11 recites the limitation "the state" in lines 6-9 of claim 11. There is insufficient antecedent basis for this limitation in the claim. It is ambiguous as to whether it refers to states already declared (such as "a respective starting state," "a respective target state of the robotic device," "respective intermediary states," or "for each respective one of the states" from claims 10, 16, and 17) or creates a reference to new states.

Claim 13 recites the limitation "the particular control pass" in lines 4-6 of claim 13. There is insufficient antecedent basis for this limitation in the claim. It is ambiguous as to whether it refers to control passes already declared (such as "for each of at least some of the control passes" from claim 13 or "for each of one or more of the control passes" from claim 10) or creates a reference to new control passes.

Claims 10, 11, 13, 16, 17, and 20 recite the limitations "the respective state" and "respective state of the control pass" in lines 11-15, 18-24, and 41-15 of claim 10; lines 2-4 of claim 11; lines 18-22 of claim 13; lines 11-15, 18-24, and 41-15 of claim 16; lines 11-15, 18-24, and 41-15 of claim 17; and lines 2-4 of claim 20. There is insufficient antecedent basis for these limitations in the claims. It is ambiguous as to whether they refer to states already declared (such as "a respective starting state," "a respective target state of the robotic device," "respective intermediary states," or "for each respective one of the states" from claims 10, 16, and 17) or create a reference to new states.

Claim Rejections - 35 USC § 101

Given Applicant's arguments and amendments, the prior §101 rejection has been withdrawn.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 10-14 and 16-20 are rejected under 35 U.S.C. 103 as being unpatentable over Leonetti et al. ("A synthesis of automated planning and reinforcement learning for efficient, robust decision-making," henceforth Leonetti) in view of Yang et al. ("PEORL: Integrating Symbolic Planning and Hierarchical Reinforcement Learning for Robust Decision-Making," henceforth Yang).
Regarding claim 10: Leonetti discloses a method of a robotic device for self-training a control strategy for itself using reinforcement learning performed by the robotic device navigating within an environment (Leonetti, Page 103, Abstract, "We demonstrate the effectiveness of the proposed method on a service robot, carrying out a variety of tasks in an office building," where the service robot is considered a robotic device and the method is considered reinforcement learning).

Leonetti discloses carrying out multiple reinforcement learning training passes (Leonetti, Page 111, Paragraph 7, "It has an incremental version of the solver designed specifically for planning, which efficiently concatenates these queries for different plan lengths retaining the fluents grounded at the previous iteration") that each includes the robotic device transitioning from a respective starting state of the robotic device in the environment to a respective target state of the robotic device in the environment (Leonetti, Page 111, Paragraph 5, "Given an initial state and a goal, we want to compute all the plans of length less than a certain threshold," where the planning from initial state to goal is considered transitioning from a respective starting state to a target state), and one or more of which includes one or more respective intermediary states between the starting state and the target state (Leonetti, Page 109, Paragraph 5, "Lastly, planning has been combined with RL through reward shaping... Reward shaping is a technique to hasten Reinforcement Learning when the reward is sparse, and the agent has to execute a long sequence of actions before getting any feedback about its choices. The reward function is enriched by adding a term which provides feedback for intermediate states, helping guide the agent towards the goal," where Leonetti's intermediate states between initial state and goal are considered intermediary states).

Leonetti discloses that, for each respective one of the states of each of the reinforcement learning training passes prior to reaching the respective target state of the respective reinforcement learning training pass, a respective control pass is performed for selecting a respective action for the robotic device to perform to transition out of the respective state (Leonetti, Page 115, Paragraph 1, "The partial policy computed in the previous phase is used during the execution to constrain the agent's behavior. At each state in the original MDP D, the agent chooses an action from among the ones returned by the partial policy for that state. Making an informed choice at this stage is the last step of DARLING, and is based on reinforcement learning," where a partial policy is a set-valued action constraint (i.e., for each state, it returns the set of actions); see also Leonetti, Page 110, Paragraph 5, "execution and learning: during execution, choose only actions that are returned by the partial policy, and learn their expected cumulative long-term reward in order to converge to the optimal solution").

Leonetti discloses that, for each of one or more of the control passes (Leonetti, Algorithm 1, line 4, "for i ← l + 1 to L do," where Algorithm 1 shows the basic pseudocode and the for loop is considered disclosing doing so for each control pass), the selection of the respective action for the respective control pass includes applying an answer set programming solver to an answer set programming program (Leonetti, Page 111, Paragraph 2, "We used Answer Set Programming to represent our models, and the answer set solver Clingo," where Clingo is considered applying an answer set programming solver) which models a relationship between actions and successor states that are reached by the actions to ascertain a respective plurality of sequences of states reachable from the respective state of the control pass (Leonetti, Page 111, Paragraph 8, "The answer set solver returns sequences <si, ai> where si is a state, that is a set of fluents, and ai is the action executed in state si, at time step i," where the modeling of states si being executed at time i with action ai is considered modeling sets of actions to sequences) according to a specified planning horizon that predefines a maximum number of states of each of the sequences (Leonetti, Page 111, Paragraph 5, "Given an initial state and a goal, we want to compute all the plans of length less than a certain threshold," where the length threshold of Leonetti is considered a predefined maximum number of states).

Leonetti discloses that, for each of one or more other ones of the control passes, the selection of the respective action for the respective control pass includes a random action selection according to a predefined rate of random exploration (Leonetti, Page 105, Paragraph 7, "The exploration will be an ϵ-greedy strategy. With the ϵ-greedy strategy, the agent chooses the current optimal action according to qπ with probability 1 − ϵ, and a random action with probability ϵ," where choosing a random action or the optimal action with probability ϵ is considered exploring with a predefined rate of random exploration).

Leonetti discloses that, for each of the control passes, the robotic device performs the respective selected action of the respective control pass (Leonetti, Page 104, Paragraph 5, "The agent interacts with the environment by choosing an action At," where interacting with the environment after choosing an action is considered a robotic device performing an action of a control pass).

Leonetti does not disclose, however, Yang does disclose selecting, from the respective plurality of sequences ascertained for the respective control pass, a sequence that delivers a maximum return compared to returns of all other sequences of the respective plurality of sequences, each of the returns being a respective sum of rewards that are obtained upon reaching a respective final one of the states of the respective sequence (Yang, Page 5, Col. 2, Paragraph 2, "In Scenario 2 … It starts by trying the shortest plan but during exploration of longer alternatives, it discovers the extra reward, and finally converge to the optimal," where converging to an optimal plan is considered selecting a sequence that delivers the maximum return compared to other sequences; see also Yang, Page 3, Col. 2, Paragraph 2, "A plan Π … is optimal iff…is maximal among all plans," and Yang, Page 4, Col. 2, Paragraph 3, "Given a plan Π, the quality of Π is defined by summing up all gain rewards for the transitions in Π"), and selecting, as the respective action for the robotic device to perform in the respective control pass, an action that, when performed while the robotic device is in the respective state of the control pass, results in a first state of the selected sequence (Yang, Page 3, Col. 1, Paragraph 1, "An option (I, π, β) is available in state st iff st ϵ I. After the option is taken, a course of actions is selected according to π until the option is terminated stochastically according to the termination condition β," where selecting an option in the state from the set of options and mapping transition maps is considered selecting an action that results in a first state of a selected sequence; see also Yang, Page 3, Col. 2, Paragraph 5, "We further build one more deterministic layer by mapping a transition path defined by a symbolic plan to a set of options… It is easy to see that the execution of a symbolic plan is deterministically realized by executing their corresponding options sequentially").

Leonetti and Yang are analogous art because they are from the same field of technology: autonomous decision-making systems combining symbolic reasoning and reinforcement learning. Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art, having the teachings of Leonetti and Yang before him or her, to modify the planner of DARLING in Leonetti to include choosing the path with the maximum cumulative reward, as in Yang, to make the planning objective explicitly reward-optimal and reduce the risk of being optimal only in a reduced space. The suggestion/motivation for doing so would have been Yang, Page 5, Col. 2, Paragraph 2, "It starts by trying the shortest plan but during exploration of longer alternatives, it discovers the extra reward, and finally converge to the optimal," and Yang, Page 5, Col. 1, Algorithm 1, Line 13, which updates the planning goal by checking plan quality against the current optimal quality to tighten the goal.
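For readers less familiar with the mechanics the examiner is mapping here, the following is a minimal sketch of a single control pass as the Office Action characterizes the claims: bounded-horizon lookahead over candidate state sequences, selection of the maximum-return sequence, execution of the action leading to its first state, and ε-greedy random exploration. All names are hypothetical, and enumerate_sequences stands in for the ASP solver call; this illustrates the claimed pattern, not Leonetti's or the application's actual implementation.

```python
import random

# Hypothetical illustration of one control pass in the shape the Office
# Action describes. `enumerate_sequences` stands in for an ASP solver call
# and is assumed to return candidate sequences as lists of
# (action, state, reward) triples, each at most `horizon` states long.
def control_pass(state, enumerate_sequences, horizon, epsilon, actions):
    if random.random() < epsilon:                # random exploration pass
        return random.choice(actions)
    candidates = enumerate_sequences(state, horizon)
    # Return of a sequence = sum of rewards collected up to its final state.
    best = max(candidates, key=lambda seq: sum(r for _, _, r in seq))
    first_action, _, _ = best[0]                 # action leading to the first
    return first_action                          # state of the selected sequence
```

In this shape, the planning horizon is simply the length bound passed to the solver, which is the role the examiner assigns to Leonetti's plan-length threshold.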
Regarding claim 13: The rejection of claim 10 is incorporated, and the claim recites further additional elements/limitations.

Leonetti discloses selecting between different predefined numbers of states as the planning horizon according to a condition (Leonetti, Page 111, Paragraph 7, "Given an encoding of the initial state and the goal, all the answer sets returned for the maximum time step incrementally increased up to L are all the plans of length at most L").

Leonetti discloses the condition of when, for ascertaining the respective action for the particular control pass, a predefined available computation budget is depleted (Leonetti, Page 110, Paragraph 5, "plan generation: generate all the plans that have a cost lower than a given threshold in the model," where DARLING looking at plans with cost lower than a threshold is considered checking whether a predefined computation budget is depleted; see also Leonetti, Page 128, Paragraph 5, "As a consequence, in the real-world domains the number of plans was often limited by planning time, rather than plan length," where the planning time is equated to the computation budget).

Leonetti discloses that when, for ascertaining the respective action for the particular control pass, the predefined available computation budget is not yet depleted, a second number of states is used as the planning horizon, the second number of states being greater than the first number of states (Leonetti, Page 111, Paragraph 4, "plans are computed in order of length. We take advantage of this feature to define the threshold of plan length as a function of the length of the shortest plans," where the longer DARLING runs, the longer the plans in order of length (states), up to the maximum defined threshold; this is considered checking whether an available computation budget is depleted and, if not, continuing to the next length count (state), which will have a higher count of states).

Leonetti discloses the method then comprising ascertaining multiple sequences of states that are reachable from the respective state using the second number of states by applying the answer set programming solver to the answer set programming program which models the relationship between actions and the successor states that are reached by the actions (Leonetti, Page 107, Paragraph 3, "We want to represent a transition system Dm = <Sm, A, fm> (where the subscript stands for model) in answer set programming, where Sm is a set of states, A is a set of actions and fm: Sm × A → Sm is the transition function. This representation has to allow for fast planning, therefore in our models the set Sm is always discrete, and the function fm is always deterministic.").

Leonetti does not disclose, however, Yang does disclose that a first number of states of the sequence is used as the planning horizon, and selecting, from the sequences ascertained using the second number of states, that sequence among the ascertained sequences that delivers the maximum return (Yang, Page 5, Col. 2, Paragraph 2, "In Scenario 2 … It starts by trying the shortest plan but during exploration of longer alternatives, it discovers the extra reward, and finally converge to the optimal," where the finding of better optimal routes is considered finding the sequence that delivers the maximum return, and finding the maximum return is done whether the computation budget is depleted or not, with the difference being which states are considered for the maximum return).
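As a sketch of the claim 13 behavior discussed above, under the assumption that a per-pass computation budget is tracked in seconds: when the budget is depleted a first, smaller horizon is used; otherwise a second, greater horizon. Names and numbers are illustrative, and enumerate_sequences is the same ASP-solver stand-in used in the earlier sketch.

```python
# Hypothetical sketch of the claim 13 horizon selection. The horizon
# depends on whether the per-pass computation budget is depleted: a first,
# smaller number of states when depleted, otherwise a second, greater one.
def budgeted_control_pass(state, enumerate_sequences, remaining_budget_s,
                          first_horizon=3, second_horizon=10):
    horizon = first_horizon if remaining_budget_s <= 0 else second_horizon
    candidates = enumerate_sequences(state, horizon)
    best = max(candidates, key=lambda seq: sum(r for _, _, r in seq))
    first_action, _, _ = best[0]   # still execute the action toward the
    return first_action            # first state of the max-return sequence
```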
Regarding claim 16: Leonetti discloses a robotic device that includes a control device configured to self-train a control strategy for the robotic device using reinforcement learning performed by the robotic device navigating within an environment (Leonetti, Page 103, Abstract, "We demonstrate the effectiveness of the proposed method on a service robot, carrying out a variety of tasks in an office building," where the service robot is considered a robotic device with a control device and the method is considered reinforcement learning). The remaining limitations of claim 16 parallel those of claim 10, and claim 16 is rejected over the same Leonetti and Yang citations, the same analogous-art finding, and the same motivation to combine set forth above for claim 10.
Regarding claim 17: Leonetti discloses a non-transitory computer-readable memory medium on which is stored a computer program including program instructions that are executable by a robotic device and that, when executed by the robotic device, cause the robotic device to perform a self-training of a control strategy for the robotic device using reinforcement learning performed by the robotic device navigating within an environment (Leonetti, Page 103, Abstract, "We demonstrate the effectiveness of the proposed method on a service robot, carrying out a variety of tasks in an office building," where the service robot is considered a robotic device with a computer-readable memory medium and the method is considered reinforcement learning). The remaining limitations of claim 17 parallel those of claim 10, and claim 17 is rejected over the same Leonetti and Yang citations, the same analogous-art finding, and the same motivation to combine set forth above for claim 10.

Regarding claim 18: The rejection of claim 10 is incorporated, and the claim recites further additional elements/limitations.

Leonetti discloses, for each of at least one of the performed actions, observing, by the robotic device, a respective reward and a respective subsequent state achieved by the respective action (Leonetti, Page 104, Paragraph 5, "We consider the system at discrete time steps. Let t ∈ N be the current time, and St be the state at time t. The agent interacts with the environment by choosing an action At and perceiving the next state St+1…It also receives a reward Rt+1") and updating a target control policy based on the observation (Leonetti, Page 115, Paragraph 1, "At each state in the original MDP D, the agent chooses an action from among the ones returned by the partial policy for that state. Making an informed choice at this stage is the last step of DARLING, and is based on reinforcement learning.").
Regarding claim 19: The rejection of claim 18 is incorporated, and the claim recites further additional elements/limitations.

Leonetti discloses wherein each of one or more of the control passes includes selecting, based on the rate of random exploration, between using the random action selection and using the target control policy for selecting the action to be performed by the robotic device for the respective control pass (Leonetti, Page 105, Paragraph 7, "The exploration will be an ϵ-greedy strategy. With the ϵ-greedy strategy, the agent chooses the current optimal action according to qπ with probability 1 − ϵ, and a random action with probability ϵ," where the probability of choosing the current optimal action or a random action is considered selecting between a random action and the target control policy based on the rate of random exploration).

Regarding claim 20: The rejection of claim 19 is incorporated, and the claim recites further additional elements/limitations.

Leonetti discloses wherein the selection between using the random action selection and using the target control policy is performed for each of the control passes whose respective state is determined by the robotic device to have been previously encountered during any of the training passes (Leonetti, Page 118, Paragraph 5, "We always used λ = 0.9 and γ = 1, since all tasks are episodic," where the episodic learning is considered a robotic device taking into account whether a state has been previously encountered, as episodic learning takes previous training passes into account when learning).
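The claim 18 limitation discussed above (observing a reward and a subsequent state, then updating a target control policy) has the shape of a standard temporal-difference update. Below is a tabular Q-learning sketch as one conventional instance of such an update; it is illustrative only, with assumed alpha/gamma and action set, and is not asserted to be the application's or Leonetti's specific method.

```python
from collections import defaultdict

# One standard instance of "observe reward and next state, then update a
# target control policy": a tabular Q-learning update. Illustrative only.
class QPolicy:
    def __init__(self, actions, alpha=0.1, gamma=0.99):
        self.q = defaultdict(float)   # (state, action) -> estimated return
        self.actions, self.alpha, self.gamma = actions, alpha, gamma

    def update(self, state, action, reward, next_state):
        # TD target bootstraps on the best action available in next_state.
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        td_target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (td_target - self.q[(state, action)])

    def best_action(self, state):
        return max(self.actions, key=lambda a: self.q[(state, a)])
```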
Claims 10-14 and 16-20 are rejected under 35 U.S.C. 103 as being unpatentable over Leonetti et al. ("A synthesis of automated planning and reinforcement learning for efficient, robust decision-making," henceforth Leonetti) in view of Yang et al. ("PEORL: Integrating Symbolic Planning and Hierarchical Reinforcement Learning for Robust Decision-Making," henceforth Yang), and further in view of Lin et al. ("Self-Improving Reactive Agents Based On Reinforcement Learning, Planning and Teaching," henceforth Lin).

Regarding claim 11: The rejection of claim 10 is incorporated, and the claim recites further additional elements/limitations.

Yang discloses selecting the sequence among the ascertained sequences that delivers the maximum return and selecting the respective action via which the first state of the selected sequence, starting from the state, may be reached, when the state was reached for the first time in the multiple reinforcement learning training passes (Yang, Page 5, Col. 2, Paragraph 2, "In Scenario 2 … It starts by trying the shortest plan but during exploration of longer alternatives, it discovers the extra reward, and finally converge to the optimal," where the broadest reasonable interpretation of the limitation encompasses finding an optimal path using the new state via a maximum-return reward; see also Yang, Page 3, Col. 2, Paragraph 2, "A plan Π … is optimal iff…is maximal among all plans," where optimal means a plan whose reward is maximal among all plans, and Yang, Page 2, Col. 1, Paragraph 3, "The PEORL agent outperforms planning agent by discovering a new state that leads to extra reward and reducing the number of execution failure").

The Leonetti-Yang combination does not disclose, however, Lin does disclose wherein, for each of the control passes in a reinforcement learning training pass, a check is made as to whether the respective state of the respective control pass was reached for the first time in the multiple reinforcement learning training passes, and the respective action is ascertained by ascertaining the plurality of sequences (Lin, Page 77, Paragraph 5, "The idea is this: For situations where the policy is very decisive about the best action, relaxation planning is not needed. If the policy cannot be very sure about which is the best action, relaxation planning is performed," where the policy being unsure about the best action corresponds to encountering a new state, since if the state is new the policy cannot know the best action, and relaxation planning corresponds to ascertaining sequences; see also Lin, Page 76, Paragraph 5, "Using an accurate action model (if it is available), the agent can experience the consequences of actions without participating in the real world," which emphasizes that the agent simulates to ascertain the best course of action).

Leonetti-Yang and Lin are analogous art because they are from the same field of technology: reinforcement learning and planning for autonomous agents. Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art, having the teachings of Leonetti-Yang and Lin before him or her, to modify the planner of DARLING in Leonetti-Yang to include the planning phase and trigger of Lin to reduce the frequency of planning required. The suggestion/motivation for doing so would have been Lin, Page 77, Paragraph 5, "In this way, at the beginning of learning, all actions are equally good and relaxation planning is performed frequently. As learning proceeds, relaxation planning is performed less and less often."

Regarding claim 12: The rejection of claim 11 is incorporated, and the claim recites further additional elements/limitations.

Leonetti discloses wherein, for each state that has already been reached in the multiple reinforcement learning training passes, the respective action is selected according to a previously trained control strategy (Leonetti, Page 115, Paragraph 1, "the agent chooses an action from among the ones returned by the partial policy for that state. Making an informed choice at this stage is the last step of DARLING, and is based on reinforcement learning"), or randomly (Leonetti, Page 105, Paragraph 7, "The exploration will be an ϵ-greedy strategy. With the ϵ-greedy strategy, the agent chooses the current optimal action according to qπ with probability 1 − ϵ, and a random action with probability ϵ.").

Regarding claim 14: The rejection of claim 11 is incorporated, and the claim recites further additional elements/limitations.

Leonetti discloses wherein the answer set programming solver assists with multi-shot solving, and the multiple sequences for successive states in each reinforcement learning training pass are ascertained by multi-shot solving using the answer set programming solver by solving each of one or more control passes of a sequence of the control passes by reusing information from prior control passes (Leonetti, Page 111, Paragraph 7, "Given an encoding of the initial state and the goal, all the answer sets returned for the maximum time step incrementally increased up to L are …All the queries with different maximum lengths up to L would be separate, but the answer set solver Clingo is particularly efficient at this computation. It has an incremental version of the solver designed specifically for planning, which efficiently concatenates these queries for different plan lengths retaining the fluents grounded at the previous iteration," where Clingo's incremental mode is considered multi-shot solving: the maximum step is incrementally increased up to L, "all the answer sets returned" across these incremental steps are all plans of length at most L and are considered many "shots" (solve calls), and the time step is grounded up to a constant n, with the solver incrementing n to expand the horizon).
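The multi-shot solving the claim 14 discussion refers to corresponds to clingo's incremental (multi-shot) Python API. The sketch below follows the standard incmode pattern: ground a base part once, then repeatedly ground step/check parts for increasing time steps while toggling a query external, so groundings from earlier horizons are reused across solve calls. The ASP program is a trivial placeholder, not the application's or Leonetti's actual planning encoding.

```python
from clingo import Control, Function, Number

# Minimal multi-shot loop in clingo's standard incmode style. The program
# below is a placeholder with a two-action choice per step.
PROGRAM = """
#program base.
holds(start, 0).
#program step(t).
1 { do(t, left); do(t, right) } 1.
#program check(t).
#external query(t).
"""

ctl = Control(["0"])          # "0" = enumerate all models (all plans)
ctl.add("base", [], PROGRAM)
ctl.ground([("base", [])])
for t in range(1, 4):         # grow the horizon one step at a time
    ctl.ground([("step", [Number(t)]), ("check", [Number(t)])])
    if t > 1:                 # retire the previous horizon's query atom
        ctl.release_external(Function("query", [Number(t - 1)]))
    ctl.assign_external(Function("query", [Number(t)]), True)
    ctl.solve(on_model=lambda m, t=t: print(f"t={t}:", m.symbols(shown=True)))
```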
Response to Arguments

Applicant's arguments filed 12/08/2025 have been fully considered, but they are not persuasive. A breakdown of the arguments can be found below.

§112: Applicant appears to argue on pages 11-12 that "the state" refers to "the respective state of the respective control pass" and therefore properly refers to "the respective state." Examiner respectfully disagrees, as it is ambiguous whether the limitation refers to states already declared (such as "a respective starting state," "a respective target state of the robotic device," "respective intermediary states," or "for each respective one of the states" from claims 10, 16, and 17) or creates a reference to new states: "the state" has not been explicitly defined, and any implied definition is ambiguous because it potentially references multiple previously defined state(s).

Applicant appears to argue on pages 11-12 that "the respective state" refers to each state of "for each respective one of the states" and therefore properly refers to "the respective state." Examiner respectfully disagrees, as it is ambiguous whether the limitation refers to states already declared (such as "a respective starting state," "a respective target state of the robotic device," "respective intermediary states," or "for each respective one of the states" from claims 10, 16, and 17) or creates a reference to new states: "the respective state" has not been explicitly defined, and any implied definition is ambiguous because it potentially references multiple previously defined respective state(s).

§103: Applicant appears to argue on pages 13-14 that the claimed invention implements a per-state decision framework in which ASP or RL is selected at runtime as the robot operates, and that the prior art does not disclose the invention. Examiner respectfully disagrees, as the features referenced, such as ASP or RL being selected at runtime during robot operation, are not positively recited in the manner Applicant asserts, at least not in the cited limitations of claims 10, 16, or 17. Examiner interprets independent claims 10, 16, and 17 as running reinforcement learning training passes in which, for each state reached, an action (a control decision, such as moving a robot's left arm or adjusting motor speed, that is to be executed) is chosen using ASP and RL. This does not reflect "ASP or RL being selected at runtime during robot operation," as ASP and RL are not selected between; both are used for each control pass. Examiner interprets the claims such that for each state a respective control pass is performed to select a respective action, and for each control pass the selection of the action includes applying answer set programming and selecting the action with the maximum return; the action is otherwise selected randomly according to the rate of random exploration. With this interpretation, ASP and RL are used for each state.

Applicant appears to argue on pages 13-15 that Leonetti's planning is done offline (i.e., not during runtime), while Yang's PEORL performs ASP only in an initial planning stage and reinforcement learning in a separate execution stage, and that neither exhibits dynamic (RL or ASP) per-state decision making interleaved during live operation. Examiner respectfully disagrees: Leonetti's planning is done offline, however, Yang discusses that planning can be triggered before the reinforcement-learning training is completed, during runtime. See Yang, Page 6, Col. 1, Paragraph 2, "By contrast, in our work, planning is interleaved with and constantly updated by RL, and therefore new options can be explored and more meaningful ones will be selected leveraging learning," where interleaving planning during RL requires a runtime implementation. In light of Applicant's arguments, Examiner has added additional prior art Lin (Lin, Page 77, Paragraph 5, "The idea is this: For situations where the policy is very decisive about the best action, relaxation planning is not needed. If the policy cannot be very sure about which is the best action, relaxation planning is performed"), which discusses a per-state decision to switch to planning if the learned policy is unsure about the best action.

Applicant appears to argue on page 16 that neither Leonetti nor Yang discloses a bounded-horizon symbolic lookahead mechanism that has "a planning horizon that predefines a maximum number of states" of a sequence and in which the selected action leads to the "first state" of a sequence, and that Leonetti's planner operates once at the pre-execution planning stage and plays no role in the selection of actions. Examiner respectfully disagrees: Examiner's understanding of Leonetti is that the plans created by Leonetti are bounded by length ("maximum number of states") and use the initial state ("a first state of a sequence") to reach a goal state. See Leonetti, Page 111, Paragraph 4, "plans are computed in order of length. We take advantage of this feature to define the threshold of plan length as a function of the length of the shortest plans," and Leonetti, Page 111, Paragraph 5, "Given an initial state and a goal, we want to compute all the plans of length less than a certain threshold." Further, any lack of a runtime component in Leonetti is remedied by the combination with Yang, as the references must be considered as a combination.

§101: The §101 rejection has been withdrawn given the arguments and amendments filed 10/27/2025, rendering the §101 arguments moot.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to CHARLES JEFFREY JONES JR, whose telephone number is (703) 756-1414. The examiner can normally be reached Monday through Friday, 8:00-5:00 EST.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Kakali Chaki, can be reached at 571-272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/C.J.J./ Examiner, Art Unit 2122
/KAKALI CHAKI/ Supervisory Patent Examiner, Art Unit 2122

Prosecution Timeline

Nov 04, 2021
Application Filed
Feb 20, 2025
Non-Final Rejection — §101, §103, §112
May 27, 2025
Response Filed
Sep 04, 2025
Final Rejection — §101, §103, §112
Dec 08, 2025
Response after Final Action
Jan 08, 2026
Request for Continued Examination
Jan 24, 2026
Response after Non-Final Action
Mar 06, 2026
Non-Final Rejection — §101, §103, §112 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12582959
DATA GENERATION DEVICE AND METHOD, AND LEARNING DEVICE AND METHOD
Granted Mar 24, 2026 (2y 5m to grant)
Patent 12380333
METHOD OF CONSTRUCTING NETWORK MODEL FOR DEEP LEARNING, DEVICE, AND STORAGE MEDIUM
Granted Aug 05, 2025 (2y 5m to grant)
Study what changed to get past this examiner. Based on 2 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 27%
With Interview: 93% (+65.9%)
Median Time to Grant: 4y 2m
PTA Risk: High
Based on 15 resolved cases by this examiner. Grant probability derived from career allow rate.
