Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 112(b)
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 3, 10, and 17 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Regarding claim 3,
Claim 3 recites the limitation “wherein the training for the particular one of the parameter options includes storing”; however, it is unclear what the term “the parameter options” refers to, as no “parameter options” are recited previously within claim 3 or within claim 1, the parent claim of claim 3, although the similar term “parameterized options” is recited numerous times. The term “the parameter options” therefore lacks antecedent basis. For examination purposes, the limitation will be interpreted as reading “wherein the training for the particular one of the parameterized options includes storing”.
Regarding claim 10,
Claim 10 recites a system for performing the function of the method of claim 3. All other limitations in claim 10 are substantially the same as those in claim 3; therefore, claim 10 is considered indefinite under an equivalent rationale and is interpreted for examination purposes in the same way.
Regarding claim 17,
Claim 17 recites a medium containing a computer program product for performing the function of the method of claim 3. All other limitations in claim 17 are substantially the same as those in claim 3; therefore, claim 17 is considered indefinite under an equivalent rationale and is interpreted for examination purposes in the same way.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to abstract ideas without significantly more.
Regarding claim 1,
Step 1 - “Is the claim to a process, machine, manufacture or composition of matter?”
Yes, the claim is directed towards a process.
Step 2A, Prong 1 - “Is the claim directed to a law of nature, a natural phenomenon (product of nature) or an abstract idea?”:
The limitation of generating a mapping function between MDP states of a MDP within the MDP distribution and planning states of the planning domain; recites an evaluation of a mapping function, which is a mental process and therefore an abstract idea, regardless of whether it is performed on a generic computer or using a generic machine learning model.
The limitation of defining, using the mapping function, a parameterized option for each of the lifted action models; recites an evaluation of options, which is a mental process and therefore an abstract idea, regardless of whether it is performed on a generic computer or using a generic machine learning model.
Step 2A, Prong 2 - “Does the claim recite additional elements that integrate the judicial exception into a practical application?”:
The limitation of receiving a planning domain including lifted action models; recites the mere extra-solution activity of data gathering, which does not integrate the exception into a practical application, MPEP 2106.05(d) and 2106.05(g).
The limitation of receiving a Markov Decision Process (MDP) distribution; recites the mere extra-solution activity of data gathering, which does not integrate the exception into a practical application, MPEP 2106.05(d) and 2106.05(g).
The limitation of training, using reinforcement learning, an intra-option policy for each of the parameterized options recites mere instructions to apply reinforcement learning to train intra-option policies, which does not integrate the exception into a practical application, MPEP 2106.05(d) and 2106.05(f).
Step 2B - “Does the claim recite additional elements that amount to significantly more than the judicial exception?”:
The limitation of receiving a planning domain including lifted action models; recites receiving data over a network, which is well-understood, routine, and conventional, MPEP 2106.05(d).II., example (i) of WURC computer functions.
The limitation of receiving a Markov Decision Process (MDP) distribution; recites receiving data over a network, which is well-understood, routine, and conventional, MPEP 2106.05(d).II., example (i) of WURC computer functions.
The limitation of training, using reinforcement learning, an intra-option policy for each of the parameterized options recites mere instructions to apply reinforcement learning to train intra-option policies, which is not significantly more than any recited judicial exceptions, MPEP 2106.05(d) and 2106.05(f).
Therefore, claim 1 is found to be ineligible subject matter under 35 U.S.C. 101.
Regarding claim 2,
Claim 2 adds the additional limitation to claim 1:
wherein a particular one of the parameterized options is defined as: an initiation set for the particular one of the parameterized options, one or more option parameters, a termination condition for the particular one of the parameterized options, and an intra-option policy for the particular one of the parameterized options recites further detail regarding the parameterized options that are defined with a mapping function, without changing that the evaluation of the mapping function amounts to the abstract idea of a mental evaluation, regardless of whether it is performed on a generic computer.
Therefore, claim 2 is found to be ineligible subject matter under 35 U.S.C. 101.
Regarding claim 3,
Claim 3 adds the additional limitation to claim 2:
initializing a replay buffer for the particular one of the parameterized options, recites mere instructions to apply a replay buffer for an option, which does not integrate the exceptions into a practical application, and is not significantly more than any recited judicial exceptions, MPEP 2106.05(d) and 2106.05(f).
the training for the particular one of the parameter options includes storing, within the replay buffer and for a particular action from the intra-option policy for the particular one of the parameterized options, data including: an initial state, the particular action, a reward, a subsequent state, and one or more values associated with the one or more option parameters, recites the mere extra-solution activity of selecting a particular type of data to be manipulated, which does not integrate the exception into a practical application, MPEP 2106.05(d) and 2106.05(g), and which recites storing information in memory, which is well-understood, routine, and conventional, MPEP 2106.05(d).II., example (iv) of WURC computer functions.
and the intra-option policy for the particular one of the parameterized options is updated using the data recites mere instructions to apply data to update a policy, which does not integrate the exceptions into a practical application and is not significantly more than any recited judicial exceptions, MPEP 2106.05(d) and 2106.05(f).
Therefore, claim 3 is found to be ineligible subject matter under 35 U.S.C. 101.
Regarding claim 4,
Claim 4 adds the additional limitations to claim 1:
the MDP distribution defines an environment including a plurality of MDP meeting constraints, recites an evaluation of an environment, which is a mental process and therefore an abstract idea, regardless of whether it is performed on a generic computer or using a generic machine learning model.
and the constraints including a predicate, action, object, type, and action model recites further detail on the constraints within an environment, without changing that defining an environment amounts to the abstract idea of a mental evaluation, regardless of whether it is performed on a generic computer.
Therefore, claim 4 is found to be ineligible subject matter under 35 U.S.C. 101.
Regarding claim 5,
Claim 5 adds the additional limitation to claim 4:
wherein the policy for performing the goal is configured to be used with a second MDP that meets the constraints recites further detail on the policy that is trained, without changing that generic training of a policy amounts to mere instructions to apply, which does not integrate the exceptions into a practical application and is not significantly more than any recited judicial exceptions, MPEP 2106.05(d) and 2106.05(f).
Therefore, claim 5 is found to be ineligible subject matter under 35 U.S.C. 101.
Regarding claim 6,
Claim 6 adds the additional limitation to claim 1:
wherein the mapping is generated using planning annotated reinforcement learning (PaRL) recites mere instructions to apply planning annotated reinforcement learning to generate a mapping, which does not integrate the exceptions into a practical application and is not significantly more than any recited judicial exceptions, MPEP 2106.05(d) and 2106.05(f).
Therefore, claim 6 is found to be ineligible subject matter under 35 U.S.C. 101.
Regarding claim 7,
Claim 7 adds the additional limitation to claim 1:
wherein the planning domain includes a plurality of options for performing the goal recites further detail on the planning domain that is received, without changing that receiving a planning domain amounts to the mere extra-solution activity of data gathering, which does not integrate the exception into a practical application, MPEP 2106.05(d) and 2106.05(g), and which recites receiving data over a network, which is well-understood, routine, and conventional, MPEP 2106.05(d).II., example (i) of WURC computer functions.
Therefore, claim 7 is found to be ineligible subject matter under 35 U.S.C. 101.
Regarding claims 8-14,
Claims 8-14 recite a computer hardware system, which is a machine, for performing the function of the method of claims 1-7, respectively, with substantially the same limitations. Therefore, the same analysis and rejection applied to claims 1-7 applies to claims 8-14.
Therefore, claims 8-14 are found to be ineligible subject matter under 35 U.S.C. 101.
Regarding claims 15-20,
Claims 15-20 recite a computer program product, comprising a computer readable storage medium, which is a manufacture, for performing the function of the method of claims 1-6, respectively, with substantially the same limitations. Therefore, the same analysis and rejection applied to claims 1-6 applies to claims 15-20.
Therefore, claims 15-20 are found to be ineligible subject matter under 35 U.S.C. 101.
Prior Art
The following references are used for prior art claim rejections:
Lee et al. “AI Planning Annotation for Sample Efficient Reinforcement Learning”
Konidaris and Doshi-Velez “Hidden Parameter Markov Decision Processes: An Emerging Paradigm for Modeling Families of Related Tasks”
Juba et al. “Safe Learning of Lifted Action Models”
Horcik and Fiser “Endomorphisms of Lifted Planning Problems”
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-3, 6-10, 13-17, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Lee et al. “AI Planning Annotation for Sample Efficient Reinforcement Learning”, hereinafter Lee, in view of Konidaris and Doshi-Velez “Hidden Parameter Markov Decision Processes: An Emerging Paradigm for Modeling Families of Related Tasks”, hereinafter Konidaris, further in view of Juba et al. “Safe Learning of Lifted Action Models”, hereinafter Juba.
Regarding claim 1,
Lee teaches A computer-implemented method for generating a policy for performing a goal ((Lee Pg. 2) “In this goal oriented environment, we are interested in the sparse reward task where the r(s, a) is a constant step cost when s ∉ G and zero in a goal state, and the objective is to learn a stationary optimal policy π* that maximizes the expected return, π*”) and including a plurality of intra-option policies, ((Lee Pg. 8) “We design a general method for injecting intrinsic rewards to RL agents from the abstract planning task by reformulating the underlying decomposed sub-MDPs with constraints visible to planning agents. Learning only the (intra-option) policies for these sub-MDPs is shown to work well in practice on various problems, significantly improving sample efficiency”) comprising:
receiving a planning domain [including lifted action models]; ((Lee Pg. 5) “We also constructed the PaRL task E = <M, Π, L> by modeling a planning task based on the domain knowledge with an appropriate state mapping function L”, Lee does not teach lifted action models)
receiving a Markov Decision Process (MDP) [distribution]; ((Lee Pg. 2) “In reinforcement learning, we assume that an agent interacts with a goal oriented MDP M = <S, 𝒜, P, r, s0, G, γ> with a set of states S, a set of actions 𝒜, a state transition function P : S × 𝒜 × S → [0, 1], a reward function r : S × A → R, an initial state s0 ∈ S, a set of goal states G ⊂ S, and a discounting factor γ ∈ (0, 1) for the rewards”, Lee does not teach a distribution for the Markov Decision Processes)
generating a mapping function between MDP states of a MDP [within the MDP distribution] and planning states of the planning domain; ((Lee Pg. 3) “A PaRL task is a triple E := <M, Π, L>, where M := <S, A, P, r, s0, G, γ> is a goal-oriented MDP over RL states S, Π := <V, O, s’0, s∗> is a planning task over planning states S, and L : S → S’ is a surjective mapping from the states of the MDP S to planning task S’ satisfying s’0 = L(s0) and s∗ consistent with L(s) for all s ∈ G”, the PaRL task includes a mapping L between MDP S states and planning task S’, Lee does not teach an MDP distribution)
training, using reinforcement learning, an intra-option policy ((Lee Pg. 5) “RL agents also have the freedom to train the SMDP policy µ and intra-option policies πOo using either off-policy or on-policy algorithms”) for each of the [parameterized] options (Lee Pg. 5, Algorithm 1 shows training of policies for each option on lines 15 and 16)
[media_image1.png: reproduction of Lee, Algorithm 1 (greyscale, 700 × 665)]
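As general technical context only (not a reproduction of Lee's Algorithm 1 and not the claimed method), the following minimal Python sketch illustrates the pattern discussed above of training one intra-option policy per option; the update rule is a deliberate placeholder, and all names (train_intra_option_policies, rollout, random_rollout) are hypothetical:

    import random
    from typing import Callable, Dict, List, Tuple

    # A transition: (state, action, reward, next_state)
    Transition = Tuple[int, int, float, int]

    def train_intra_option_policies(
            option_names: List[str],
            rollout: Callable[[str, Dict[int, int]], List[Transition]],
            num_iterations: int = 50) -> Dict[str, Dict[int, int]]:
        """Train one intra-option policy (a state -> action table) per option.
        The 'update' below is a placeholder (keep the highest-reward action seen
        for each state) standing in for a real RL update such as Q-learning."""
        policies: Dict[str, Dict[int, int]] = {name: {} for name in option_names}
        best_reward: Dict[str, Dict[int, float]] = {name: {} for name in option_names}
        for _ in range(num_iterations):
            for name in option_names:
                for s, a, r, _s_next in rollout(name, policies[name]):
                    if r > best_reward[name].get(s, float("-inf")):
                        best_reward[name][s] = r
                        policies[name][s] = a
        return policies

    if __name__ == "__main__":
        rng = random.Random(0)
        def random_rollout(name: str, policy: Dict[int, int]) -> List[Transition]:
            s, out = 0, []
            for _ in range(5):
                a, r, s_next = rng.randrange(3), rng.random(), rng.randrange(4)
                out.append((s, a, r, s_next))
                s = s_next
            return out
        print(train_intra_option_policies(["O_pickup", "O_stack"], random_rollout))

In an actual system, the placeholder update would be replaced by an off-policy or on-policy reinforcement learning algorithm.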
Konidaris teaches the following further limitation that Lee does not completely teach:
receiving a Markov Decision Process (MDP) distribution; ((Konidaris Pg. 1) “HIP-MDPs model a distribution of tasks where the variation in the dynamics (or the reward function, though we do not discuss that case here) across the family of tasks can be captured by a set of hidden parameters, θ, encountered with probability Pθ”, (Konidaris Pg. 2) “HIP-MDPs have some useful properties which suggest that they merit study in their own right. One is that any specific task instance is an MDP, and can be solved independently as one”)
At the time of filing, one of ordinary skill in the art would have motivation to combine Lee and Konidaris by taking the method for creating mappings between Markov Decision Process (MDP) states and planning states to train intra-option policies, taught by Lee, and including the MDPs being from an MDP distribution, taught by Konidaris, as Konidaris teaches: (Konidaris Pg. 2) “However, HIP-MDPs have some useful properties which suggest that they merit study in their own right. One is that any specific task instance is an MDP, and can be solved independently as one…This allows us to model scenarios where the source tasks for transfer learning can be solved directly by reinforcement learning. Moreover, the hidden parameters θ are constant for each task…These properties allow us to gather large batches of transition data that correspond to fixed settings of θ, easing model learning. Finally, the latent parameters are a sufficient statistic for specifying an individual task…Thus, if we have a model of the HIP-MDP, we can expend great computational effort offline to solve various task instances, and then synthesize a parameterized policy (da Silva, Konidaris, and Barto 2012) to be deployed given the agent’s belief over θb at runtime”, that is, that learning policies for MDPs in an MDP distribution allows for policies to be learned faster, and for that learning to be transferred between MDPs within the distribution, both of which reduce the time and cost of learning. Such a combination would be obvious.
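As illustration only of the hidden-parameter idea quoted from Konidaris above (an assumption-laden sketch, not Konidaris's code and not the claimed method; the names HiddenParamMDPFamily, sample_task, and theta are hypothetical), an MDP distribution could be modeled as a family of tasks indexed by a hidden parameter:

    import random
    from dataclasses import dataclass

    @dataclass
    class TaskMDP:
        """One task instance: an MDP whose dynamics depend on a hidden parameter theta."""
        theta: float   # hidden parameter, fixed for the life of this task
        goal: int      # goal state index

    class HiddenParamMDPFamily:
        """A distribution over related MDPs, indexed by a hidden parameter theta."""
        def __init__(self, theta_low: float, theta_high: float, num_states: int):
            self.theta_low = theta_low
            self.theta_high = theta_high
            self.num_states = num_states

        def sample_task(self, rng: random.Random) -> TaskMDP:
            # Each draw fixes theta for the whole task, so batches of transition
            # data gathered within a task correspond to a single setting of theta.
            theta = rng.uniform(self.theta_low, self.theta_high)
            return TaskMDP(theta=theta, goal=rng.randrange(self.num_states))

    if __name__ == "__main__":
        family = HiddenParamMDPFamily(theta_low=0.0, theta_high=1.0, num_states=10)
        print([family.sample_task(random.Random(i)) for i in range(3)])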
Juba teaches the following further limitations that neither Lee, nor Konidaris teach:
receiving a planning domain including lifted action models; ((Juba Pg. 3) “A classical planning domain is defined by a tuple <T, F, 𝒜, M> where T is a set of types, F is a set of lifted fluents, 𝒜 is a set of lifted actions, and M is an action model for 𝒜”)
defining, using the mapping function, a parameterized option for each of the lifted action models; ((Juba Pg. 2) “A lifted action A ∈ 𝒜 is a pair <name, params> where name is a symbol and params is a list of types, denoted name(A) and params(A), respectively, and arity(A, t) denotes the number of type-t parameters. The action model M for a set of actions A is a pair of functions preM and effM that map every action in 𝒜 to its preconditions and effects. To define the preconditions and effects of a lifted action, we first define the notion of a parameter-bound literal. A parameter binding of a lifted literal L and an action A is a function bL,A : params(L) → params(A) that maps every parameter of L to a parameter in A”, a set of actions corresponds to an option)
At the time of filing, one of ordinary skill in the art would have motivation to combine Lee, Konidaris, and Juba by taking the method for creating mappings between Markov Decision Process (MDP) states and planning states to train intra-option policies, with the MDPs being from a distribution, jointly taught by Lee and Konidaris, and including lifted action models in the planning domain which are each mapped to parameterized options, taught by Juba, as Juba teaches: (Juba Pgs. 1-2) “Stern and Juba (2017) proposed a sound algorithm for safe model-free planning,…However, their positive result is limited to grounded domain models, that is, domains that are not defined by lifted, i.e., parameterized, actions and fluents. The size of a grounded domain model can be arbitrarily larger than its corresponding lifted domain model. In particular, a single lifted action can yield a number of grounded actions that grow polynomially with the number of objects in the domain, with the number of parameters of the lifted action as its exponent. In addition, learning a grounded domain model limits the generalization possible between different groundings of the same lifted domain. For example, a grounded action model for a blocksworld domain with 8 blocks cannot be used to solve problems for a blocksworld domain with 9 blocks”, that is, that lifted action models that support the inclusion of parameters are significantly more efficient and applicable to a wider variety of problems than their alternative. Such a combination would be obvious.
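As a further illustration only (an assumed sketch, not Juba's formalism and not the claimed method; LiftedAction, ParameterizedOption, and define_options are hypothetical names), one way a parameterized option could be derived for each lifted action model using a state mapping function:

    from dataclasses import dataclass, field
    from typing import Callable, Dict, List

    @dataclass
    class LiftedAction:
        """A lifted (parameterized) action schema: a name plus typed parameters,
        with preconditions and effects given as abstract propositions."""
        name: str
        params: List[str]
        preconditions: List[str] = field(default_factory=list)
        effects: List[str] = field(default_factory=list)

    @dataclass
    class ParameterizedOption:
        """One option per lifted action; the state mapping supplies the
        initiation and termination tests over low-level states."""
        name: str
        params: List[str]
        initiation: Callable[[dict], bool]
        termination: Callable[[dict], bool]

    def define_options(lifted_actions: List[LiftedAction],
                       state_map: Callable[[dict], Dict[str, bool]]) -> List[ParameterizedOption]:
        options = []
        for act in lifted_actions:
            # Initiation: the mapped (abstract) state satisfies the preconditions.
            # Termination: the mapped state satisfies the effects.
            init = lambda s, a=act: all(state_map(s).get(p, False) for p in a.preconditions)
            term = lambda s, a=act: all(state_map(s).get(e, False) for e in a.effects)
            options.append(ParameterizedOption(act.name, act.params, init, term))
        return options

    if __name__ == "__main__":
        pickup = LiftedAction("pickup", ["block"], preconditions=["handempty"], effects=["holding"])
        opts = define_options([pickup], state_map=lambda s: {"handempty": s.get("gripper_open", 0) > 0.5})
        print(opts[0].name, opts[0].initiation({"gripper_open": 1.0}))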
Regarding claim 2,
Lee, Konidaris, and Juba jointly teach The method of claim 1,
Lee further teaches:
wherein a particular one of the [parameterized] options is defined as: ((Lee Pg. 2) “A set of options 𝒪 formalizes the temporally extended actions that defines a semi-MDP (SMDP) over the original MDP M. A Markovian option O ∈ 𝒪 is a triple <IO, πO, βO>”, Juba but not Lee teaches parameters for options)
an initiation set for the particular one of the [parameterized] options, ((Lee Pg. 2) “A Markovian option O ∈ 𝒪 is a triple <IO, πO, βO>, where the IO is the initiation set in which O can begin”, Juba but not Lee teaches parameters for options)
a termination condition for the particular one of the [parameterized] options, ((Lee Pg. 2) “A Markovian option O ∈ 𝒪 is a triple <IO, πO, βO>, where…the βO is a termination set in which O terminates”, Juba but not Lee teaches parameters for options)
and an intra-option policy for the particular one of the [parameterized] options ((Lee Pg. 2) “A Markovian option O ∈ 𝒪 is a triple <IO, πO, βO>, where…the πO is a stationary option policy πO : S × A → [0, 1]”, Juba but not Lee teaches parameters for options)
Juba further teaches:
one or more option parameters, ((Juba Pg. 2) “A lifted action A ∈ 𝒜 is a pair <name, params> where name is a symbol and params is a list of types, denoted name(A) and params(A), respectively, and arity(A, t) denotes the number of type-t parameters. The action model M for a set of actions A is a pair of functions preM and effM that map every action in 𝒜 to its preconditions and effects”, a set of actions corresponds to an option)
At the time of filing, one of ordinary skill in the art would have motivation to combine the method jointly taught by Lee, Konidaris, and Juba for the parent claim of claim 2, claim 1. No new embodiments are introduced, so the reason to combine is the same as for the parent claim.
Regarding claim 3,
Lee, Konidaris, and Juba jointly teach The method of claim 2, further comprising:
Lee further teaches:
initializing a replay buffer for the particular one of the [parameterized] options, (Lee Pg. 5, Algorithm 1 instructs on line 1 to “Initialize trajectory buffer B”, and on line 14 to store several variables in a tuple with a particular option O, Juba but not Lee teaches parameters for options)
wherein the training for the particular one of the [parameter] options includes storing, within the replay buffer and for a particular action from the intra-option policy for the particular one of the [parameterized] options, data… (Lee Pg. 5, Algorithm 1 instructs on line 14 to store several variables associated with a particular option O to buffer B, with other variables in the tuple being a, standing for an action, which is sampled at line 11 from intra-option policy πOo, Juba but not Lee teaches parameters for options)
data including: an initial state, the particular action, a reward… (Lee Pg. 5, Algorithm 1 instructs on line 14 to store several variables associated with a particular option O to buffer B, with other variables in the tuple being current state s, action a, and intrinsic reward ri)
and the intra-option policy for the particular one of the parameterized options is updated using the data ((Lee Pg. 5) “In the rollout phase, the HRL agent samples an option Oo using the SMDP policy µ. If the Oo was never selected before, we create the option and initialize the policy πOo and add it to a container. Next, the sample trajectories are generated by using the πOo until it terminates. After sampling one-step state transition, we compute the intrinsic reward following Definition 7. Then, the HRL agent updates the option policy and the SMDP policy using the samples stored in the buffer in the training phase”)
Konidaris further teaches:
data including:…[the particular action, a reward,] a subsequent state,… ((Konidaris Pg. 1) “A single Markov decision process is defined by a tuple <S, A, R, T, γ>, where S is a state space, A is a set of available actions, R(s, a, s’) defines the reward obtained when executing action a in state s and transitioning to state s’”, Lee also teaches a particular action and a reward)
Juba further teaches:
data including:…and one or more values associated with the one or more option parameters, ((Juba Pg. 2) “A lifted action A ∈ 𝒜 is a pair <name, params> where name is a symbol and params is a list of types, denoted name(A) and params(A), respectively, and arity(A, t) denotes the number of type-t parameters. The action model M for a set of actions A is a pair of functions preM and effM that map every action in 𝒜 to its preconditions and effects…A parameter-bound literal L for the lifted action A is a pair of the form <L, bL,A> where b is a parameter binding of L and A. preM(A) and effM(A) are sets of parameter-bound literals for A”, a set of actions corresponds to an option, a set of parameter-bound literals for action A corresponds to values for parameters for part of the option)
At the time of filing, one of ordinary skill in the art would have motivation to combine the method jointly taught by Lee, Konidaris, and Juba for the parent claim of claim 3, claim 2. No new embodiments are introduced, so the reason to combine is the same as for the parent claim.
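For general context on the per-option replay buffer discussed for claim 3 (a hedged sketch under assumed names PerOptionReplayBuffers, store, and sample; it is not Lee's Algorithm 1 and not the claimed method), storing an initial state, action, reward, next state, and option-parameter values per option might look like:

    import random
    from collections import defaultdict
    from typing import Dict, List, Tuple

    # One stored transition: (initial state, action, reward, next state, option-parameter values)
    Transition = Tuple[int, int, float, int, Tuple[float, ...]]

    class PerOptionReplayBuffers:
        """Keeps a separate replay buffer for each parameterized option."""
        def __init__(self) -> None:
            self.buffers: Dict[str, List[Transition]] = defaultdict(list)

        def store(self, option: str, s: int, a: int, r: float, s_next: int,
                  param_values: Tuple[float, ...]) -> None:
            self.buffers[option].append((s, a, r, s_next, param_values))

        def sample(self, option: str, batch_size: int, rng: random.Random) -> List[Transition]:
            buf = self.buffers[option]
            return rng.sample(buf, min(batch_size, len(buf)))

    if __name__ == "__main__":
        buffers = PerOptionReplayBuffers()
        buffers.store("O_move", s=0, a=1, r=-1.0, s_next=2, param_values=(0.3,))
        batch = buffers.sample("O_move", batch_size=4, rng=random.Random(0))
        # A learner would then update the option's intra-option policy from `batch`.
        print(batch)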
Regarding claim 6,
Lee, Konidaris, and Juba jointly teach The method of claim 1,
Lee further teaches:
wherein the mapping is generated using planning annotated reinforcement learning (PaRL) ((Lee Pg. 3) “A PaRL task is a triple E := <M, Π, L>, where M := <S, A, P, r, s0, G, γ> is a goal-oriented MDP over RL states S, Π := <V, O, s’0, s∗> is a planning task over planning states S, and L : S → S’ is a surjective mapping from the states of the MDP S to planning task S’ satisfying s’0 = L(s0) and s∗ consistent with L(s) for all s ∈ G”, the PaRL task includes a mapping L between MDP S states and planning task S’)
At the time of filing, one of ordinary skill in the art would have motivation to combine the method jointly taught by Lee, Konidaris, and Juba for the parent claim of claim 6, claim 1. No new embodiments are introduced, so the reason to combine is the same as for the parent claim.
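For illustration of the kind of state mapping quoted from Lee above (a hedged sketch; the propositions and thresholds are assumptions, and state_map is a hypothetical name, not Lee's implementation), a surjective mapping from low-level MDP states to abstract planning states could look like:

    from typing import Dict, FrozenSet

    def state_map(mdp_state: Dict[str, float]) -> FrozenSet[str]:
        """Map a low-level MDP state (continuous features) to an abstract planning
        state (a set of true propositions). Thresholds are illustrative only."""
        props = set()
        if mdp_state.get("gripper_open", 0.0) > 0.5:
            props.add("handempty")
        if mdp_state.get("block_height", 0.0) > 0.1:
            props.add("holding_block")
        return frozenset(props)

    if __name__ == "__main__":
        print(state_map({"gripper_open": 1.0, "block_height": 0.0}))  # frozenset({'handempty'})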
Regarding claim 7,
Lee, Konidaris, and Juba jointly teach The method of claim 1,
Lee further teaches:
wherein the planning domain includes a plurality of options for performing the goal ((Lee Pg. 4) “For any pair of initial state s0 ∈ S and a goal sg ∈ G in M, we can generate a sequence of options {Oo1, Oo2, ..., Ook} from a plan in Π that reaches the goal state L(sg) ∈ S’ from the initial state L(s0) ∈ S’”)
At the time of filing, one of ordinary skill in the art would have motivation to combine the method jointly taught by Lee, Konidaris, and Juba for the parent claim of claim 7, claim 1. No new embodiments are introduced, so the reason to combine is the same as for the parent claim.
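As illustration only of the plan-to-options relationship quoted from Lee above (a hedged sketch; options_from_plan and the example names are hypothetical), a symbolic plan could be turned into a sequence of options to execute:

    from typing import Dict, List

    def options_from_plan(plan: List[str], options_by_action: Dict[str, str]) -> List[str]:
        """Map each planning action in a plan that reaches the abstract goal to its
        corresponding option, yielding a sequence of options to execute."""
        return [options_by_action[action] for action in plan]

    if __name__ == "__main__":
        plan = ["pickup(blockA)", "stack(blockA, blockB)"]
        mapping = {"pickup(blockA)": "O_pickup", "stack(blockA, blockB)": "O_stack"}
        print(options_from_plan(plan, mapping))  # ['O_pickup', 'O_stack']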
Regarding claims 8-10, 13, and 14,
Claims 8-10, 13, and 14 recite a computer hardware system for performing the function of the method of claims 1-3, 6, and 7, respectively. Specifically, claim 8 recites A computer hardware system for generating a policy for performing a goal and including a plurality of intra-option policies, comprising: a hardware processor configured to perform the following executable operations: [the method of claim 1]. Lee recites: (Lee Pg. 14) “We used two types of hardwares for evaluation. For the N-rooms and logistics domain, we used CPU only machines with 56 cores of 2 GHz CPUs and for the Montezuma’s revenge domain, we used P100 GPU for the training”.
All other limitations in claims 8-10, 13, and 14 are substantially the same as those in claims 1-3, 6, and 7, respectively; therefore, the same rationale for rejection applies.
Regarding claims 15-17 and 20,
Claims 15-17 and 20 recite a computer program product comprising a computer-readable storage medium storing program code for performing the function of the method of claims 1-3 and 6, respectively. Specifically, claim 15 recites A computer program product, comprising: a computer readable storage medium having stored therein program code for generating a policy for performing a goal and including a plurality of intra-option policies, the program code, which when executed by a computer hardware system, causes the computer hardware system to perform: [the method of claim 1]. Lee recites: (Lee Pg. 14) “We implemented Algorithm 2 using python language by extending RL agents in stable-baselines 3 and using pyperplan as AI planner…For the N-rooms and logistics domain, we used CPU only machines with 56 cores of 2 GHz CPUs and for the Montezuma’s revenge domain, we used P100 GPU for the training”, python language is program code, and CPUs and GPUs inherently include computer-readable memory components.
All other limitations in claims 15-17 and 20 are substantially the same as those in claims 1-3 and 6, respectively; therefore, the same rationale for rejection applies.
Claims 4, 5, 11, 12, 18, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Lee, in view of Konidaris, further in view of Juba, further in view of Horcik and Fiser “Endomorphisms of Lifted Planning Problems”, hereinafter Horcik.
Regarding claim 4,
Lee, Konidaris, and Juba jointly teach The method of claim 1,
Lee further teaches:
wherein the MDP [distribution] defines an environment ((Lee Pg. 5) “To empirically evaluate the proposed approach, we conducted experiments on the three goal oriented MDP environments”, Konidaris but not Lee teaches a distribution for MDPs) including a plurality of MDP meeting constraints, ((Lee Pg. 8) “We design a general method for injecting intrinsic rewards to RL agents from the abstract planning task by reformulating the underlying decomposed sub-MDPs with constraints visible to planning agents”)
Horcik teaches the following further limitation, which Konidaris does not teach and which Horcik teaches more explicitly than Lee or Juba:
and the constraints including a predicate, action, object, type, and action model ((Horcik Pg. 2) “Definition 2. A normalized PDDL task is a tuple 𝒫 = <B, T, V, P, A, ψI, ψG> where B is a non-empty set of objects, T is a type hierarchy over B…P is a set of predicate symbols…An action schema a(x⃗) ∈ A is a tuple a = <prea(x⃗), adda(x⃗), dela(x⃗)> where prea(x⃗), adda(x⃗), and dela(x⃗) are sets of atoms, called preconditions, add effects, and delete effects, respectively, and x⃗ is a tuple of variables occurring in any atom in a(x⃗). If we substitute a tuple of objects…for x⃗, we create a ground action (or shortly just an action). The resulting ground action is denoted by a(b⃗)”, according to Applicant’s specification at [0041] “Lifted Action Model (LAM) (also known as an action schema)”, therefore an action schema is an action model)
At the time of filing, one of ordinary skill in the art would have motivation to combine Lee, Konidaris, Juba, and Horcik by taking the method of claim 1 for training intra-option policies using mappings between MDP states, from MDPs in a distribution, and planning states, including defining an environment with constraints using the MDP distribution, jointly taught by Lee, Konidaris, and Juba, and additionally including the constraints being specifically a predicate, an action, an object, a type, and an action model, taught by Horcik, as Horcik teaches: (Horcik Pg. 1) “Classical planning tasks are usually modeled in the standard PDDL language based on first-order logic”. With the constraints being part of the PDDL (Planning Domain Definition Language), doing so is a well-known technique for outlining a planning problem in a standardized and human-readable fashion, imparting the predictable benefit of being known and easy to work with for experts in the field and increasing the ease of performing the method. Such a combination would be obvious.
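For general context only (an assumed sketch, not Horcik's Definition 2 and not the claimed constraints; ActionSchema and PlanningDomainConstraints are hypothetical names), the constraint elements discussed above (objects, types, predicates, and action models) could be represented as:

    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class ActionSchema:
        """A lifted action schema (one action-model entry): typed parameters,
        preconditions, add effects, and delete effects."""
        name: str
        parameters: List[str]
        preconditions: List[str] = field(default_factory=list)
        add_effects: List[str] = field(default_factory=list)
        delete_effects: List[str] = field(default_factory=list)

    @dataclass
    class PlanningDomainConstraints:
        """Constraints of the kind discussed above: objects, a type assignment,
        predicate symbols, and action schemas (the action model)."""
        objects: List[str]
        types: Dict[str, str]          # object -> type
        predicates: List[str]
        actions: List[ActionSchema]

    if __name__ == "__main__":
        domain = PlanningDomainConstraints(
            objects=["blockA", "blockB"],
            types={"blockA": "block", "blockB": "block"},
            predicates=["on", "clear", "handempty"],
            actions=[ActionSchema("pickup", ["block"],
                                  preconditions=["clear", "handempty"],
                                  add_effects=["holding"],
                                  delete_effects=["handempty"])],
        )
        print(domain.actions[0].name)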
Regarding claim 5,
Lee, Konidaris, Juba, and Horcik jointly teach The method of claim 4,
Lee further teaches:
wherein the policy for performing the goal is configured to be used with a second MDP that meets the constraints ((Lee Pg. 8) “We design a general method for injecting intrinsic rewards to RL agents from the abstract planning task by reformulating the underlying decomposed sub-MDPs with constraints visible to planning agents. Learning only the (intra-option) policies for these sub-MDPs is shown to work well in practice on various problems, significantly improving sample efficiency”, a sub-MDP is a second MDP)
At the time of filing, one of ordinary skill in the art would have motivation to combine the method jointly taught by Lee, Konidaris, Juba, and Horcik for the parent claim of claim 5, claim 4. No new embodiments are introduced, so the reason to combine is the same as for the parent claim.
Regarding claims 11 and 12,
Claims 11 and 12 recite a computer hardware system for performing the function of the method of claims 4 and 5, respectively. All other limitations in claims 11 and 12 are substantially the same as those in claims 4 and 5, respectively; therefore, the same rationale for rejection applies.
Regarding claims 18 and 19,
Claims 18 and 19 recite a computer program product comprising a computer-readable storage medium storing program code for performing the function of the method of claims 4 and 5, respectively. All other limitations in claims 18 and 19 are substantially the same as those in claims 4 and 5, respectively; therefore, the same rationale for rejection applies.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Fiser “Lifted Fact-Alternating Mutex Groups and Pruned Grounding of Classical Planning Problems” discloses a proof of characteristics of lifted mutex groups and applies this to provide a more efficient method of translation to a grounded representation.
Sutton et al. “Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning” discloses planning with options and intra-option methods for reinforcement learning with Markov Decision Processes and Semi-Markov Decision Processes.
Xu et al. “Hierarchical Reinforcement Learning in StarCraft II with Human Expertise in Subgoals Selection” discloses a method for teaching agents to play a video game using Hierarchical Reinforcement Learning.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to VICTOR A NAULT whose telephone number is (703) 756-5745. The examiner can normally be reached M - F, 12 - 8.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang can be reached at (571) 270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/V.A.N./Examiner, Art Unit 2124
/Kevin W Figueroa/Primary Examiner, Art Unit 2124