Last updated: May 29, 2026
Application No. 17/969,129
PESSIMISTIC OFFLINE REINFORCEMENT LEARNING SYSTEM AND METHOD

Final Rejection: §101, §103
Filed: Oct 19, 2022
Priority: Jul 29, 2022 (provisional 63/393,600)
Examiner: ILES, TYLER EDWARD
Art Unit: 2122
Tech Center: 2100 (Computer Architecture & Software)
Assignee: DENSO CORPORATION
OA Round: 2 (Final)
This examiner grants 60% of cases after interview (+50.0% interview lift, based on 5 resolved cases, 2023–2026). A telephonic interview to clarify the technical implementation could significantly improve the outcome.
Examiner Intelligence

ILES, TYLER EDWARD
Career allowance rate: 60% of resolved cases (3 granted / 5 resolved), +5.0% vs TC avg
Interview lift: +50.0% (resolved cases with interview vs. without)
Typical timeline: 3y 8m average prosecution
Currently pending: 9
Statute-Specific Performance

§101: 13.2% (-26.8% vs TC avg)
§103: 84.2% (+44.2% vs TC avg)
§102: 2.6% (-37.4% vs TC avg)
Based on career data from 5 resolved cases
Office Action

DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. This action is in response to an amendment filed on December 30th, 2025. Claims 1-5, 8-12, and 15-19 are pending in the current application.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

	Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.

	Regarding claim 1, Under Step 1 of the Subject Matter Eligibility Test of Products and Processes, the claim is directed towards a process which is one of the four statutory categories.
Next, under a Step 2A Prong 1 Analysis, the claim recites:
sampling states over a whole state space
determining when sampled states are out of distribution, out of distribution being a state that is not within the dataset (D)
assigning probability weights to the sampled states that are out of distribution, the probability weights assigned to the sampled states that are out of distribution are greater than probability weights assigned to in-distribution states that are within the dataset (D)
updating the Q-function to generate a fitted Q-function by solving an optimization problem with a minimization term and a maximization term, wherein the minimization term penalizes an overall expected reward when a present state is out of distribution and the maximization term that cancels the minimization term when the present state is one of the in-distribution states
estimating a Q-value using the fitted Q-function by estimating the overall expected reward assuming the agent is in the present state and performs a present action
The “sampling states over a whole state space”, “determining when sampled states are out of distribution, out of distribution being a state that is not within the dataset (D)”, and “assigning probability weights to the sampled states that are out of distribution, the probability weights assigned to the sampled states that are out of distribution are greater than probability weights assigned to in-distribution states that are within the dataset (D)”, are processes that, under the broadest reasonable interpretation, fall under the “mental processes” grouping of abstract ideas, and the “Updating the Q-function to generate a fitted Q-function by solving an optimization problem with a minimization term and a maximization term, wherein the minimization term penalizes an overall expected reward when a present state is out of distribution and the maximization term that cancels the minimization term when the present state is one of the in-distribution states”, and “estimating a Q-value using the fitted Q-function by estimating the overall expected reward assuming the agent is in the present state and performs a present action”, are processes that fall under the “mathematical concepts” grouping of abstract ideas.
Therefore, we have to examine the claim under Step 2A prong 2, which considers the additional elements within the claim. The claim’s additional elements are:
performing offline reinforcement learning to iteratively update an agent that possesses a policy and a Q-function based on a dataset (D), the dataset (D) having in-distribution states
updating the policy according to an existing reinforcement learning algorithm.
controlling a robotic device using the policy
“performing offline reinforcement learning to iteratively update an agent that possesses a policy and a Q-function based on a dataset, the dataset (D) having in-distribution states”, and “updating the policy according to an existing reinforcement learning algorithm” merely indicate the field of use and technological environment in which to apply the judicial exception, “generally linking” the abstract idea to generically performing offline reinforcement learning to update an agent using a Q-function based on a dataset with in-distribution states and out-of-distribution states. (See MPEP 2106.05(h)) The limitation, “controlling a robotic device using the policy”, is considered to be insignificant extra-solution activity. (See MPEP 2106.05(g)) Therefore, these additional elements do not integrate the abstract idea into a practical application. The claim is directed to an abstract idea.
Under a Step 2B analysis, the claim’s additional elements do not amount to significantly more than the judicial exception as explained above in Step 2A prong 2. Additionally, “controlling a robotic device using the policy” is considered well-understood, routine, and conventional, as evidenced by Khan et al. (From Graph Policy Gradients for Large Scale Robot Control, “In the recent past, deep learning has proved to be an extremely valuable tool for robotics. Harnessing the power of deep neural networks has emerged as a successful approach to designing policies that map sensor inputs to control outputs for complex tasks.”, pg. 1, first paragraph under “Introduction”) (Policies for robots have been well-understood, routine, and conventional at least since 2021.) Therefore, the claim is ineligible.
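
For orientation only, the three limitations grouped above under “mental processes” (sampling states, determining which are out of distribution, and assigning larger probability weights to those states) can be sketched in a few lines of Python. This is an illustrative sketch and not the applicant's implementation: the scalar state space, the distance tolerance `tol`, and the weight values `ood_weight` and `in_weight` are all assumptions introduced here, and the claim itself defines out of distribution only as “not within the dataset (D)”.

```python
import random


def sample_states(low, high, n, seed=0):
    """Draw n scalar states uniformly from the whole state space [low, high]."""
    rng = random.Random(seed)
    return [rng.uniform(low, high) for _ in range(n)]


def is_out_of_distribution(state, dataset_states, tol=0.1):
    """Assumed test: a state is OOD if no dataset state lies within tol of it."""
    return all(abs(state - s) > tol for s in dataset_states)


def assign_weights(states, dataset_states, ood_weight=3.0, in_weight=1.0, tol=0.1):
    """Give OOD states a larger raw weight, then normalize so the weights sum to 1."""
    raw = [
        ood_weight if is_out_of_distribution(s, dataset_states, tol) else in_weight
        for s in states
    ]
    total = sum(raw)
    return [w / total for w in raw]


# Hypothetical dataset and hand-picked states (two near the data, two far from it).
dataset_states = [0.0, 0.1, 0.2, 0.3]
states = [0.05, 0.25, 0.9, 1.5]
weights = assign_weights(states, dataset_states)
```

In this toy setup the two states far from the dataset receive three times the weight of the two nearby states, which mirrors the claimed relationship between out-of-distribution and in-distribution weights.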

	Regarding claim 8, Under Step 1 of the Subject Matter Eligibility Test of Products and Processes, the claim is directed towards a system which is considered a machine, which is one of the four statutory categories.
Next, under a Step 2A Prong 1 Analysis, the claim recites:
sample states over a whole state space
determine when sampled states are out of distribution, out of distribution being a state that is not within the dataset (D)
assign probability weights to the sampled states that are out of distribution, the probability weights assigned to the sampled states that are out of distribution are greater than probability weights assigned to in-distribution states that are within the dataset (D)
update the Q-function to generate a fitted Q-function by solving an optimization problem with a minimization term and a maximization term, wherein the minimization term penalizes an overall expected reward when a present state is out of distribution and the maximization term that cancels the minimization term when the present state is one of the in-distribution states
estimate a Q-value using the fitted Q-function by estimating the overall expected reward assuming the agent is in the present state and performs a present action
To “sample states over a whole state space”, “determine when sampled states are out of distribution, out of distribution being a state that is not within the dataset (D)”, and “assign probability weights to the sampled states that are out of distribution, the probability weights assigned to the sampled states that are out of distribution are greater than probability weights assigned to in-distribution states that are within the dataset (D)”, are processes that, under the broadest reasonable interpretation, fall under the “mental processes” grouping of abstract ideas, and to “update the Q-function to generate a fitted Q-function by solving an optimization problem with a minimization term and a maximization term, wherein the minimization term penalizes an overall expected reward when a present state is out of distribution and the maximization term that cancels the minimization term when the present state is one of the in-distribution states”, and “estimate a Q-value using the fitted Q-function by estimating the overall expected reward assuming the agent is in the present state and performs a present action”, are processes that fall under the “mathematical concepts” grouping of abstract ideas.
Therefore, we have to examine the claim under Step 2A prong 2, which considers the additional elements within the claim. The claim’s additional elements are:
A processor
A memory in communication with the processor and storing an offline learning module
performing offline reinforcement learning to iteratively update an agent that possesses a policy and a Q-function based on a dataset (D), the dataset (D) having in-distribution states
and update the policy according to an existing reinforcement learning algorithm.
controlling a robotic device using the policy
The processor, memory, “performing offline reinforcement learning to iteratively update an agent that possesses a policy and a Q-function based on a dataset (D), the dataset (D) having in-distribution states”, and to “update the policy according to an existing reinforcement learning algorithm” are limitations that, as drafted, merely indicate the field of use and technological environment in which to apply the judicial exception, “generally linking” the limitations to the judicial exception. (See MPEP 2106.05(h)) The limitation, “controlling a robotic device using the policy”, is considered to be insignificant extra-solution activity. (See MPEP 2106.05(g)) Therefore, these additional elements do not integrate the abstract idea into a practical application. The claim is directed to an abstract idea.
Under a Step 2B analysis, the claim’s additional elements do not amount to significantly more than the judicial exception as explained above in Step 2A prong 2. Additionally, “controlling a robotic device using the policy” is considered well-understood, routine, and conventional, as evidenced by Khan et al. (From Graph Policy Gradients for Large Scale Robot Control, “In the recent past, deep learning has proved to be an extremely valuable tool for robotics. Harnessing the power of deep neural networks has emerged as a successful approach to designing policies that map sensor inputs to control outputs for complex tasks.”, pg. 1, first paragraph under “Introduction”) (Policies for robots have been well-understood, routine, and conventional at least since 2021.) Therefore, the claim is ineligible.

	Regarding claim 15, Under Step 1 of the Subject Matter Eligibility Test of Products and Processes, the claim is directed towards a non-transitory computer-readable medium which is considered a manufacture, which is one of the four statutory categories.
Next, under a Step 2A Prong 1 Analysis, the claim recites:
sample states over a whole state space
determine when sampled states are out of distribution, out of distribution being a state that is not within the dataset (D)
assign probability weights to the sampled states that are out of distribution, the probability weights assigned to the sampled states that are out of distribution are greater than probability weights assigned to in-distribution states that are within the dataset (D)
update the Q-function to generate a fitted Q-function by solving an optimization problem with a minimization term and a maximization term, wherein the minimization term penalizes an overall expected reward when a present state is out of distribution and the maximization term that cancels the minimization term when the present state is one of the in-distribution states
estimate a Q-value using the fitted Q-function by estimating the overall expected reward assuming the agent is in the present state and performs a present action
To “sample states over a whole state space”, “determine when sampled states are out of distribution, out of distribution being a state that is not within the dataset (D)”, and “assign probability weights to the sampled states that are out of distribution, the probability weights assigned to the sampled states that are out of distribution are greater than probability weights assigned to in-distribution states that are within the dataset (D)”, are processes that, under the broadest reasonable interpretation, fall under the “mental processes” grouping of abstract ideas, and to “update the Q-function to generate a fitted Q-function by solving an optimization problem with a minimization term and a maximization term, wherein the minimization term penalizes an overall expected reward when a present state is out of distribution and the maximization term that cancels the minimization term when the present state is one of the in-distribution states”, and “estimate a Q-value using the fitted Q-function by estimating the overall expected reward assuming the agent is in the present state and performs a present action”, are processes that fall under the “mathematical concepts” grouping of abstract ideas.
Therefore, we have to examine the claim under Step 2A prong 2, which considers the additional elements within the claim. The claim’s additional elements are:
A processor
performing offline reinforcement learning to iteratively update an agent that possesses a policy and a Q-function based on a dataset (D), the dataset (D) having in-distribution states
update the policy according to an existing reinforcement learning algorithm.
and control a robotic device using the policy
The processor, “performing offline reinforcement learning to iteratively update an agent that possesses a policy and a Q-function based on a dataset (D), the dataset (D) having in-distribution states”, and to “update the policy according to an existing reinforcement learning algorithm” are limitations that, as drafted, merely indicate the field of use and technological environment in which to apply the judicial exception, “generally linking” the limitations to the judicial exception. (See MPEP 2106.05(h)) The limitation, “controlling a robotic device using the policy”, is considered to be insignificant extra-solution activity. (See MPEP 2106.05(g)) Therefore, these additional elements do not integrate the abstract idea into a practical application. The claim is directed to an abstract idea.
Under a Step 2B analysis, the claim’s additional elements do not amount to significantly more than the judicial exception as explained above in Step 2A prong 2. Additionally, “controlling a robotic device using the policy” is considered well-understood, routine, and conventional, as evidenced by Khan et al. (From Graph Policy Gradients for Large Scale Robot Control, “In the recent past, deep learning has proved to be an extremely valuable tool for robotics. Harnessing the power of deep neural networks has emerged as a successful approach to designing policies that map sensor inputs to control outputs for complex tasks.”, pg. 1, first paragraph under “Introduction”) (Policies for robots have been well-understood, routine, and conventional at least since 2021.) Therefore, the claim is ineligible.

	Regarding claims 2, 9, 16, the claims recite “the fitted Q-function is a conservative Q-function that lower-bounds an actual Q-function corresponding to an underlying Markov Decision Process in the dataset (D).” As drafted, the limitation merely indicates the field of use and technological environment of the judicial exception, “generally linking” the limitation to the judicial exception. (See MPEP 2106.05(h)) Therefore, the claims are not eligible under 35 U.S.C. 101 for the same reasons as set forth in the rejection of claims 1, 8, and 15.

	Regarding claims 3, 10, 17, the claims recite “training a bag of dynamics models (P1,P2,...,P) according to transitions within a dataset (D) to output an uncertainty estimation model that indicates when the present state is the out of distribution state.” As drafted, the limitation is considered to be mere instructions to apply the judicial exception. (See MPEP 2106.05(f)) Therefore, the claims are not eligible under 35 U.S.C. 101 for the same reasons as set forth in the rejection of claims 1, 8, and 15.
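
As a rough illustration of what a “bag of dynamics models” used for uncertainty estimation can look like, the sketch below fits several one-dimensional linear dynamics models on bootstrap resamples of the dataset's transitions and uses the spread of their predictions as the uncertainty signal. Everything here is an assumption made for illustration: the linear model class, the bootstrap resampling, the synthetic transitions, and the helper names `fit_linear`, `train_bag`, and `disagreement` are hypothetical and are not taken from the application or the cited art.

```python
import random
import statistics


def fit_linear(transitions):
    """Least-squares fit of next_state = a * state + b from (state, next_state) pairs."""
    xs = [s for s, _ in transitions]
    ys = [n for _, n in transitions]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var if var > 0 else 0.0
    return a, my - a * mx


def train_bag(transitions, n_models=5, seed=0):
    """Train each model on its own bootstrap resample of the transitions."""
    rng = random.Random(seed)
    return [
        fit_linear(rng.choices(transitions, k=len(transitions)))
        for _ in range(n_models)
    ]


def disagreement(bag, state):
    """Uncertainty estimate: standard deviation of the models' next-state predictions."""
    return statistics.pstdev([a * state + b for a, b in bag])


# Synthetic transitions covering states in [0, 1] with small Gaussian noise.
_rng = random.Random(1)
transitions = [(i / 29, 0.9 * (i / 29) + _rng.gauss(0.0, 0.05)) for i in range(30)]
bag = train_bag(transitions)
near = disagreement(bag, 0.5)   # state inside the data's coverage
far = disagreement(bag, 10.0)   # state far outside the data's coverage
```

The models agree closely where the transitions constrain them and diverge when extrapolating, so a threshold on `disagreement` (the threshold value would itself be a design choice) could serve as the indicator of an out-of-distribution state.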

Regarding claims 4, 11, 18, the claims recite the minimizing term is expressed as:

[Equation image media_image1.png; not reproduced in this text]

wherein s is the present state, a is the action, $\hat{\pi}^{k}$ the policy, Q is the Q-value, and $d_{\phi}$ is a distribution that assigns probabilities to states outside the dataset (D); and the maximizing term is expressed as:

[Equation image media_image2.png; not reproduced in this text]

wherein s is the present state, a is the action, $\hat{\pi}^{k}$ the policy, Q is the Q-value, and $d_{\phi}$ is a marginal distribution of states in the dataset (D). The limitation as drafted is considered to be a “mathematical concept”, which is an abstract idea. Therefore, the claims are not eligible under 35 U.S.C. 101 for the same reasons as set forth in the rejection of claims 1, 8, and 15.
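
Because the claimed expressions survive here only as image references, the following sketch does not reproduce them. It assumes a generic conservative-Q-learning-style form, in which the minimization term is an expectation of Q under a state distribution that favors out-of-distribution states and the maximization term is the same expectation under the dataset's state marginal. The tabular setting and the names `expectation`, `pessimism_penalty`, `d_phi`, and `d_data` are hypothetical and chosen for illustration.

```python
def expectation(q, policy, state_dist):
    """E_{s ~ state_dist, a ~ policy(.|s)}[Q(s, a)] over finite state and action sets."""
    return sum(
        p_s * sum(p_a * q[(s, a)] for a, p_a in policy[s].items())
        for s, p_s in state_dist.items()
    )


def pessimism_penalty(q, policy, d_phi, d_data):
    """Assumed penalty: expectation under d_phi minus expectation under d_data.

    Minimizing this quantity pushes Q down where d_phi places more mass than
    the dataset marginal and leaves it unchanged where the two agree.
    """
    return expectation(q, policy, d_phi) - expectation(q, policy, d_data)


# Toy problem: one in-distribution state, one out-of-distribution state, two actions.
q = {("s_in", "a0"): 1.0, ("s_in", "a1"): 3.0,
     ("s_ood", "a0"): 4.0, ("s_ood", "a1"): 6.0}
policy = {"s_in": {"a0": 0.5, "a1": 0.5}, "s_ood": {"a0": 0.5, "a1": 0.5}}
d_data = {"s_in": 1.0, "s_ood": 0.0}   # dataset marginal: only the in-distribution state
d_phi = {"s_in": 0.2, "s_ood": 0.8}    # assumed distribution favoring the OOD state

penalty = pessimism_penalty(q, policy, d_phi, d_data)
no_penalty = pessimism_penalty(q, policy, d_data, d_data)
```

Under these assumptions the penalty is nonzero only because `d_phi` shifts probability mass onto the out-of-distribution state; when the two distributions coincide, the two terms cancel exactly.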

Regarding claims 5, 12, 19, the claims recite the fitted Q-function further includes a regularization term. The limitation as drafted is considered to be a “mathematical concept”, which is an abstract idea. Therefore, the claims are not eligible under 35 U.S.C. 101 for the same reasons as set forth in the rejection of claims 4, 11, and 18.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1-5, 7-12, and 14-19 is/are rejected under 35 U.S.C. 103 as being unpatentable over Sergey Levine et al. (Herein referred to as Levine) (Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems) (Cited in the IDS filed on October 19th, 2022) in view of Chenjia Bai et al. (Herein referred to as Bai) (PESSIMISTIC BOOTSTRAPPING FOR UNCERTAINTY DRIVEN OFFLINE REINFORCEMENT LEARNING)

Regarding claim 1, Levine teaches a method for performing offline reinforcement learning to iteratively update an agent that possesses a policy and a Q-function based on a dataset (D) (“This Bellman operator has a unique fixed point that corresponds to the true Q-function for the policy π(a|s)… the learning algorithm is provided with a static dataset of transitions, D… and must learn the best policy it can using this dataset.”, pg. 5, fourth paragraph; pg. 7, under “2.2 Offline Reinforcement Learning”) the dataset (D) having in-distribution states (“If all states and actions fed into the Q-function for target value calculations are always in-distribution with respect to the Q-function training set, errors in the Q-function should not accumulate, and standard generalization results from supervised learning should apply.”, pgs. 19-20, under “4.3 Policy Constraints for Off-Policy Evaluation and Improvement”) the method comprising steps of: sampling states over a whole state space (“…expectations under state distributions from d^{π_β}(s) can be calculated easily by sampling states from the dataset D in the offline case, removing the need for importance sampling.”, pgs. 12-13, under “3.3 Approximate Off-Policy Policy Gradients”) determining when sampled states are out of distribution, out of distribution being a state that is not within the dataset (D) (“the learned policy π(a|s) may enter into states that are far outside of its training distribution, since d^π(s) may be very different from d^{π_β}(s). In these out-of-distribution states, the generalization error bound no longer holds, since standard empirical risk minimization makes no guarantees about error when encountering out-of-distribution inputs that were not seen during training.”, pg. 10, fifth paragraph) updating the Q-function to generate a fitted Q-function by solving an optimization problem with a minimization term and a maximization term, wherein the minimization term penalizes an overall expected reward when a present state is out of distribution and the maximization term that cancels the minimization term when the present state is one of the in-distribution states (“Intuitively, this penalty minimizes the Q values at all of the states in the buffer, for actions selected according to the distribution µ(a|s)… A simple modification, which we refer to as C_CQL1(B, φ), is to also add a value maximization term to balance out the minimization term under µ(a|s)…”, pg. 24, first and third paragraph, See also Levine’s Equation 1 below) estimating a Q-value using the fitted Q-function by estimating the overall expected reward assuming the agent is in the present state and performs a present action (“A value function provides an estimate of the expected cumulative reward that will be obtained by following some policy π(a_t|s_t) when starting from a given state s_t, in the case of the state-value function V^π(s_t), or when starting from a state-action tuple (s_t, a_t), in the case of the state-action value function Q^π(s_t, a_t).”, pg. 5, first paragraph, under “Approximate dynamic programming.”) updating the policy according to an existing reinforcement learning algorithm, (“We will use D… to denote the set of transitions that are available for the agent to use for updating the policy (“learning”), which may consist of either all transitions seen so far, or some subset thereof.”, pg. 4, second paragraph) and controlling a robotic device using the policy. (“Pure model-based methods in robotics typically involve training a model on real or simulated data, and then planning within the model to produce a policy that is executed on a real system.”, pg. 30, bottom paragraph of “6.2 Applications in Robotics”)
However, Levine does not teach assigning probability weights to the sampled states that are out of distribution, the probability weights assigned to the sampled states that are out of distribution are greater than probability weights assigned to in-distribution states that are within the dataset.
Bai teaches assigning probability weights to the sampled states that are out of distribution, the probability weights assigned to the sampled states that are out of distribution are greater than probability weights assigned to in-distribution states that are within the dataset (“The OOD datapoint consists of states sampled from the training buffer, the corresponding OOD actions sampled from the current policy, and the corresponding OOD target based on the estimated Q-function and uncertainty quantification… Specifically, we sample OOD states from the in-distribution dataset Din.”, pg. 2, second paragraph; pg. 4, third and fourth paragraph) (The uncertainty quantification of Bai indicates, and corresponds to, a probability weight that is assigned to specifically sampled out-of-distribution states. In combination with Levine, which assigns greater weight to OOD state-action pairs, it would be easy to configure the combination to assign greater weight to the OOD states of Bai using the method of doing so, as disclosed by Levine.)
Therefore, it would have been considered obvious to one of ordinary skill in the art, prior to the filing date of the current application, to combine the offline reinforcement learning of Levine with the sampling and weighting of out-of-distribution states, as disclosed by Bai. One would have been motivated to combine the teaching, prior to the application’s filing date, as this allows for pessimistic bootstrapping reinforcement learning, which yields better performance, as disclosed in Bai. (“PBRL conducts uncertainty quantification via the disagreement of bootstrapped Q-functions, and performs pessimistic updates by penalizing the value function based on the estimated uncertainty. To tackle the extrapolating error, we further propose a novel OOD sampling method. We show that such OOD sampling and pessimistic bootstrapping yields a provable uncertainty quantifier in linear MDPs, thus providing the theoretical underpinning for PBRL. Extensive experiments on D4RL benchmark show that PBRL has better performance compared to the state-of-the-art algorithms.”, pg. 1, Abstract)
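
The passage quoted from Bai describes quantifying uncertainty through the disagreement of bootstrapped Q-functions and penalizing the value estimate accordingly. A minimal sketch of that general idea follows. It is not Bai's algorithm as published: the mean-minus-scaled-standard-deviation form, the coefficient `beta`, and the toy ensemble (built to agree for states at or below 1.0 and diverge beyond it) are assumptions made here for illustration.

```python
import statistics


def pessimistic_value(q_ensemble, state, action, beta=1.0):
    """Ensemble mean of Q(s, a) minus beta times the ensemble's standard deviation."""
    values = [q(state, action) for q in q_ensemble]
    return statistics.fmean(values) - beta * statistics.pstdev(values)


def make_member(k):
    """Hypothetical ensemble member: all members agree for state <= 1.0, diverge beyond."""
    return lambda state, action: 1.0 + k * max(0.0, state - 1.0)


q_ensemble = [make_member(k) for k in (-1.0, 0.0, 1.0)]

in_dist_value = pessimistic_value(q_ensemble, 0.5, 0)  # members agree, no penalty
ood_value = pessimistic_value(q_ensemble, 3.0, 0)      # members disagree, value is reduced
```

Where the ensemble members agree the estimate is left untouched, and where they disagree the estimate is lowered in proportion to the disagreement, which is the sense in which the update is pessimistic.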

	Regarding claim 8, Levine teaches a system for performing offline reinforcement learning to iteratively update an agent that possesses a policy and a Q-function based on a dataset (D), (“This Bellman operator has a unique fixed point that corresponds to the true Q-function for the policy π(a|s)… the learning algorithm is provided with a static dataset of transitions, D… and must learn the best policy it can using this dataset.”, pg. 5, paragraph 4; pg. 7, under “2.2 Offline Reinforcement Learning”) the dataset (D) having in-distribution states, (“If all states and actions fed into the Q-function for target value calculations are always in-distribution with respect to the Q-function training set, errors in the Q-function should not accumulate, and standard generalization results from supervised learning should apply.”, pgs. 19-20, under “4.3 Policy Constraints for Off-Policy Evaluation and Improvement”) the system comprising: a processor; and a memory in communication with the processor and storing an offline learning module (Despite Levine never explicitly mentioning these components, you would implicitly need these components to execute Levine’s method.) instructions that, when executed by the processor, cause the processor to: sample states over a whole state space, (“…expectations under state distributions from d^{π_β}(s) can be calculated easily by sampling states from the dataset D in the offline case, removing the need for importance sampling.”, pgs. 12-13, under “3.3 Approximate Off-Policy Policy Gradients”) determine when sampled states are out of distribution, out of distribution being a state that is not within the dataset (D), (“the learned policy π(a|s) may enter into states that are far outside of its training distribution, since d^π(s) may be very different from d^{π_β}(s). In these out-of-distribution states, the generalization error bound no longer holds, since standard empirical risk minimization makes no guarantees about error when encountering out-of-distribution inputs that were not seen during training.”, pg. 10, fifth paragraph) update the Q-function to generate a fitted Q-function by solving an optimization problem with a minimization term and a maximization term, wherein the minimization term penalizes an overall expected reward when a present state is out of distribution and the maximization term that cancels the minimization term when the present state is one of the in-distribution states, (“Intuitively, this penalty minimizes the Q values at all of the states in the buffer, for actions selected according to the distribution µ(a|s)… A simple modification, which we refer to as C_CQL1(B, φ), is to also add a value maximization term to balance out the minimization term under µ(a|s)…”, pg. 24, first and third paragraph, See also Levine’s Equation 1 below) (The penalty applies to all states in the buffer, which would include the present state.) estimate a Q-value using the fitted Q-function by estimating the overall expected reward assuming the agent is in the present state and performs a present action, (“A value function provides an estimate of the expected cumulative reward that will be obtained by following some policy π(a_t|s_t) when starting from a given state s_t, in the case of the state-value function V^π(s_t), or when starting from a state-action tuple (s_t, a_t), in the case of the state-action value function Q^π(s_t, a_t).”, pg. 5, first paragraph, under “Approximate dynamic programming”) update the policy according to an existing reinforcement learning algorithm, (“We will use D… to denote the set of transitions that are available for the agent to use for updating the policy (“learning”), which may consist of either all transitions seen so far, or some subset thereof.”, pg. 4, second paragraph) and control a robotic device using the policy. (“Pure model-based methods in robotics typically involve training a model on real or simulated data, and then planning within the model to produce a policy that is executed on a real system.”, pg. 30, bottom paragraph of “6.2 Applications in Robotics”)
However, Levine does not teach to assign probability weights to the sampled states that are out of distribution, the probability weights assigned to the sampled states that are out of distribution are greater than probability weights assigned to in-distribution states that are within the dataset.
Bai teaches to assign probability weights to the sampled states that are out of distribution, the probability weights assigned to the sampled states that are out of distribution are greater than probability weights assigned to in-distribution states that are within the dataset (“The OOD datapoint consists of states sampled from the training buffer, the corresponding OOD actions sampled from the current policy, and the corresponding OOD target based on the estimated Q-function and uncertainty quantification… Specifically, we sample OOD states from the in-distribution dataset Din.”, pg. 2, second paragraph; pg. 4, third and fourth paragraph) (The uncertainty quantification of Bai indicates, and corresponds to, a probability weight that is assigned to specifically sampled out-of-distribution states. In combination with Levine, which assigns greater weight to OOD state-action pairs, it would be easy to configure the combination to assign greater weight to the OOD states of Bai using the method of doing so, as disclosed by Levine.)
Therefore, it would have been considered obvious to one of ordinary skill in the art, prior to the filing date of the current application, to combine the offline reinforcement learning of Levine with the sampling and weighting of out-of-distribution states, as disclosed by Bai. One would have been motivated to combine the teaching, prior to the application’s filing date, as this allows for pessimistic bootstrapping reinforcement learning, which yields better performance, as disclosed in Bai. (“PBRL conducts uncertainty quantification via the disagreement of bootstrapped Q-functions, and performs pessimistic updates by penalizing the value function based on the estimated uncertainty. To tackle the extrapolating error, we further propose a novel OOD sampling method. We show that such OOD sampling and pessimistic bootstrapping yields a provable uncertainty quantifier in linear MDPs, thus providing the theoretical underpinning for PBRL. Extensive experiments on D4RL benchmark show that PBRL has better performance compared to the state-of-the-art algorithms.”, pg. 1, Abstract)

	Regarding claim 15, Levine teaches a non-transitory computer-readable medium storing instructions for performing offline reinforcement learning (Despite Levine never explicitly mentioning a non-transitory computer-readable medium, you would implicitly need one to distribute the method of Levine.) to iteratively update an agent that possesses a policy and a Q-function based on a dataset (D) (“This Bellman operator has a unique fixed point that corresponds to the true Q-function for the policy π(a|s)… the learning algorithm is provided with a static dataset of transitions, D… and must learn the best policy it can using this dataset.”, pg. 5, paragraph 4; pg. 7, under “2.2 Offline Reinforcement Learning”) the dataset (D) having in-distribution states, (“If all states and actions fed into the Q-function for target value calculations are always in-distribution with respect to the Q-function training set, errors in the Q-function should not accumulate, and standard generalization results from supervised learning should apply.”, pgs. 19-20, under “4.3 Policy Constraints for Off-Policy Evaluation and Improvement”) the instructions, when executed by a processor, cause the processor to: sample states over a whole state space (“…expectations under state distributions from d πβ (s) can be calculated easily by sampling states from the dataset D in the offline case, removing the need for importance sampling.”, pgs. 12-13, under “3.3 Approximate Off-Policy Policy Gradients”) determine when sampled states are out of distribution, out of distribution being a state that is not within the dataset (D), (“the learned policy π(a|s) may enter into states that are far outside of its training distribution, since d π (s) may be very different from d πβ (s). 
In these out-of-distribution states, the generalization error bound no longer holds, since standard empirical risk minimization makes no guarantees about error when encountering out-of-distribution inputs that were not seen during training.”, pg. 10, fifth paragraph) update the Q-function to generate a fitted Q-function by solving an optimization problem with a minimization term and a maximization term, wherein the minimization term penalizes an overall expected reward when a present state is out of distribution and the maximization term that cancels the minimization term when the present state is one of the in-distribution states, (“Intuitively, this penalty minimizes the Q values at all of the states in the buffer, for actions selected according to the distribution µ(a|s)… A simple modification, which we refer to as CCQL1 (B, φ), is to also add a value maximization term to balance out the minimization term under µ(a|s)…”, pg. 24, first and third paragraph, See also Levine’s Equation 1 below) estimate a Q-value using the fitted Q-function by estimating the overall expected reward assuming the agent is in the present state and performs a present action, (“A value function provides an estimate of the expected cumulative reward that will be obtained by following some policy π(at|st) when starting from a given state st, in the case of the state-value function V π (st), or when starting from a state-action tuple (st, at), in the case of the state-action value function Qπ (st, at).”, pg. 5, first paragraph, under “Approximate dynamic programming”) update the policy according to an existing reinforcement learning algorithm, (“We will use D… to denote the set of transitions that are available for the agent to use for updating the policy (“learning”), which may consist of either all transitions seen so far, or some subset thereof.”, pg. 4, second paragraph) and control a robotic device using the policy. 
(“Pure model-based methods in robotics typically involve training a model on real or simulated data, and then planning within the model to produce a policy that is executed on a real system.”, pg. 30, bottom paragraph of “6.2 Applications in Robotics”)
However, Levine does not teach to assign probability weights to the sampled states that are out of distribution the probability weights assigned to the sampled states that are out of distribution are greater than probability weights assigned to in-distribution states that are within the dataset. 
Bai teaches to assign probability weights to the sampled states that are out of distribution the probability weights assigned to the sampled states that are out of distribution are greater than probability weights assigned to in-distribution states that are within the dataset (“The OOD datapoint consists of states sampled from the training buffer, the corresponding OOD actions sampled from the current policy, and the corresponding OOD target based on the estimated Q-function and uncertainty quantification… Specifically, we sample OOD states from the in-distribution dataset Din.”, pg. 2, second paragraph; pg. 4, third and fourth paragraph) (The uncertainty quantification of Bai indicates, and corresponds to, a probability weight that is assigned to specifically sampled out-of-distribution states. In combination with Levine, which assigns greater weight to OOD state-action pairs, it would be easy to configure the combination to assign greater weight to the OOD states of Bai using the method of doing so, as disclosed by Levine.)
Therefore, it would have been considered obvious to one of ordinary skill in the art, prior to the filing date of the current application, to combine the offline reinforcement learning of Levine with the sampling and weighting of out-of-distribution states, as disclosed by Bai. One would have been motivated to combine the teaching, prior to the application’s filing date, as this allows for pessimistic bootstrapping reinforcement learning, which yields better performance, as disclosed in Bai. (“PBRL conducts uncertainty quantification via the disagreement of bootstrapped Q-functions, and performs pessimistic updates by penalizing the value function based on the estimated uncertainty. To tackle the extrapolating error, we further propose a novel OOD sampling method. We show that such OOD sampling and pessimistic bootstrapping yields a provable uncertainty quantifier in linear MDPs, thus providing the theoretical underpinning for PBRL. Extensive experiments on D4RL benchmark show that PBRL has better performance compared to the state-of-the-art algorithms.”, pg. 1, Abstract)

Regarding claims 2, 9, and 16, Levine, as modified by Bai, teaches the method, system, and medium of claims 1, 8, and 15 respectively, as well as, the fitted Q-function is a conservative Q-function that lower-bounds an actual Q-function corresponding to an underlying Markov Decision Process in the dataset (D). (“…a Q-function trained with this conservative penalty will represent a lower bound on the true Q-function… Besides the worst-case dependence on the horizon, an open question that still remains is the development of constraints that can effectively trade off error accumulation and suboptimality of the learned policy in most “average”-case MDP instances, and can be easily enforced and optimized in practice via standard optimization techniques without requiring additional function approximators to fit the behavior policy”, pg. 24, first paragraph; pg. 26, first paragraph) (This is all talked about under the section called “4.5 Conservative Q-Learning and Pessimistic Value Functions”, which mentions a Q-function as a lower bound, with the policy found within pertaining to an MDP in the dataset.)

Regarding claims 3, 10, and 17, Levine, as modified by Bai, teaches the method, system, and medium of claims 1, 8, and 15 respectively, as well as training a bag of dynamics models (P1,P2,...,P) (“offline reinforcement learning employs a dataset D collected by some (potentially unknown) behavior policy πβ. The dataset is collected once, and is not altered during training, which makes it feasible to use large previous collected datasets. The training process does not interact with the MDP at all, and the policy is only deployed after being fully trained… Reinforcement learning addresses the problem of learning to control a dynamical system, in a general sense.”, pg. 2, Figure 1; pg. 3, under “2.1 Reinforcement Learning Preliminaries”) according to transitions within a dataset (D) to output an uncertainty estimation model that indicates when the present state is the out of distribution state. (“The intuition behind these uncertainty-based methods is that, if we can estimate the epistemic uncertainty of the Q-function, we expect this uncertainty to be substantially larger for out-of-distribution actions”, pg. 23, first paragraph under “4.4 Offline Approximate Dynamic Programming with Uncertainty Estimation”)
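As a rough illustration of the limitation discussed above, a bag of dynamics models can be fitted on bootstrap resamples of the dataset, and the spread of their predictions used as an uncertainty signal for a queried state. The sketch assumes a one-dimensional state, linear models, and synthetic data; `train_bag`, `disagreement`, and every constant are hypothetical and are not drawn from the application, Levine, or Bai.

```python
import numpy as np

def train_bag(states, next_states, n_models=10, seed=0):
    """Fit linear dynamics models on bootstrap resamples of the dataset."""
    rng = np.random.default_rng(seed)
    # Two features per state: the state itself and a constant for the intercept.
    features = np.column_stack([states, np.ones_like(states)])
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(states), size=len(states))
        coef, *_ = np.linalg.lstsq(features[idx], next_states[idx], rcond=None)
        models.append(coef)
    return np.array(models)  # shape (n_models, 2): slope, intercept

def disagreement(models, state):
    """Standard deviation of the models' next-state predictions."""
    preds = models[:, 0] * state + models[:, 1]
    return preds.std()

# Synthetic dataset: states cover [-1, 1] only.
rng = np.random.default_rng(1)
states = rng.uniform(-1.0, 1.0, size=200)
next_states = 0.9 * states + rng.normal(0.0, 0.1, size=200)

bag = train_bag(states, next_states)
u_in = disagreement(bag, 0.5)    # inside the range covered by the data
u_out = disagreement(bag, 10.0)  # far outside the data
print(u_in, u_out)
```

The models agree where the data constrains them and diverge when extrapolating, so a threshold on `disagreement` could serve as a simple out-of-distribution indicator under these assumptions.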

Regarding claims 4, 11, and 18, Levine, as modified by Bai teaches the method, system, and medium of claims 1, 8, and 15 respectively, as well as, the minimizing term is expressed as:

[Equation image: media_image1.png]

wherein s is the present state, a is the action, $\hat{\pi}^{k}$ is the policy, Q is the Q-value, and $d^{\phi}$ is a distribution that assigns probabilities to states outside the dataset (D); and the maximizing term is expressed as

[Equation image: media_image2.png]

wherein s is the present state, a is the action, $\hat{\pi}^{k}$ is the policy, Q is the Q-value, and $d^{\pi_{\beta}}$ is a marginal distribution of states in the dataset (D). (See Levine’s Equation 1 below, and on pg. 24. $\mu(a|s)$ is a distribution, and B is a batch of data.) (“This conservative penalty minimizes Q-values under the adversarially chosen µ(a|s) distribution, and maximizes the values for state-action tuples in the batch. Intuitively, this acts to ensure that high Q-values are only assigned to in-distribution actions. When µ(a|s) is equal to the behavior policy, the penalty is zero on average. While this penalty does not produce a Q-function that is a pointwise lower bound on the true Q-function, it is a lower bound in expectation under the current policy, thereby still providing appealing conservatism guarantees, while substantially reducing underestimation in practice. As shown by Kumar et al. (2020b), this variant produces the best performance in practice.”, pg. 24, fourth paragraph)

[Equation image: media_image3.png]

Levine’s Equation 1 on pg. 24
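The two-term penalty described in the quoted Levine passage can be sketched numerically. The example below assumes a tabular Q-function and a fixed distribution µ(a|s) rather than the adversarially chosen one, and it follows the action-based form in the quotation, not the state-based form recited in the claims. The function `conservative_penalty` and the toy values are hypothetical and do not reproduce the equation image.

```python
import numpy as np

def conservative_penalty(q_table, batch_states, batch_actions, mu):
    """E_{s~B, a~mu}[Q(s, a)] - E_{(s, a)~B}[Q(s, a)] for a tabular Q."""
    s = np.asarray(batch_states)
    a = np.asarray(batch_actions)
    minimized = (mu[s] * q_table[s]).sum(axis=1).mean()  # term pushed down under mu
    maximized = q_table[s, a].mean()                     # term pushed up on batch pairs
    return minimized - maximized

q_table = np.array([[1.0, 5.0],
                    [2.0, 0.0]])
batch_states = [0, 0, 1, 1]
batch_actions = [0, 0, 1, 0]

# Empirical behavior policy of the batch: state 0 always takes action 0,
# state 1 takes each action half of the time.
behavior = np.array([[1.0, 0.0],
                     [0.5, 0.5]])
# A distribution that favors the action never observed in state 0.
ood_heavy = np.array([[0.0, 1.0],
                      [0.5, 0.5]])

penalty_behavior = conservative_penalty(q_table, batch_states, batch_actions, behavior)
penalty_ood = conservative_penalty(q_table, batch_states, batch_actions, ood_heavy)
print(penalty_behavior, penalty_ood)  # 0.0 and 2.0
```

When µ matches the batch's own action frequencies the two terms cancel, consistent with the quoted statement that the penalty is zero on average for the behavior policy; shifting µ toward an unseen action leaves a positive penalty.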

Regarding claims 5, 12, and 19, Levine, as modified by Bai, teaches the method, system, and medium of claims 1, 8, and 15 respectively, as well as the fitted Q-function further includes a regularization term (“Practical off-policy algorithms derived from such estimators can also employ regularization.... One example of such a regularizer is the soft max over the (unnormalized) importance weights (Levine and Koltun, 2013).”, pg. 12, fifth paragraph; See also Levine’s Equation 2 below and on pg. 12)

[Equation image: media_image4.png]

Levine’s Equation 2 on pg. 12

Response to Arguments
Applicant's arguments filed December 30th, 2025 have been fully considered but they are not persuasive. Claims 6, 13, and 20 were rejected under 101 and 103 in the previous action, but have been cancelled in the current application, and as such the rejections associated with those claims have been withdrawn. Claims 7 and 14 were rejected under 101 and 102 in the previous action, but have been cancelled in the current application, and as such the rejections associated with those claims have been withdrawn. In light of the amended claims and applicant arguments, the previous 112 rejections of claims 1, 3, 5, 6, 8, 12, 13, 15, 19, and 20 have been overcome and are withdrawn. The applicant argues in substance:

Argument 1: Claims 1, 8, and 15 are directed to a specific technological improvement and are not drawn to an abstract idea. Specifically, the claims are directed to an improvement in the functioning and safety of robotic systems trained from static datasets.
The examiner respectfully disagrees. The claims do not appear to recite any improvement in the functionality and safety of robotic systems as purported by the applicant. The claim language is seemingly directed towards the improvement of an abstract idea rather than an improvement in a specific technology and generally links the improvement of an abstract idea to the field of robotic systems. Furthermore, the examiner notes that the claim language of claims 1, 8, and 15 does not recite any training step that would indicate an improvement in training or deployment in the field of robotic learning systems.

Argument 2: The claimed operations cannot be performed in the human mind, and therefore fall outside the category of mental processes.
The examiner respectfully disagrees. According to MPEP 2106.04(a)(2), claims can recite a mental process if the process can be performed with physical aids (such as pencil and paper) or on a generic computer, in a computing environment, or using the computer as a tool to perform the mental process. In this case, “sampling states over a whole state space”, “determining when sampled states are out of distribution out of distribution being a state that is not within the dataset (D)”, “assigning probability weights to the sampled states that are out of distribution”, and “controlling a robotic device using the policy”, are mental processes that can be performed in a computing environment. Therefore, the steps above fall inside the category of mental processes. It should be noted that limitations such as “Updating the Q-function to generate a fitted Q-function by solving an optimization problem with a minimization term and a maximization term, wherein the minimization term penalizes an overall expected reward when a present state is out of distribution and the maximization term that cancels the minimization term when the present state is one of the in-distribution states”, and “estimating a Q-value using the fitted Q-function by estimating the overall expected reward assuming the agent is in the present state and performs a present action”, are interpreted to be mathematical concepts more so than mental processes, but mathematical concepts are still considered to be a grouping of abstract ideas. As such, the claims are directed toward an abstract idea, and the 101 rejections are maintained.

Argument 3: Levine does not teach updating the Q-function to generate a fitted Q-function by solving an optimization problem with a minimization term and a maximization term, wherein the minimization term penalizes an overall expected reward when a present state is out of distribution and the maximization term that cancels the minimization term when the present state is one of the in-distribution states. Specifically, the independent claims require a two-part update rule, and adjust Q-values based on the present state being in or out of distribution rather than an action being in or out of distribution.

Applicant’s arguments with respect to claim(s) 1-5, 7-12, and 14-19 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
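The reply dates implied by the paragraph above can be worked out with simple calendar arithmetic. The sketch below assumes the May 13, 2026 mailing date shown in the prosecution timeline for this action; the helper `add_months` is hypothetical, and the calculation ignores weekend and holiday roll-over, extension fees, and the advisory-action adjustment described above.

```python
from datetime import date
import calendar

def add_months(d, months):
    """Add calendar months, clamping the day to the target month's length."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

mailing_date = date(2026, 5, 13)                    # assumed mailing date of the final action
two_month_mark = add_months(mailing_date, 2)        # first-reply window for the advisory-action rule
shortened_period_end = add_months(mailing_date, 3)  # THREE-MONTH shortened statutory period
statutory_limit = add_months(mailing_date, 6)       # SIX-MONTH outer limit

print(two_month_mark, shortened_period_end, statutory_limit)
# 2026-07-13 2026-08-13 2026-11-13
```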
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Tyler E Iles whose telephone number is (571)272-5442. The examiner can normally be reached 9:00am - 5:00pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki can be reached at (571) 272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/T.E.I./Patent Examiner, Art Unit 2122                                                                                                                                                                                                        
/KAKALI CHAKI/Supervisory Patent Examiner, Art Unit 2122
Prosecution Timeline

Oct 19, 2022: Application Filed
Oct 01, 2025: Non-Final Rejection mailed — §101, §103
Dec 30, 2025: Response Filed
May 13, 2026: Final Rejection mailed — §101, §103 (current)
Precedent Cases

Applications granted by this same examiner with similar technology

17/559,396
Patent 12619883
SYSTEMS AND METHODS FOR DETERMINING TIME-SERIES FEATURE IMPORTANCE OF A MODEL
4y 4m to grant Granted May 05, 2026
Study what changed to get past this examiner. Based on 1 most recent grants.
Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 60%
With Interview (+50.0%): 99%
Median Time to Grant: 3y 8m (~0m remaining)
PTA Risk: Moderate
Based on 5 resolved cases by this examiner. Grant probability derived from career allowance rate.