Prosecution Insights
Last updated: April 19, 2026
Application No. 18/046,564

METHOD AND DEVICE FOR REINFORCEMENT LEARNING

Final Rejection: §101, §102, §103, §112
Filed: Oct 14, 2022
Examiner: HAEFNER, KAITLYN RENEE
Art Unit: 2148
Tech Center: 2100 — Computer Architecture & Software
Assignee: Robert Bosch GmbH
OA Round: 2 (Final)
Grant Probability: 50% (Moderate)
OA Rounds: 3-4
To Grant: 4y 2m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 50% (grants 50% of resolved cases: 2 granted / 4 resolved; -5.0% vs TC avg)
Interview Lift: +66.7% (strong), comparing resolved cases with an interview vs. without
Avg Prosecution: 4y 2m (typical timeline)
Total Applications: 36 across all art units (32 currently pending)

Statute-Specific Performance

§101: 32.6% (-7.4% vs TC avg)
§103: 31.1% (-8.9% vs TC avg)
§102: 13.8% (-26.2% vs TC avg)
§112: 22.2% (-17.8% vs TC avg)

Tech Center averages are estimates; based on career data from 4 resolved cases.

Office Action

Rejection bases: §101, §102, §103, §112
DETAILED ACTION

This action is in response to the amendment filed 02/03/2026. Claims 1-12 are pending and have been examined.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Examiner's Note

Examiner notes that the independent claims recite two methods connected by an "or". For purposes of compact prosecution, Examiner has mapped the prior art to both methods recited in the independent claims.

Claim Objections

Claims 2 and 10 are objected to because of the following informalities: claims 2 and 10 are missing status identifiers. For purposes of examination, Examiner has interpreted claims 2 and 10 to be original. Appropriate correction is required.

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b):

(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 5 and 9-10 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or, for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.

Claim 5 recites "sampling parameters" in the first and second lines. It is unclear whether these parameters are the same as the parameter values recited in claim 1 or are new parameters. For purposes of examination, Examiner has interpreted them to be the same as the parameter values in claim 1.

Claim 5 also recites "the parameter values that are sampled from the distribution" in the last line. There is insufficient antecedent basis for this limitation in the claim: it is unclear whether these parameter values refer to the sampling parameters recited earlier in claim 5 or to the parameter values from claim 1. For purposes of examination, Examiner has interpreted them to be the same as the parameters previously recited in claim 5.

Claim 9 recites the limitation "parameters" in the last line. There is insufficient antecedent basis for this limitation in the claim: claim 1 recites "parameter values" and claim 9 recites "parameters," and it is unclear whether these are the same or different. For purposes of examination, Examiner has interpreted these parameters to be the same as the parameter values recited in claim 1.

Claim 10 is rejected for at least the same reasons as claim 9 because claim 10 depends from claim 9.

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows:

Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-12 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Regarding Claim 1:

Subject Matter Eligibility Analysis, Step 1: Claim 1 recites a method and is thus a process, one of the four statutory categories of patentable subject matter.

Subject Matter Eligibility Analysis, Step 2A Prong 1: Claim 1 recites:

determining a behavior policy depending on the policy (a mental process, as it encompasses a human mentally determining a behavior policy);

sampling a training data set with the behavior policy (a mental process, as it encompasses a human mentally sampling a data set);

determining an update for the parameter values with an objective function, wherein the objective function maps a difference between an estimate for an expected reward when following the policy and an estimate for a distance between the policy and the behavior policy, that depends on the policy and on the behavior policy, to the update, wherein the objective function corresponds to a lower bound that is determined from the training data (a mental process, as it encompasses a human mentally determining an update using an objective function); and

determining an update for the distribution with an objective function, wherein the objective function maps a difference between an expectancy value for an estimate for an expected reward when following the policy and an expectancy value for an estimate for a distance between the policy and the behavior policy, that depends on the policy and on the behavior policy, to the update, wherein the objective function corresponds to a lower bound that is determined from the training data (a mental process, as it encompasses a human mentally determining an update using an objective function).

Therefore, claim 1 recites an abstract idea.

Subject Matter Eligibility Analysis, Step 2A Prong 2: Claim 1 further recites the additional elements of providing parameter values of a policy for reinforcement learning and providing a distribution for parameter values of a policy for reinforcement learning. Neither element integrates the abstract idea into a practical application, because each recites insignificant extra-solution activity of data gathering (see MPEP 2106.05(g)). Therefore, the abstract idea of claim 1 is not integrated into a practical application.

Subject Matter Eligibility Analysis, Step 2B: The additional elements of claim 1, taken alone and in combination, do not provide significantly more than the abstract idea itself, because providing parameter values of a policy for reinforcement learning and providing a distribution for parameter values of a policy for reinforcement learning are the well-understood, routine, and conventional activity of "transmitting or receiving data over a network" (see MPEP 2106.05(d)(II); OIP Techs., Inc. v. Amazon.com, Inc., 788 F.3d 1359, 1363, 115 USPQ2d 1090, 1093 (Fed. Cir. 2015) (sending messages over a network)). Therefore, claim 1 is subject-matter ineligible.
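Note (Prosecution Insights): the update recited in claim 1 has the general shape of a KL-regularized policy improvement step. Below is a minimal runnable sketch reconstructed from the claim language alone; the toy reward, unit-variance Gaussian policies, step size, and every name in it are illustrative assumptions, not the applicant's disclosed implementation.

import numpy as np

rng = np.random.default_rng(0)

def reward(a):
    # Hypothetical stand-in task: reward peaks at action a = 2.
    return -(a - 2.0) ** 2

def objective(theta, mu_b, actions, rewards):
    # Importance-weighted estimate of the expected reward under pi_theta
    # (unit-variance Gaussian policies assumed throughout).
    log_w = -0.5 * ((actions - theta) ** 2 - (actions - mu_b) ** 2)
    reward_est = np.mean(np.exp(log_w) * rewards)
    # Estimate of the distance between pi_theta and the behavior policy:
    # closed-form KL between two unit-variance Gaussians.
    kl_est = 0.5 * (theta - mu_b) ** 2
    # Their difference is the surrogate the claim describes: the KL penalty
    # discounts the reward estimate where it becomes unreliable.
    return reward_est - kl_est

theta = 0.0  # provided parameter values of the policy
for _ in range(50):
    mu_b = theta                               # behavior policy derived from the policy
    actions = rng.normal(mu_b, 1.0, size=256)  # sample a training data set with it
    rewards = reward(actions)
    eps = 1e-3                                 # crude finite-difference ascent step
    grad = (objective(theta + eps, mu_b, actions, rewards)
            - objective(theta - eps, mu_b, actions, rewards)) / (2 * eps)
    theta += 0.5 * grad

print(f"learned policy mean: {theta:.2f}")     # drifts toward the reward peak at 2

In this sketch the KL term is what gives the surrogate its lower-bound character: it discounts the importance-weighted reward estimate as the policy moves away from the behavior policy that generated the data.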
Regarding Claim 2: Under Step 2A Prong 1, claim 2 recites determining the update for the distribution depending on a distribution that results in a value of the objective function that is larger than a value of the objective function that results for at least one other distribution (a mental process, as it encompasses a human mentally determining the update based on a distribution). Therefore, claim 2 recites an abstract idea. Under Step 2A Prong 2, claim 2 recites no additional elements, so the abstract idea is not integrated into a practical application. Under Step 2B, with no additional elements, claim 2 does not provide significantly more than the abstract idea itself, taken alone and in combination. Therefore, claim 2 is subject-matter ineligible.

Regarding Claim 3: Under Step 2A Prong 1, claim 3 recites determining the update for the distribution depending on the distribution that maximizes the value of the objective function (a mental process, as it encompasses a human mentally determining the update based on a distribution). Therefore, claim 3 recites an abstract idea. Under Step 2A Prong 2, claim 3 recites no additional elements, so the abstract idea is not integrated into a practical application. Under Step 2B, with no additional elements, claim 3 does not provide significantly more than the abstract idea itself. Therefore, claim 3 is subject-matter ineligible.

Regarding Claim 4: Under Step 2A Prong 1, claim 4 recites wherein the objective function includes a term that depends on a sum of the confidence parameter and a Kullback-Leibler divergence between the distribution and the reference distribution (a mental process, as it further describes the mental process of determining an update with an objective function). Therefore, claim 4 recites an abstract idea. Under Step 2A Prong 2, claim 4 further recites the additional elements of providing a reference distribution over the parameter values and providing a confidence parameter; these recite insignificant extra-solution activity of data gathering (see MPEP 2106.05(g)) and do not integrate the abstract idea into a practical application. Under Step 2B, these additional elements, taken alone and in combination, do not provide significantly more than the abstract idea itself, because providing a reference distribution over the parameter values and providing a confidence parameter is the well-understood, routine, and conventional activity of "transmitting or receiving data over a network" (see MPEP 2106.05(d)(II); OIP Techs., Inc. v. Amazon.com, Inc., 788 F.3d 1359, 1363 (Fed. Cir. 2015) (sending messages over a network)). Therefore, claim 4 is subject-matter ineligible.
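Note (Prosecution Insights): the claim 4 term has a shape familiar from PAC-Bayesian policy search, where a penalty grows with the sum of a confidence term and the KL divergence to a reference distribution. One common form of such a lower bound, offered only as an illustrative reconstruction and not as the applicant's equation, is

J(q) = E_{θ~q}[R̂(θ)] − sqrt( (D_KL(q || q₀) + ln(1/δ)) / (2n) ),

where q is the distribution over parameter values, q₀ is the reference distribution, δ is the confidence parameter, and n is the number of training samples; bounds of this form typically hold with probability at least 1 − δ under standard boundedness assumptions.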
Regarding Claim 5: Under Step 2A Prong 1, claim 5 recites sampling parameters from the reference distribution or from the distribution, and determining the behavior policy depending on the parameter values that are sampled from the distribution (a mental process, as it encompasses a human mentally sampling parameters). Therefore, claim 5 recites an abstract idea. Under Step 2A Prong 2, claim 5 recites no additional elements, so the abstract idea is not integrated into a practical application. Under Step 2B, with no additional elements, claim 5 does not provide significantly more than the abstract idea itself. Therefore, claim 5 is subject-matter ineligible.

Regarding Claim 6: Under Step 2A Prong 1, claim 6 recites determining parameter values that result in a value of the objective function that is larger than a value of the objective function that results for other parameter values (a mental process, as it encompasses a human mentally determining values). Therefore, claim 6 recites an abstract idea. Under Step 2A Prong 2, claim 6 recites no additional elements, so the abstract idea is not integrated into a practical application. Under Step 2B, with no additional elements, claim 6 does not provide significantly more than the abstract idea itself. Therefore, claim 6 is subject-matter ineligible.

Regarding Claim 7: Under Step 2A Prong 1, claim 7 recites determining the parameter values that maximize the value of the objective function (a mental process, as it encompasses a human mentally determining values). Therefore, claim 7 recites an abstract idea. Under Step 2A Prong 2, claim 7 recites no additional elements, so the abstract idea is not integrated into a practical application. Under Step 2B, with no additional elements, claim 7 does not provide significantly more than the abstract idea itself. Therefore, claim 7 is subject-matter ineligible.

Regarding Claim 8: Under Step 2A Prong 1, claim 8 recites determining the behavior policy depending on initial parameter values or depending on the parameter values (a mental process, as it encompasses a human mentally determining the policy). Therefore, claim 8 recites an abstract idea. Under Step 2A Prong 2, claim 8 recites no additional elements, so the abstract idea is not integrated into a practical application. Under Step 2B, with no additional elements, claim 8 does not provide significantly more than the abstract idea itself. Therefore, claim 8 is subject-matter ineligible.

Regarding Claim 9: Under Step 2A Prong 1, claim 9 recites determining the policy depending on the parameter values or determining the distribution and sampling the parameters of the policy from the distribution
(a mental process, as it encompasses a human mentally determining the policy). Therefore, claim 9 recites an abstract idea. Under Step 2A Prong 2, claim 9 recites no additional elements, so the abstract idea is not integrated into a practical application. Under Step 2B, with no additional elements, claim 9 does not provide significantly more than the abstract idea itself. Therefore, claim 9 is subject-matter ineligible.

Regarding Claim 10: Under Step 2A Prong 1, claim 10 recites determining output data from the input data with the policy, for controlling an apparatus (a mental process, as it encompasses a human mentally determining data). Therefore, claim 10 recites an abstract idea. Under Step 2A Prong 2, claim 10 further recites the additional element of receiving input data, which recites insignificant extra-solution activity of data gathering (see MPEP 2106.05(g)) and does not integrate the abstract idea into a practical application. Under Step 2B, this additional element, taken alone and in combination, does not provide significantly more than the abstract idea itself, because receiving input data is the well-understood, routine, and conventional activity of "transmitting or receiving data over a network" (see MPEP 2106.05(d)(II); OIP Techs., Inc. v. Amazon.com, Inc., 788 F.3d 1359, 1363 (Fed. Cir. 2015) (sending messages over a network)). Therefore, claim 10 is subject-matter ineligible.

Regarding Claim 11: Under Step 1, claim 11 recites a device and is thus a machine, one of the four statutory categories of patentable subject matter. Under Step 2A Prong 1, claim 11 recites: determine a behavior policy depending on the policy (a mental process, as it encompasses a human mentally determining a behavior policy); sample a training data set with the behavior policy (a mental process, as it encompasses a human mentally sampling a data set); determine an update for the parameter values with an objective function, wherein the objective function maps a difference between an estimate for an expected reward when following the policy and an estimate for a distance between the policy and the behavior policy, that depends on the policy and on the behavior policy, to the update, wherein the objective function corresponds to a lower bound that is determined from the training data (a mental process, as it encompasses a human mentally determining an update using an objective function); and determine an update for the distribution with an objective function, wherein the objective function maps a difference between an expectancy value for an estimate for an expected reward when following the policy and an expectancy value for an estimate for a distance between the policy and the behavior policy, that depends on the policy and on the behavior policy, to the update, wherein the objective function corresponds to a lower bound that is determined from the training data (a mental process, as it encompasses a human mentally determining an update using an objective function). Therefore, claim 11 recites an abstract idea.
Under Step 2A Prong 2, claim 11 further recites the additional elements of an input, an output, at least one processor, and at least one storage (generic computing components on which to perform the abstract idea; see MPEP 2106.05(f)), as well as provide parameter values of a policy for reinforcement learning and provide a distribution for parameter values of a policy for reinforcement learning (insignificant extra-solution activity of data gathering; see MPEP 2106.05(g)). None of these elements integrates the abstract idea into a practical application. Under Step 2B, the additional elements, taken alone and in combination, do not provide significantly more than the abstract idea itself: the input, output, processor, and storage merely use a computer as a tool to perform the abstract idea (see MPEP 2106.05(f)), and providing parameter values and providing a distribution for parameter values of a policy for reinforcement learning are the well-understood, routine, and conventional activity of "transmitting or receiving data over a network" (see MPEP 2106.05(d)(II); OIP Techs., Inc. v. Amazon.com, Inc., 788 F.3d 1359, 1363 (Fed. Cir. 2015) (sending messages over a network)). Therefore, claim 11 is subject-matter ineligible.

Regarding Claim 12: Under Step 1, claim 12 recites a non-transitory computer-readable medium and is thus a manufacture, one of the four statutory categories of patentable subject matter. Under Step 2A Prong 1, claim 12 recites: determining a behavior policy depending on the policy (a mental process, as it encompasses a human mentally determining a behavior policy);
sampling a training data set with the behavior policy (a mental process, as it encompasses a human mentally sampling a data set); determining an update for the parameter values with an objective function, wherein the objective function maps a difference between an estimate for an expected reward when following the policy and an estimate for a distance between the policy and the behavior policy, that depends on the policy and on the behavior policy, to the update, wherein the objective function corresponds to a lower bound that is determined from the training data (a mental process, as it encompasses a human mentally determining an update using an objective function); and determining an update for the distribution with an objective function, wherein the objective function maps a difference between an expectancy value for an estimate for an expected reward when following the policy and an expectancy value for an estimate for a distance between the policy and the behavior policy, that depends on the policy and on the behavior policy, to the update, wherein the objective function corresponds to a lower bound that is determined from the training data (a mental process, as it encompasses a human mentally determining an update using an objective function). Therefore, claim 12 recites an abstract idea.

Under Step 2A Prong 2, claim 12 further recites the additional elements of a non-transitory computer-readable medium on which is stored a computer program including computer-readable instructions for reinforcement learning that, when executed by a processor, cause the processor to perform the method (mere "apply it on a computer"; see MPEP 2106.05(f)), as well as providing parameter values of a policy for reinforcement learning and providing a distribution for parameter values of a policy for reinforcement learning (insignificant extra-solution activity of data gathering; see MPEP 2106.05(g)). None of these integrates the abstract idea into a practical application. Under Step 2B, these additional elements, taken alone and in combination, do not provide significantly more than the abstract idea itself: the computer-readable medium uses a computer as a tool to perform the abstract idea (see MPEP 2106.05(f)), and
providing parameter values and providing a distribution for parameter values of a policy for reinforcement learning are the well-understood, routine, and conventional activity of "transmitting or receiving data over a network" (see MPEP 2106.05(d)(II); OIP Techs., Inc. v. Amazon.com, Inc., 788 F.3d 1359, 1363 (Fed. Cir. 2015) (sending messages over a network)). Therefore, claim 12 is subject-matter ineligible.

Claim Rejections - 35 USC § 102

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless – (a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1, 4-5, and 8-10 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Peters et al., "Relative Entropy Policy Search" (hereafter Peters).

Regarding claim 1, Peters teaches a method for reinforcement learning, wherein the method comprises the following steps:

providing parameter values of a policy for reinforcement learning (Peters, page 1, column 1, Introduction: "Policy search is a reinforcement learning approach that attempts to learn improved policies based on information observed in past trials or from observations of another agent's actions"; and Peters, page 2, 2nd column, 1st paragraph: "In this equation, φ(s)^T θ = V_θ(s) is known today as value function while the Lagrangian multipliers θ become parameters and λ the average return"; see also Peters, page 3, Table 1, the algorithm table reproduced as an image in the original action. Examiner notes that θ are parameter values that are then used to evaluate a policy.);

determining a behavior policy depending on the policy (Peters, page 3, Table 1. Examiner notes that the Actor computes a new policy, i.e., the behavior policy, based on the Critic's evaluation of the policy.);

sampling a training data set with the behavior policy (Peters, page 3, Table 1. Examiner notes that sampling occurs by using the current policy (i.e., on-policy) or a new policy computed by the Actor (another policy).); and
determining an update for the parameter values with an objective function (Peters, page 3, Table 1. Examiner notes that the Dual Function is the objective function and that the Dual Function sits inside the policy-update loop, which uses the parameter values for evaluation.);

wherein the objective function maps a difference between an estimate for an expected reward when following the policy and an estimate for a distance between the policy and the behavior policy, that depends on the policy and on the behavior policy, to the update (Peters, page 2, 1st column, Relative Entropy Policy Search Motivation: "Relative entropy policy search (REPS) aims at finding the optimal policy that maximizes the expected return based on all observed series of states, actions and rewards. At the same time, we intend to bound the loss of information measured using relative entropy between the observed data distribution q(s,a) and the data distribution p^π(s,a) = μ^π(s)π(a|s) generated by the new policy π", where "in order to turn the proposed solution into algorithms, we need to efficiently determine the solution (θ*, η*) of the dual function g" (Peters, page 3, 1st column, 2nd paragraph). Examiner notes that the dual function is the objective function, and that the objective of REPS is to maximize the expected return and bound the loss of information: the expected return is the estimate for an expected reward when following the policy, and bounding the loss of information is an estimate for a distance between the policy and the behavior policy.);

wherein the objective function corresponds to a lower bound that is determined from the training data (Peters, page 3, Table 1. Examiner notes that the objective function is the Dual Function and the lower bound it corresponds to is the minimum computed in the second box labeled "Optimize". Examiner further notes that the lower bound is determined from the training data, i.e., the samples, since g, which is used in the Optimize step, is computed from the samples.);

or wherein the method comprises the following steps:

providing a distribution for parameter values of a policy for reinforcement learning (Peters, page 1, 2nd column, 2nd paragraph: "In this paper, we propose a new method based on this insight, that allows us to estimate new policies given a data distribution both for off-policy or on-policy reinforcement learning." Examiner notes that the data distribution is a distribution for parameter values.);

determining a behavior policy depending on the policy (Peters, page 3, Table 1. Examiner notes that the Actor computes a new policy, i.e., the behavior policy, based on the Critic's evaluation of the policy.);

sampling a training data set with the behavior policy (Peters, page 3, Table 1. Examiner notes that sampling occurs by using the current policy or a new policy computed by the Actor.); and

determining an update for the distribution with an objective function (Peters, page 1, 2nd column, 2nd paragraph: "As we do not work in a parametrized policy gradient framework, we can directly compute a policy update based on all information observed from previous policies or exploratory sampling distributions.
All sufficient statistics can be determined by optimizing the dual function that yields the equivalent of a value function of a policy for a data set." Examiner notes that the objective function is the dual function.);

wherein the objective function maps a difference between an expectancy value for an estimate for an expected reward when following the policy and an expectancy value for an estimate for a distance between the policy and the behavior policy, that depends on the policy and on the behavior policy, to the update (Peters, page 2, 1st column, Relative Entropy Policy Search Motivation: "Relative entropy policy search (REPS) aims at finding the optimal policy that maximizes the expected return based on all observed series of states, actions and rewards. At the same time, we intend to bound the loss of information measured using relative entropy between the observed data distribution q(s,a) and the data distribution p^π(s,a) = μ^π(s)π(a|s) generated by the new policy π. Ideally, we want to make use of every sample (s, a, s', r) independently, hence, we express the information loss bound as D(p^π || q) = Σ_{s,a} μ^π(s)π(a|s) log[μ^π(s)π(a|s) / q(s,a)] ≤ ε (Eq. 4), where D(p^π || q) denotes the Kullback-Leibler divergence, q(s,a) denotes the observed state-action distribution, and ε is our maximal information loss", where "in order to turn the proposed solution into algorithms, we need to efficiently determine the solution (θ*, η*) of the dual function g" (Peters, page 3, 1st column, 2nd paragraph). Examiner notes that the dual function is the objective function: the expected return is the expectancy value for an estimate for an expected reward when following the policy, and D is the expectancy value for an estimate for a distance between the policy and the behavior policy.);

wherein the objective function corresponds to a lower bound that is determined from the training data (Peters, page 3, Table 1. Examiner notes that the objective function is the Dual Function, the lower bound it corresponds to is the minimum computed in the second box labeled "Optimize", and the lower bound is determined from the training data, i.e., the samples, since g is computed from the samples.).
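Note (Prosecution Insights): to make the repeatedly cited Table 1 concrete, the REPS step at issue is, in sample-based form, a minimization of the dual function g over (θ, η) followed by an exponential reweighting of the observed data. The sketch below is reconstructed from Peters et al. (2010); the random placeholder data, feature dimension, and optimizer choice are assumptions for illustration only.

import numpy as np
from scipy.optimize import minimize

eps_kl = 0.1                               # maximal information loss (epsilon in Eq. 4)
N, d = 500, 3
rng = np.random.default_rng(1)
phi_s = rng.normal(size=(N, d))            # placeholder features of sampled states
phi_s2 = rng.normal(size=(N, d))           # placeholder features of successor states
r = rng.normal(size=N)                     # placeholder sampled rewards

def bellman_error(theta):
    # delta_theta(s, a) = r + phi(s')^T theta - phi(s)^T theta
    return r + (phi_s2 - phi_s) @ theta

def dual(x):
    # Sample-based REPS dual: g(theta, eta) = eta*eps + eta*log mean_i exp(delta_i/eta)
    theta, eta = x[:-1], max(x[-1], 1e-6)  # keep the temperature eta positive
    z = bellman_error(theta) / eta
    return eta * eps_kl + eta * (np.log(np.mean(np.exp(z - z.max()))) + z.max())

x0 = np.zeros(d + 1)
x0[-1] = 1.0
sol = minimize(dual, x0, method="Nelder-Mead")  # the "Optimize" box of Table 1
theta_star, eta_star = sol.x[:-1], max(sol.x[-1], 1e-6)

# Actor step of Table 1: reweight the observed state-action distribution
# by exp(delta/eta*) to form the new policy.
z_star = bellman_error(theta_star) / eta_star
w = np.exp(z_star - z_star.max())          # numerically stabilized weights
w /= w.sum()

The minimum found in the optimization step is the quantity the examiner points to as the claimed "lower bound ... determined from the training data", and the weights w correspond to the Actor's policy update in Table 1.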
Regarding claim 4, Peters teaches the method according to claim 1, wherein the method further comprises providing a reference distribution over the parameter values, and providing a confidence parameter, wherein the objective function includes a term that depends on a sum of the confidence parameter and a Kullback-Leibler divergence between the distribution and the reference distribution (Peters, page 2, 1st column, Relative Entropy Policy Search Motivation: "Relative entropy policy search (REPS) aims at finding the optimal policy that maximizes the expected return ... we express the information loss bound as D(p^π || q) = Σ_{s,a} μ^π(s)π(a|s) log[μ^π(s)π(a|s) / q(s,a)] ≤ ε (Eq. 4), where D(p^π || q) denotes the Kullback-Leibler divergence, q(s,a) denotes the observed state-action distribution, and ε is our maximal information loss"; see also Peters, page 3, Table 1. Examiner notes that the reference distribution is the observed data distribution, the confidence parameter is δ_θ(s,a) from the Bellman Error Function, and the Kullback-Leibler divergence is ε; the sum of the confidence parameter and the Kullback-Leibler divergence can be found in the Dual Function, i.e., the objective function.).

Regarding claim 5, Peters teaches the method according to claim 4, wherein the method further comprises sampling parameters from the reference distribution or from the distribution, and determining the behavior policy depending on the parameter values that are sampled from the distribution (Peters, page 3, Table 1. Examiner notes that sampling parameters is performed in the Optimize step, which uses the reference distribution q(s,a) within the input of function g; after the Optimize step, the Actor determines the behavior policy, i.e., the new policy.).

Regarding claim 8, Peters teaches the method according to claim 1, wherein the method further comprises determining the behavior policy depending on initial parameter values or depending on the parameter values (Peters, page 3, Table 1. Examiner notes that computing a new policy, i.e., the behavior policy, is based on the parameters θ and η.).

Regarding claim 9, Peters teaches the method according to claim 1, wherein the method comprises determining the policy depending on the parameter values or determining the distribution and sampling the parameters of the policy from the distribution (Peters, page 3, Table 1. Examiner notes that computing a new policy, i.e., the behavior policy, is based on the parameters θ and η.).

Regarding claim 10, Peters teaches the method according to claim 9, wherein the method comprises receiving input data and determining output data from the input data with the policy, for controlling an apparatus (Peters, page 4, 1st column, 1st paragraph: "Subsequently, we show first steps towards a robot application currently under development", where "Application for REPS for reinforcement learning of motor primitive selection for robot table tennis has been successful in simulation"; and Peters, page 3, Table 1. Examiner notes that the input data are the features and the maximal information loss, the output data is the new policy computed using the input data and a previous policy, and the robot is the apparatus to be controlled.).

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103, which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

This application currently names joint inventors. In considering the patentability of the claims, the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 2-3, 6-7, and 11-12 are rejected under 35 U.S.C. 103 as being unpatentable over Peters in view of Kwon et al. (US 2022/0363279 A1) (hereafter Kwon).

Regarding claim 2, Peters teaches the method according to claim 1, wherein the method further comprises determining the update for the distribution depending on a distribution that results in a value of the objective function (Peters, page 1, 2nd column, 2nd paragraph: "As we do not work in a parametrized policy gradient framework, we can directly compute a policy update based on all information observed from previous policies or exploratory sampling distributions. All sufficient statistics can be determined by optimizing the dual function that yields the equivalent of a value function of a policy for a data set"; see also Peters, page 3, Table 1. Examiner notes that the objective function is the dual function and the value of the objective function is g.).

Peters teaches determining an update for the distribution depending on a distribution that results in a value of the objective function, but does not explicitly disclose a value of the objective function that is larger than a value of the objective function that results for at least one other distribution. Kwon, however, does disclose a value of the objective function that is larger than a value of the objective function that results for at least one other distribution (Kwon, page 26, paragraph 0107: "The objective function L_t^VF of the critic network may be represented using a larger value of two objective functions L_t^VF1 and L_t^VF2.
L_t^VF1 = (V_θVF(s_t^j) − V_t^targ)² is represented through the squared error, i.e., the squared difference between the target value function V_t^targ and the current value function V_θVF(s_t^j) approximated by the parameter θ_VF of the critic network", where "the final goal of the autonomous vehicle is to learn the optimal policy π that maximizes future cumulative rewards. In this instance, the policy π indicates a distribution of actions a_t^j that can be taken in the state s_t^j" (Kwon, page 25, paragraph 0094), and where "likewise, the critic network uses the state information s_t^j, but outputs the value function" (Kwon, page 26, paragraph 0108). Examiner notes that the critic network uses a distribution over s to update the distribution, i.e., the policy, by using the larger value of two objective functions.).

Peters and Kwon are analogous to the claimed invention because they both use objective functions in reinforcement learning. It would have been obvious to one having ordinary skill in the art before the effective filing date to have modified Peters to use a value of an objective function that is larger than another value of an objective function. Doing so allows for an "optimal policy π that maximizes future cumulative rewards" (Kwon, page 25, paragraph 0094).

Regarding claim 3, Peters in view of Kwon teaches the method according to claim 2. Peters further teaches wherein the method further comprises determining the update for the distribution depending on the distribution that maximizes the value of the objective function (Peters, page 2, 1st column, Relative Entropy Policy Search Motivation: "Relative entropy policy search (REPS) aims at finding the optimal policy that maximizes the expected return based on all observed series of states, actions and rewards. At the same time, we intend to bound the loss of information measured using relative entropy between the observed data distribution q(s,a) and the data distribution p^π(s,a) = μ^π(s)π(a|s) generated by the new policy π. Ideally, we want to make use of every sample (s, a, s', r) independently, hence, we express the information loss bound as D(p^π || q) = Σ_{s,a} μ^π(s)π(a|s) log[μ^π(s)π(a|s) / q(s,a)] ≤ ε (Eq. 4), where D(p^π || q) denotes the Kullback-Leibler divergence, q(s,a) denotes the observed state-action distribution, and ε is our maximal information loss", where "in order to turn the proposed solution into algorithms, we need to efficiently determine the solution (θ*, η*) of the dual function g" (Peters, page 3, 1st column, 2nd paragraph). Examiner notes that the dual function is the objective function, the expected return is the value of the objective function, and the distribution is the data distributions.).

Regarding claim 6, Peters teaches the method according to claim 1. Peters does not teach, but Kwon does teach, wherein the method further comprises determining parameter values that result in a value of the objective function that is larger than a value of the objective function that results for other parameter values (Kwon, page 26, paragraph 0107: "The objective function L_t^VF of the critic network may be represented using a larger value of two objective functions L_t^VF1 and L_t^VF2.
L_t^VF1 = (V_θVF(s_t^j) − V_t^targ)² is represented through the squared error, i.e., the squared difference between the target value function V_t^targ and the current value function V_θVF(s_t^j) approximated by the parameter θ_VF of the critic network." Examiner notes that θ_VF is the parameter values.).

Peters and Kwon are analogous to the claimed invention because they both use objective functions in reinforcement learning. It would have been obvious to one having ordinary skill in the art before the effective filing date to have modified Peters to use a value of an objective function that is larger than another value of an objective function. Doing so allows for an "optimal policy π that maximizes future cumulative rewards" (Kwon, page 25, paragraph 0094).

Regarding claim 7, Peters in view of Kwon teaches the method according to claim 6. Peters further teaches wherein the method further comprises determining the parameter values that maximize the value of the objective function (Peters, page 2, 1st column, Relative Entropy Policy Search Motivation, and page 3, 1st column, 2nd paragraph, as quoted for claim 3 above. Examiner notes that the dual function is the objective function, the expected return is the value of the objective function, and the parameter values are the data distributions.).
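Note (Prosecution Insights): the Kwon passage quoted for claims 2 and 6 defines a critic objective that keeps the larger of two squared-error terms. Only the first term is reproduced in the Office Action, so the clipped second term in the sketch below is an assumption (the PPO-style value-clipping form this construction commonly takes); all names are hypothetical.

import numpy as np

def critic_loss(v_pred, v_pred_old, v_target, clip=0.2):
    # L^VF1: squared error between the current value estimate and the target
    l1 = (v_pred - v_target) ** 2
    # L^VF2 (assumed form): squared error of the old estimate plus a clipped
    # correction, limiting how far the critic can move in one update
    v_clipped = v_pred_old + np.clip(v_pred - v_pred_old, -clip, clip)
    l2 = (v_clipped - v_target) ** 2
    # the objective keeps the larger (more pessimistic) of the two terms
    return np.mean(np.maximum(l1, l2))

loss = critic_loss(np.array([0.5, 1.2]), np.array([0.4, 0.9]), np.array([1.0, 1.0]))
print(f"critic loss: {loss:.4f}")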
Regarding claim 11, Peters teaches (i): provide parameter values of a policy for reinforcement learning; determine a behavior policy depending on the policy; sample a training data set with the behavior policy; and determine an update for the parameter values with an objective function, wherein the objective function maps a difference between an estimate for an expected reward when following the policy and an estimate for a distance between the policy and the behavior policy, that depends on the policy and on the behavior policy, to the update, wherein the objective function corresponds to a lower bound that is determined from the training data; or (ii): provide a distribution for parameter values of a policy for reinforcement learning; determine a behavior policy depending on the policy; sample a training data set with the behavior policy; and determine an update for the distribution with an objective function, wherein the objective function maps a difference between an expectancy value for an estimate for an expected reward when following the policy and an expectancy value for an estimate for a distance between the policy and the behavior policy, that depends on the policy and on the behavior policy, to the update, wherein the objective function corresponds to a lower bound that is determined from the training data (for both alternatives, see the Peters citations and the Examiner's mapping set forth for claim 1 above, which apply equally here).
Peters does not teach, but Kwon does teach, A device for reinforcement learning, the device comprising: an input; an output; at least one processor; and at least one storage (Kwon, page 30, paragraph 0178, The computer-readable recording include hardware devices specially designed to store and execute the program instructions” where “it is to be understood that all flow charts, state transition diagrams, pseudo-codes, and the like, illustrate various processes that may be tangibly embodied in a computer-readable medium and that are executed by computers or processors regardless of whether or not the computers or the processors are clearly illustrated” (Kwon, page 23, paragraph 0055) and see Figure 10 (Kwon, page 11, elements 31 and 33) PNG media_image11.png 542 1105 media_image11.png Greyscale Examiner notes that the input is the routing controller 31 which takes in the environment state and the output is the base controller 33 which outputs an action.); Peters and Kwon are analogous to the claimed invention because they are both reinforcement learning algorithms. It would have been obvious to a person having ordinary skill in the art to have implemented Peters on a device comprising an input, output, processor, and storage. Thus this would be applying a known technique (reinforcement learning algorithm) to a known device (a device comprising an input, output, processor, and storage) ready for improvement to yield predictable results (determine policies) (MPEP 2143 I. (C) Use of known technique to improve similar devices (methods or products) in the same way). Regarding claim 12, Peters teaches providing parameter values of a policy for reinforcement learning (Peters, page 2, 2nd column, 1st paragraph, “In this equation, ϕ s T θ   =   V θ ( s ) is known today as value function while the Langrangian multipliers θ become parameters and λ the average return” and see Table 1 below (Peters, page 3, Table 1) PNG media_image1.png 656 424 media_image1.png Greyscale Examiner notes that θ are parameter values that are then used to evaluate a policy.); determining a behavior policy depending on the policy (Peters, page 3, Table 1, PNG media_image2.png 656 424 media_image2.png Greyscale Examiner notes that the Actor computes a new policy or the behavior policy based on the Critic’s evaluation of the policy.); sampling a training data set with the behavior policy (Peters, page 3, Table 1, PNG media_image3.png 656 424 media_image3.png Greyscale Examiner notes that Sampling occurs by using the current policy (i.e. on-policy) or new policy computed by the actor (another policy).); and determining an update for the parameter values with an objective function (Peters, page 3, Table 1) PNG media_image4.png 656 424 media_image4.png Greyscale Examiner notes that the Dual Function is the objective function and that the Dual function is within the policy update for loop which uses parameter values to evaluate.); wherein the objective function maps a difference between an estimate for an expected reward when following the policy and an estimate for a distance between the policy and the behavior policy, that depends on the policy and on the behavior policy, to the update (Peters, page 2, 1st column, Relative Entropy Policy Search Motivation section, “Relative entropy policy search (REPS) aims at finding the optimal policy that maximizes the expected return based on all observed series of states, actions and rewards. 
At the same time, we intend to bound the loss of information measured using relative entropy between the observed data distribution q(s,a) and the data distribution $p_\pi(s,a) = \mu^\pi(s)\,\pi(a|s)$ generated by the new policy $\pi$” where “in order to turn proposed solution into algorithms, we need to efficiently determine the solution (θ*, η*) of the dual function g” (Peters, page 3, 1st column, 2nd paragraph). Examiner notes that the dual function is the objective function. Examiner further notes that the objective of REPS is to maximize the expected return and bound the loss of information. The expected return is the estimate for an expected reward when following the policy. Bounding the loss of information is an estimate for a distance between the policy and the behavior policy.), wherein the objective function corresponds to a lower bound that is determined from the training data (Peters, page 3, Table 1 [image: Peters, Table 1]. Examiner notes that the objective function is the dual function and the lower bound it corresponds to is the minimum computed in the second box labeled “Optimize”. Examiner further notes that the lower bound is determined from the training data, or the samples, since g, which is used in the Optimize step, is computed from the samples.); providing a distribution for parameter values of a policy for reinforcement learning (Peters, page 1, 2nd column, 2nd paragraph, “In this paper, we propose a new method based on this insight, that allows us to estimate new policies given a data distribution both for off-policy or on-policy reinforcement learning.” Examiner notes that the data distribution is a distribution for parameter values.); determining a behavior policy depending on the policy (Peters, page 3, Table 1 [image: Peters, Table 1]. Examiner notes that the Actor computes a new policy, or the behavior policy, based on the Critic’s evaluation of the policy.); sampling a training data set with the behavior policy (Peters, page 3, Table 1 [image: Peters, Table 1]. Examiner notes that Sampling occurs by using the current policy or the new policy computed by the Actor.); and determining an update for the distribution with an objective function (Peters, page 1, 2nd column, 2nd paragraph, “As we do not work in a parametrized policy gradient framework, we can directly compute a policy update based on all information observed from previous policies or exploratory sampling distributions. All sufficient statistics can be determined by optimizing the dual function that yields the equivalent of a value function of a policy for a data set.” Examiner notes that the objective function is the dual function.); wherein the objective function maps a difference between an expectancy value for an estimate for an expected reward when following the policy and an expectancy value for an estimate for a distance between the policy and the behavior policy, that depends on the policy and on the behavior policy, to the update (Peters, page 2, 1st column, Relative Entropy Policy Search Motivation section, “Relative entropy policy search (REPS) aims at finding the optimal policy that maximizes the expected return based on all observed series of states, actions and rewards.
At the same time, we intend to bound the loss of information measured using relative entropy between the observed data distribution q(s,a) and the data distribution $p_\pi(s,a) = \mu^\pi(s)\,\pi(a|s)$ generated by the new policy $\pi$. Ideally, we want to make use of every sample (s, a, s’, r) independently, hence, we express the information loss bound as $D(p_\pi \| q) = \sum_{s,a} \mu^\pi(s)\,\pi(a|s)\,\log \frac{\mu^\pi(s)\,\pi(a|s)}{q(s,a)} \le \varepsilon$ (4), where $D(p_\pi \| q)$ denotes the Kullback-Leibler divergence, q(s,a) denotes the observed state-action distribution, and $\varepsilon$ is our maximal information loss” where “in order to turn proposed solution into algorithms, we need to efficiently determine the solution (θ*, η*) of the dual function g” (Peters, page 3, 1st column, 2nd paragraph). Examiner notes that the dual function is the objective function. Examiner further notes that the objective of REPS is to maximize the expected return and bound the loss of information. The expected return is the expectancy value for an estimate for an expected reward when following the policy. Bounding the loss of information is an estimate for a distance between the policy and the behavior policy, where D is the expectancy value for an estimate for a distance between the policy and the behavior policy.); wherein the objective function corresponds to a lower bound that is determined from the training data (Peters, page 3, Table 1 [image: Peters, Table 1]. Examiner notes that the objective function is the dual function and the lower bound it corresponds to is the minimum computed in the second box labeled “Optimize”. Examiner further notes that the lower bound is determined from the training data, or the samples, since g, which is used in the Optimize step, is computed from the samples.).

Peters does not teach, but Kwon does teach, A non-transitory computer-readable medium on which is stored a computer program including computer-readable instructions for reinforcement learning, wherein the instructions, when executed by a processor, causing the processor to perform the following steps (Kwon, page 30, paragraph 0178, “The computer-readable recording [media] include hardware devices specially designed to store and execute the program instructions” where “it is to be understood that all flow charts, state transition diagrams, pseudo-codes, and the like, illustrate various processes that may be tangibly embodied in a computer-readable medium and that are executed by computers or processors regardless of whether or not the computers or the processors are clearly illustrated” (Kwon, page 23, paragraph 0055).) Peters and Kwon are analogous to the claimed invention because they are both reinforcement learning algorithms. It would have been obvious to a person having ordinary skill in the art to have implemented Peters on a non-transitory computer-readable medium on which is stored computer-readable instructions executable by a processor. Thus, this would be applying a known technique (reinforcement learning algorithm) to a known device (a non-transitory computer-readable medium on which is stored computer-readable instructions executable by a processor) ready for improvement to yield predictable results (determine policies) (MPEP 2143(I)(C): use of known technique to improve similar devices (methods, or products) in the same way).

Response to Arguments

The specification objections have been overcome in light of the instant amendments.
The previous claim objections have been overcome in light of the instant amendments. Examiner notes that new claim objections have been made in light of the instant amendments.

On pages 8-9, Applicant argues:

Applicant disagrees that the recitation of "the distribution" in claims 2, 3, 4, 5, and 9 makes them indefinite because this term is recited in one, but not the other, method recited in claim 1. Since the term is introduced using the definite article "the," there is no question that these dependent claims are referencing the "distribution" introduced in claim 1. Thus, there is no concern that the use of "the distribution" might confusingly refer to a new distribution, as the Patent Office implausibly indicates. Moreover, that "distribution" is recited in one, but not the other, method of claim 1 does not make its subsequent recitations in the dependent claims indefinite. Notably, the Patent Office appears to adopt a per se rule that anytime a claim term of a dependent claim finds antecedent support in one alternative recited in the independent claim but not another, the dependent claims are to be deemed indefinite simply by virtue of this fact. However, the Patent Office fails to cite to any caselaw or MPEP section that establishes this per se rule. But since the Patent Office does not reject the independent claim for reciting alternative methods (which is correct), it logically follows that dependent claims should not be penalized for referencing a term that appears in one of the alternative methods but not the other. That is to say, if the only justification offered to reject a dependent claim referencing a claim term recited in one but not the other alternative method of the independent claims is that the alternative recitation of two methods in the independent claim somehow makes the dependent claim indefinite, but the Patent Office accepts the alternative method recitation in the independent claim in the first place, then the mere fact that a dependent claim references a term in one of these alternative methods is insufficient to render the dependent claim indefinite. As for the "parameters" vs. "parameter values" concern that served as the basis for the rejection of dependent claims 4-9, the amendment to claim 1 has obviated the basis for this rejection.

Regarding the applicant’s arguments that the recitations of “the distribution” in claims 2-5 and 9-10 are not indefinite, the arguments are persuasive and the corresponding 112(b) rejections have been withdrawn. Specifically, examiner respectfully notes that the argument that the antecedent basis for these dependent claims is to one embodiment in claim 1 and not to both is persuasive. Examiner further notes the previous 112(b) rejections regarding the term “parameter values” have been overcome in light of the instant amendments. Examiner notes that new 112(b) rejections have been made on claims 5 and 9 in light of the instant amendments.

On pages 9-11, Applicant argues:

Applicant submits that the overall limitation to which this amendment belongs ("... the objective function maps a difference between ...") was improperly deemed a mental process by the Patent Office. The Patent Office has "the initial burden ... to explain why a claim or claims are ineligible for patenting clearly and specifically, so that applicant has sufficient notice and is able to effectively respond." MPEP at 2106.07.
As to Prong One, this obligation to explain "clearly and specifically" means "the rejection should identify the judicial exception by referring to what is recited (i.e., set forth or described) in the claim and explain why it is considered an exception." Id. at 2106.07(a). The MPEP defines the mental process judicial exception as "concepts performed in the human mind (including an observation, evaluation, judgment, opinion)." Id. at 2106.04(a). Moreover, "[c]laims do not recite a mental process when they do not contain limitations that can practically be performed in the human mind, for instance when the human mind is not equipped to perform the claim limitations." Id. at 2106.04(a)(2). Thus, if the Patent Office believes that the amendment recites the mental process judicial exception, it has the burden of "clearly and specifically" "explain[ing] why [they are] considered an exception." Id. at 2106.07, 2106.07(a). The Office Action does not comply with the above-discussed mandatory guidance from the MPEP because, rather than "clearly and specifically" "explain[ing] why" they recite a mental process, it simply attaches to each of the claim limitations analyzed under Prong One the conclusory, analysis-free statement "This limitation is a mental process as it encompasses a human mentally determining an update using an objective function." Repetitively attaching the same conclusory dismissal is also inconsistent with the recently issued precedential decision Ex parte Desjardins, Appeal No. 2024-000567 (PTAB September 26, 2025, Appeals Review Panel Decision), which vacated a new ground of rejection under 35 U.S.C. § 101 entered by a panel of the PTAB against claims directed to an improvement to machine learning, just like the present reinforcement learning claims. In vacating the Section 101 rejection, the Appeals Review Panel (ARP) condemned "overbroad reasoning" as "troubling" because "[c]ategorically excluding AI innovations from patent protection in the United States jeopardizes America's leadership in this critical emerging technology." Id. at 9. But "under the [] reasoning [employed in the current Office Action], many AI innovations are potentially unpatentable - even if they are adequately described and nonobvious - because the [Patent Office] essentially equated any machine learning with" a mental process "without adequate explanation." Id. Since the Office Action does not heed the admonition in Desjardins that "Examiners and panels should not evaluate claims at such a high level of generality," Applicant submits not only that the "overbroad reasoning" of the Prong One approach in the present Office Action is improper as to the claim limitations of the original claims, but that it would be improper for the amendment as well. Thus, the limitation in each alternative method of the independent claims expressed as "... the objective function maps a difference between ...", as amended by "wherein the objective function corresponds to a lower bound that is determined from the training data," is an "additional element" that qualifies for analysis under Prong Two.
Regarding the Applicant’s argument that the limitation of “determining an update for the parameter values with an objective function; wherein the objective function maps a difference between an estimate for an expected reward when following the policy and an estimate for a distance between the policy and the behavior policy, that depends on the policy and on the behavior policy, to the update, wherein the objective function corresponds to a lower bound that is determined from the training data” and its alternative method are not mental processes, the examiner respectfully disagrees. Specifically, Examiner respectfully notes that “determining an update for the parameter values with an objective function” is a mental process as it encompasses a human mentally determining an update. Further, it is a mental process as it encompasses a human mentally determining an update with a function. The remainder of the claim further describes this mental process of determining an update with a function because it further defines the function performed mentally. Examiner additionally notes that the analysis under 101 remains the same in view of Desjardins.

Regarding the applicant’s argument that the Examiner did not clearly and specifically explain why a claim is ineligible for patenting, Examiner respectfully disagrees. Examiner firstly respectfully notes that according to MPEP 2106.07(a), “the office action must provide an explanation as to why each claim is unpatentable, which must be sufficiently clear and specific.” Examiner additionally respectfully notes that the Examiner clearly and specifically explained why the claim as a whole is ineligible for patenting by breaking the analysis into steps and prongs. Examiner further respectfully notes that Examiner followed the guidance in MPEP 2106.07(a) by referring to what is recited and explaining why it is considered an exception.

On pages 11-12, Applicant argues:

Thus, even if certain other claim limitations recite a judicial exception, this "additional element" represents a technological improvement in the field of reinforcement learning that integrates the purported judicial exception into a practical application. The MPEP mandates that a Prong Two analysis consider statements of technological improvement made in the specification. That is, one way to determine whether "the claim as a whole integrates the exception into a practical application" is that "the specification should be evaluated to determine if the disclosure provides sufficient details such that one of ordinary skill in the art would recognize the claimed invention as providing an improvement." MPEP at 2106.04(d)(1). In evaluating Prong Two in terms of technological improvement, the "Prong Two analysis considers the claim as a whole. That is, the limitations containing the judicial exception as well as the additional elements in the claim besides the judicial exception need to be evaluated together to determine whether the claim integrates the judicial exception into a practical application." MPEP at 2106.04(d), Part III. This is the approach that the ARP took in Desjardins. To that end, the ARP in Desjardins recognized that "claims directed to an improvement in the functioning of a computer, or an improvement to other technology or technical field are patent eligible." Id. at 8.
In Desjardins the ARP found the claims at issue to be eligible under Section 101 through a Prong Two analysis that concluded that the claims at issue integrated the judicial exception into a practical application. In particular, the ARP performed a Prong Two analysis on the basis of the following statements from the specification at issue there:

For example, one improvement identified in the Specification is to "effectively learn new tasks in succession whilst protecting knowledge about previous tasks." Spec. ¶ 21. The Specification also recites that the claimed improvement allows artificial intelligence (AI) systems to "us[e] less of their storage capacity" and enables "reduced system complexity." Id. When evaluating the claim as a whole, we discern at least the following limitation of independent claim 1 that reflects the improvement: "adjust the first values of the plurality of parameters to optimize performance of the machine learning model on the second machine learning task while protecting performance of the machine learning model on the first machine learning task." We are persuaded that constitutes an improvement to how the machine learning model itself operates, and not, for example, the identified mathematical calculation.

This is not the approach that the Patent Office takes in the present Office Action because nowhere in the Prong Two analysis provided at page 7 does the Patent Office evaluate the statements of technological improvement in the specification. Instead, the Prong Two analysis at page 5 of the Office Action simply rests on an across-the-board characterization of additional elements stating "[t]his element does not integrate the abstract idea into a practical application because it recites insignificant extra-solution activity of data gathering." As explained above, the ARP in Desjardins analyzed the specification at issue in order to determine whether the additional claim element "constitutes an improvement to how the machine learning model itself operates." Id. at 9. Had the Patent Office considered the specification in its Prong Two analysis, it would have evaluated at least the following statement from [0032]: "The method comprises using this lower bound as an objective function, since maximizing a lower bound on the expected reward provides a policy $\pi_\theta$ that has a high expected reward." With this statement, the specification links the improvement in reinforcement learning manifested as a "high expected reward" with the use of a "lower bound as an objective function," and the claim amendment reflects this improvement by reciting "wherein the objective function corresponds to a lower bound that is determined from the training data," which is the feature that provides the "high expected reward" improvement to reinforcement learning. MPEP at 2106.04(d)(1) ("[T]he claim must be evaluated to ensure that the claim itself reflects the disclosed improvement."). Therefore, based on the above discussion, since the amended claims are directed to an improvement in how reinforcement learning operates, it passes muster under Prong Two.
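For context, the lower-bound argument the applicant draws from [0032] is the standard one; a one-line sketch in our notation (not the specification's own derivation): if $L(\theta) \le J(\theta)$ for all $\theta$, where $J(\theta)$ is the expected reward of policy $\pi_\theta$ and $L$ is the lower bound determined from the training data, then for $\theta^\ast = \arg\max_\theta L(\theta)$,

$$J(\theta^\ast) \;\ge\; L(\theta^\ast) \;=\; \max_{\theta} L(\theta),$$

so the learned policy is guaranteed an expected reward at least as large as the best achievable lower-bound value, which is the sense in which maximizing the lower bound "provides a policy that has a high expected reward."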
Regarding the Applicant’s argument that the limitation of “determining an update for the parameter values with an objective function; wherein the objective function maps a difference between an estimate for an expected reward when following the policy and an estimate for a distance between the policy and the behavior policy, that depends on the policy and on the behavior policy, to the update, wherein the objective function corresponds to a lower bound that is determined from the training data” and its alternative method provide an integration into a practical application, the Examiner respectfully disagrees. Specifically, this limitation is a mental process and thus cannot provide an integration into a practical application (MPEP 2106.04(d)(I)). Regarding the Applicant’s argument that the specification provides an improvement in light of the amended claims, Examiner respectfully disagrees. Specifically, Examiner notes that the judicial exception alone cannot provide the improvement (MPEP 2106.05(a)). Examiner further notes that the statement from [0032] is not reflected in the additional elements of “providing parameter values of a policy for reinforcement learning” and “providing a distribution for parameter values of a policy for reinforcement learning.” Examiner further notes that even if the statement from [0032] were reflected in the additional elements, the statement from [0032] provides a bare assertion of an improvement without the detail necessary to be apparent to one of ordinary skill in the art and, thus, cannot provide an improvement (MPEP 2106.04(d)(1)).

On pages 12-13, Applicant argues:

Even if the Patent Office concludes that the "additional element" does not integrate the judicial exception into a practical application, the Patent Office is obligated to evaluate it under Step 2B. Section 2106.05(d) of the MPEP states that in a Step 2B analysis a "factual determination is required to support a conclusion that an additional element (or combination of additional elements) is well-understood, routine, conventional activity." A proper factual determination under Step 2B requires conformity with Berkheimer v. HP, Inc., 881 F.3d 1360, 1368, 125 USPQ2d 1649, 1654 (Fed. Cir. 2018). More to the point, Section 2106.07(a) of the MPEP implements Berkheimer by stating that under Step 2B:

Examiners should not assert that an additional element (or combination of elements) is well-understood, routine, or conventional unless the examiner finds, and expressly supports the rejection in writing with one or more of the following:
(A) A citation to an express statement in the specification or to a statement made by an applicant during prosecution that demonstrates the well-understood, routine, conventional nature of the additional element(s) ...
(B) A citation to one or more of the court decisions discussed in MPEP § 2106.05(d), subsection II, as noting the well-understood, routine, conventional nature of the additional element(s) ...
(C) A citation to a publication that demonstrates the well-understood, routine, conventional nature of the additional element(s) ...
(D) A statement that the examiner is taking official notice of the well-understood, routine, conventional nature of the additional element(s).

Id. at 2106.07(a).
The Step 2B analysis of the Office Action is set forth in pages 7 and 8 of the Office Action: … This analysis fails to comply with the above-referenced framework because it relies on mere conclusory statements without analyzing this element under Step 2B by determining whether it is well-understood, routine, or conventional on the basis of evidence conforming to one or more of categories (A), (C), and (D). See Ex parte Mercer (Appeal No. 2024-2371) (Decision on Request for Rehearing, October 30, 2025) (nonprecedential) (granting request for rehearing of prior decision affirming Section 101 rejection for failing to provide the "factual determination [] required to support the conclusion that an additional element (or combination of additional elements) is well-understood, conventional activity" in accordance with the Berkheimer factors). Moreover, the reliance on caselaw under category (B) in the above Office Action blurb falls short because the Patent Office fails to establish that "the claim limitations before the examiner are the same as those found to be well-understood, routine, conventional by the courts." Id.

Regarding the Applicant’s argument that the Examiner’s analysis fails to comply with MPEP 2106.07(a), Examiner respectfully disagrees. Specifically, Examiner notes that Examiner has specifically complied with point (B) of MPEP 2106.07(a)(III), which states: “Examiners should not assert that an additional element (or combination of elements) is well-understood, routine, or conventional unless the examiner finds, and expressly supports the rejection in writing with one or more of the following: … (B) A citation to one or more of the court decisions discussed in MPEP § 2106.05(d), subsection II, as noting the well-understood, routine, conventional nature of the additional element(s). Examiners should be careful to ensure the claim limitations before the examiner are the same as those found to be well-understood, routine, conventional by the courts. The additional elements under examination should be recited in the same manner, meaning they should be recited at the same high level of generality as in those court decisions.” Examiner further respectfully notes that after each limitation described as well-understood, routine, and conventional activity is a citation to court cases in section 2106.05(d) of the MPEP.

On pages 15-16, Applicant argues:

As discussed above, Applicant amended the independent claims to recite "wherein the objective function corresponds to a lower bound that is determined from the training data." The Patent Office maps Peters to the "training data" by highlighting the "Sampling" portion of Table 1 at page 3, as follows at page 24 of the Office Action: … Even if a POSITA would have deemed the sampling of Table 1 to be a sampling of training data, Peters does not disclose using this purported training data to determine a lower bound as in the limitation "wherein the objective function corresponds to a lower bound that is determined from the training data." Therefore withdrawal of this rejection is requested.

Section 103 Rejection Based on Peters and Kwon

Since Kwon does not overcome the deficiency in Peters, withdrawal of this rejection is requested.

Regarding the Applicant’s argument that the prior art does not teach “wherein the objective function corresponds to a lower bound that is determined from the training data,” Examiner respectfully disagrees. Specifically, Peters teaches this in Table 1 with the optimizing step.
In this optimizing step, a minimum, or lower bound, is found. This minimum corresponds to the objective function, or dual function, since the dual function needs to be computed in order to perform the Optimize step. Examiner further notes that the lower bound found in the Optimize step is determined from the training data since the dual function, which is needed for the Optimize step, also requires the training data. Examiner respectfully points the Applicant to the 102 and 103 rejections above.

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Xu et al. (US 11,836,620 B2) also discloses a reinforcement learning algorithm that updates policies.

THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to KAITLYN R HAEFNER, whose telephone number is (571) 272-1429. The examiner can normally be reached Monday - Thursday, 7:15 am - 5:15 pm EST. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michelle Bechtold, can be reached at (571) 431-0762. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/K.R.H./
Examiner, Art Unit 2148

/MICHELLE T BECHTOLD/
Supervisory Patent Examiner, Art Unit 2148
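The technical crux of the final §102/§103 dispute above is whether the "lower bound ... determined from the training data" limitation reads on Peters' sample-based optimization. As a companion to the dual-function sketch earlier, the information-loss constraint of Peters' equation (4) can likewise be estimated from samples; the sketch below is illustrative only, with assumed names and toy distributions rather than anything from the record.

```python
import numpy as np

def kl_information_loss(p_weights, q_weights):
    """Sample-based estimate of D(p_pi || q) from Peters' equation (4).

    p_weights : unnormalized mu^pi(s) * pi(a|s) per observed (s, a) pair
    q_weights : unnormalized observed state-action distribution q(s, a)
    """
    p = np.asarray(p_weights, dtype=float) / np.sum(p_weights)
    q = np.asarray(q_weights, dtype=float) / np.sum(q_weights)
    return float(np.sum(p * np.log(p / q)))

eps = 0.2  # maximal information loss epsilon
p = np.array([0.40, 0.30, 0.20, 0.10])    # new-policy distribution p_pi(s, a)
q = np.array([0.25, 0.25, 0.25, 0.25])    # observed (behavior) distribution q(s, a)
d = kl_information_loss(p, q)
print(f"D(p||q) = {d:.3f}, within bound: {d <= eps}")  # D(p||q) = 0.106, True
```

Whether this kind of sample-driven bound computation teaches the claimed limitation is exactly what the applicant and examiner dispute above; the sketch only makes the mechanics concrete.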

Prosecution Timeline

Oct 14, 2022
Application Filed
Aug 26, 2025
Non-Final Rejection — §101, §102, §103
Feb 03, 2026
Response Filed
Mar 20, 2026
Final Rejection — §101, §102, §103, §112 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602431
METHODS FOR PERFORMING INPUT-OUTPUT OPERATIONS IN A STORAGE SYSTEM USING ARTIFICIAL INTELLIGENCE AND DEVICES THEREOF
Granted Apr 14, 2026 (2y 5m to grant)
Patent 12572828
METHOD FOR INDUSTRY TEXT INCREMENT AND ELECTRONIC DEVICE
Granted Mar 10, 2026 (2y 5m to grant)
Based on this examiner's 2 most recent grants.


Prosecution Projections

3-4
Expected OA Rounds
50%
Grant Probability
99%
With Interview (+66.7%)
4y 2m
Median Time to Grant
Moderate
PTA Risk
Based on 4 resolved cases by this examiner. Grant probability derived from career allow rate.
