Office Action Analysis: 18187401 — Non-Uniform Pessimistic Reinforcement Learning

Office Action

§101 §103
DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This office action is in response to submission of application on 3/21/2023.
Claims 1-20 are presented for examination.
Claim Interpretation
“Non-uniform underestimation” is being interpreted as calculating quantile values, functions, and uncertainty. (Specification, summary, paragraph [0006], “using non-uniform underestimation, based on the minibatch includes computing quantile values using the plurality of predictors for state-action pairs in the minibatch; computing an uncertainty of the quantile values based on a difference in outputs of the plurality of predictors; and updating the plurality of predictors based on the uncertainty of the quantile values.”)

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


	Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. 
	Step 1: Is the claim to a process, machine, manufacture or composition of matter?
	Claims 1-8 are directed to a method; claims 9-16 are directed to a system comprising memory for storing instructions; and claims 17-20 are directed to a computer readable storage medium; therefore, all claims are directed to one of the four statutory categories.
Step 2A Prong One: Does the claim recite an abstract idea, law of nature, or natural phenomenon?
	Claim 1 recites limitations of: 
randomly sampling a dataset comprising historical training data between an agent and an environment to generate a minibatch of the historical training data - mental process (observation, evaluation, judgement) as a human mind can randomly choose data to create a dataset.
updating a plurality of predictors, using non-uniform underestimation, based on the minibatch; - mental process (observation, evaluation, judgement) as a human mind can update the predictors using non-uniform underestimation (math) and is likewise capable of making predictions.
and updating a policy using the updated plurality of predictors and the minibatch. - mental process (observation, evaluation, judgement) as a human mind can update a policy.
Step 2A Prong Two: Does the claim recite additional elements that integrate the judicial exception into a practical application?
Claim 1 does not recite any additional elements. Therefore, there is nothing further to integrate the abstract idea into a practical application. 
Step 2B: Does the claim recite additional elements that amount to significantly more than the judicial exception?
Claim 1 does not recite any additional elements. Therefore, nothing in the claim amounts to significantly more than the judicial exception, and the claim is not patent eligible.
	Independent claims 9 and 17 recite the same relevant limitations, and a similar analysis applies. Claim 9 recites the additional elements of “A system comprising memory for storing instructions, and a processor configured to execute the instructions to” – components recited at a high level are construed as generic computer components used to implement the abstract idea. See MPEP 2106.05(f)(2). Claim 17 recites the additional elements of “A computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a processor of a system to cause the system to” – components recited at a high level are construed as generic computer components used to implement the abstract idea. See MPEP 2106.05(f)(2). They do not integrate the abstract idea into a practical application. Nor do they amount to significantly more. Therefore, the independent claims are not patent eligible.
	The above analysis similarly applies to the dependent claims.
Dependent claims 2 and 10 recite, “determining whether the policy satisfies a performance threshold” – mental process (observation, evaluation, judgement) as a human mind can observe if a policy satisfies a threshold.
Dependent claims 3 and 11 recite,
“sampling actions for corresponding states in the minibatch based on the policy;” - mental process (observation, evaluation, judgement) as a human mind can sample data. 
“computing quantile values for the states in the minibatch using the updated plurality of predictors;” – mathematical concept (relationships, formulas or equations, calculations) of computing quantile values
“computing risk measure for the sample actions from the quantile values;” - mathematical concept (relationships, formulas or equations, calculations) of computing risk measure
“and updating the policy so that the risk measure is maximized.” - this element constitutes “mere instructions to apply an exception.” (MPEP § 2106.05(f)). This limitation is using the abstract idea, “risk measure” that was derived by “computing quantile values” to just “apply it”/ use it for “updating a policy.”
Dependent claims 4, 12, and 18 recite,
“computing quantile values using the plurality of predictors for state-action pairs in the minibatch;” - mathematical concept (relationships, formulas or equations, calculations) of computing quantile values 
“computing an uncertainty of the quantile values based on a difference in outputs of the plurality of predictors;” - mathematical concept (relationships, formulas or equations, calculations) of computing an uncertainty of the quantile values
“and updating the plurality of predictors based on the uncertainty of the quantile values.” - this element constitutes “mere instructions to apply an exception.” (MPEP § 2106.05(f)). This limitation is using the abstract idea, “computing an uncertainty of the quantile values,” and applying it in order for “updating the plurality of predictors.”
Dependent claims 5 and 13 recite, “updating the plurality of predictors based on the uncertainty of the quantile values utilizes a Bellman backup.” - mathematical concept (relationships, formulas or equations, calculations), Bellman backup
Dependent claims 6 and 14 recite, “updating the plurality of predictors based on the uncertainty of the quantile values further utilizes quantile regression and a quantile-wise penalty based on the uncertainty.” - mathematical concept (relationships, formulas or equations, calculations) of utilizing quantile regression and quantile-wise penalty.
Dependent claims 7, 15, and 19 recite, “pessimistically estimating a quantile function of a return distribution of the plurality of predictors by shifting the quantile function according to a quantile fraction based on the uncertainty of the quantile values.” - mathematical concept (relationships, formulas or equations, calculations) of estimating a quantile function and shifting the quantile function.
Dependent claims 8, 16, and 20 recite, “the quantile function is represented by the equation F^{-1}(τ|x) = F^{-1}(τ|x) - d(x, τ), where F^{-1} is the quantile function, τ is the quantile fraction, x is a state-action pair (s,a), and d(x, τ) is an uncertainty-based non-uniform amount the quantile function is pushed down.” - mathematical concept (relationships, formulas or equations, calculations), displays an equation for the quantile function.
The dependent claims do not integrate the abstract idea into a practical application, nor do they amount to significantly more than the abstract idea. 

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1-3, 9-11, and 17 is/are rejected under 35 U.S.C. 103 as being unpatentable over Kanazawa et al. (US 2024/0013090 A1, Uncertainty-Aware Continuous Control System based on Reinforcement Learning, herein Kanazawa) in view of Zhao et al., (CN 111263332 A, Unmanned Aerial Vehicle Track And Power Combined Optimizing Method Based On Deep Reinforcement Learning, herein Zhao.)
	Regarding claim 1, 
	Kanazawa teaches:
updating a plurality of predictors, using non-uniform underestimation, based on the minibatch; (Kanazawa, Abstract, “the plurality of distributional critic networks calculates quantiles of a return distribution associated with the candidate actions in relation to the state.” note: the plurality of predictors maps to the plurality of distributional critic networks. And, Kanazawa, paragraph [0042], “The distributional critic networks 514 generates calculated quantile outputs based on recommended actions and current observation/state to the action selector 516. Outputs generated from the distributional critic networks 514 are sent to the uncertainty calculator 518 to estimate uncertainties associated with the outputs.” note: the distributional critic networks generates calculated quantile outputs… to estimate uncertainties maps to using non-uniform underestimation. And, Kanazawa, paragraph [0043], “the distributional critic is updated” note: updating a plurality of predictors maps to the distributional critic is updated.)
and updating a policy using the updated plurality of predictors and the minibatch. (Kanazawa, paragraph [0049], “The actor networks, μ(θ_k), are updated so as to increase the RAA of critics… [equation image: media_image1.png]” note: policy maps to actor network, updating policy maps to actor networks are updated, using the updated plurality of predictors maps to the critics, and the minibatch maps to the B in the equation)
Kanazawa does not explicitly teach, randomly sampling a dataset comprising historical training data between an agent and an environment to generate a minibatch of the historical training data.
Zhao teaches randomly sampling a dataset comprising historical training data between an agent and an environment to generate a minibatch of the historical training data; (Zhao, description, “the experience playback memory D transfer sample (stored in state S (t), the next state S ' (t), A (t) and the reward R (t)). in the learning process, the playback memory D in random sampling mini-batch sample (state si from experience, next state s' action ai and bonus ri) to update the actor network and critic network. wherein, mini-batch refers to randomly selecting data of small batch in the training data” note: randomly sampling maps to random sampling, historical training data between an agent and an environment maps to the experience playback memory D, and to generate a minibatch maps to mini-batch sample.)
It would be obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Zhao into Kanazawa because Kanazawa teaches updating the predictors and the policy, and those updates require data. Zhao teaches using an experience playback memory (also known as a replay buffer, i.e., historical training data) and randomly sampling from that memory to create a minibatch. Using a minibatch drawn from an experience playback memory also makes the updates more reliable, because it reuses previously evaluated rewards and actions, which helps the system learn a more stable policy.
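The random minibatch sampling that Zhao is cited for can be illustrated with a minimal Python sketch. The `ReplayBuffer` class, its method names, and the toy integer transitions are hypothetical and are not taken from Zhao, Kanazawa, or the application; the sketch only assumes uniform sampling without replacement from stored (state, action, reward, next state) tuples.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state) transitions."""

    def __init__(self, capacity):
        # Oldest transitions are dropped automatically once capacity is reached.
        self.storage = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.storage.append((state, action, reward, next_state))

    def sample_minibatch(self, batch_size, rng=random):
        # Uniform random sampling without replacement (an assumption of this sketch).
        return rng.sample(list(self.storage), batch_size)


# Fill the buffer with toy historical transitions, then draw one minibatch.
buffer = ReplayBuffer(capacity=1000)
for t in range(50):
    buffer.add(state=t, action=t % 3, reward=float(t), next_state=t + 1)

minibatch = buffer.sample_minibatch(batch_size=8, rng=random.Random(0))
```

Each update step would draw a fresh minibatch this way, so that consecutive updates are not computed from consecutive, highly correlated transitions.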
Regarding claim 2, 
The combination of Kanazawa and Zhao teaches, The method of claim 1, further comprising determining whether the policy satisfies a performance threshold. (Kanazawa, claim 8, “comparing the epistemic uncertainty against a threshold,”)
Regarding claim 3, 
The combination of Kanazawa and Zhao teaches, The method of claim 1, wherein updating the policy using the updated plurality of predictors and the minibatch comprises:
sampling actions for corresponding states in the minibatch based on the policy; (Kanazawa, paragraph [0032], “The actor networks 204 receive current observation/state s as input and generate recommended actions a as outputs to the action selector 206”, note: sampling actions maps to generate recommended actions a, for corresponding states maps to observation/state s as input, and based on the policy maps to the actor networks.) 
computing quantile values for the states in the minibatch using the updated plurality of predictors; (Kanazawa, paragraph [0078], “the plurality of distributional critic networks calculates quantiles of a return distribution… [equation image: media_image1.png]” note: computing quantile values maps to calculates quantiles, using the updated plurality of predictors maps to the plurality of distributional critic networks, and states in the minibatch maps in the equation where it states, s∈ B.)
computing risk measure for the sample actions from the quantile values; (Kanazawa, paragraph [0035], “The RAA, also known as conditional value at risk, measures and quantifies the risk associated with an action.” and paragraph [0036], “In the case of b=0.2, the associated CVaR is given by ⅙(q_1+q_2+q_3+q_4+q_5+q_6)” note: computing risk measure maps to conditional value at risk, measures and quantifies the risk, and from the quantile values maps to the formula that averages the quantiles.)
and updating the policy so that the risk measure is maximized. (Kanazawa, paragraph [0036], “Each actor network is trained to maximize such risk-sensitive averages” and paragraph [0049], “The actor networks, μ(θ_k), are updated so as to increase the RAA of critics.” note: updating the policy maps to the actor networks are updated, and so that risk measure is maximized maps to each actor network is trained to maximize such risk-sensitive averages.)
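The risk measure discussed for claim 3 can be illustrated with a small Python sketch that computes a conditional value at risk (CVaR) as the average of the lowest quantile values. The function name `cvar_from_quantiles` and the choice of 30 quantiles are assumptions made for illustration; 30 is chosen only so that a fraction of 0.2 selects six values, matching the six-term average in the quoted passage. The actual number of quantiles in Kanazawa is not stated in this action.

```python
import numpy as np


def cvar_from_quantiles(quantiles, alpha):
    """Average of the lowest ceil(alpha * N) of N quantile values.

    `alpha` is the tail fraction (0 < alpha <= 1). This is one common
    discrete approximation of CVaR, assumed here for illustration.
    """
    q = np.sort(np.asarray(quantiles, dtype=float))
    # The small epsilon guards against floating-point noise in alpha * N.
    k = max(1, int(np.ceil(alpha * q.size - 1e-9)))
    return float(q[:k].mean())


# 30 hypothetical quantile values 1, 2, ..., 30; the lowest 20% are 1..6.
example_quantiles = np.arange(1, 31)
risk = cvar_from_quantiles(example_quantiles, alpha=0.2)  # mean of 1..6 = 3.5
```

A policy update that "maximizes the risk measure" would then adjust the policy so that this tail average increases for the actions it selects.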
Claims 9-11 are system claims, A system comprising memory for storing instructions, and a processor configured to execute the instructions to: (Kanazawa, Figure 12, “FIG. 12 illustrates an example computing environment with an example computer device suitable for use in some example implementations… [figure image: media_image2.png]”), that correspond to method claims 1-3, respectively. Otherwise, they are not patentably distinguishable. Therefore, claims 9-11 are rejected for the same reasons as claims 1-3, respectively.
Claim 17 is a machine, A computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a processor of a system to cause the system to: (Kanazawa, claim 9, “computer readable medium, storing instructions for reinforcement learning (RL) of continuous actions, the instructions comprising”) that corresponds to method claim 1. Otherwise, it is not patentably distinguishable. Therefore, claim 17 is rejected for the same reasons as claim 1.
Claim(s) 4, 6-8, 12, 14-16, and 18-20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Kanazawa in view of Zhao, and in further view of Ma et al., (Conservative Offline Distributional Reinforcement Learning, herein Ma).
Regarding claim 4, 
The combination of Kanazawa and Zhao teaches, The method of claim 1, wherein updating the plurality of predictors, using non-uniform underestimation, based on the minibatch comprises:
computing an uncertainty of the quantile values based on a difference in outputs of the plurality of predictors; (Kanazawa, paragraph [0060], “[equation image: media_image3.png]”, note: computing an uncertainty maps to the formula for EU which is epistemic uncertainty, and based on the difference in outputs maps to the STD portion of the formula which is the standard deviation across the critics outputs.)
and updating the plurality of predictors based on the uncertainty of the quantile values. (Kanazawa, paragraph [0060], “[equation image: media_image3.png]”, note: plurality of predictors maps in the formula where m = 1, 2, M, and M is the multiple critics, quantile values maps to qi in the formula, and uncertainty is the formula for EU which is epistemic uncertainty.)
However, Kanazawa and Zhao do not teach, computing quantile values using the plurality of predictors for state-action pairs in the minibatch;
Ma teaches, computing quantile values using the plurality of predictors for state-action pairs in the minibatch; (Ma, page 7, section 4, “we represent the quantile function as a DNN… [equation image: media_image4.png]” note: computing quantile values maps to quantile function and state-action pairs maps to (s,a).)
It would be obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Kanazawa and Ma because Ma teaches quantile values for state-action pairs by using distributional predictors, which would allow for richer information than single value estimates. Kanazawa teaches measuring uncertainty by comparing the outputs of the multiple predictors in order to see which predictions could be unreliable. Therefore, combining Kanazawa’s uncertainty measurement with Ma’s quantile-based predictors and then using that uncertainty to determine how the predictors are updated is a way to reduce overconfidence and improve training stability.
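The uncertainty computation described for claim 4 (a disagreement measure across the outputs of several predictors) can be sketched in Python as follows. The exact EU formula in Kanazawa is an image that is not reproduced in this action, so the sketch assumes the simplest reading of the examiner's note: a standard deviation across predictors, taken separately for each quantile. The function name and array layout are hypothetical.

```python
import numpy as np


def quantile_uncertainty(ensemble_quantiles):
    """Per-quantile disagreement across an ensemble of predictors.

    ensemble_quantiles has shape (M, N): M predictors, each emitting N
    quantile values for the same state-action pair. Returns an array of
    shape (N,) holding the (population) standard deviation across the M
    predictors for each quantile.
    """
    q = np.asarray(ensemble_quantiles, dtype=float)
    return q.std(axis=0)


# Two hypothetical predictors that agree on the lowest quantile and
# disagree more and more toward the upper quantiles.
ensemble = [[1.0, 2.0, 3.0],
            [1.0, 4.0, 7.0]]
uncertainty = quantile_uncertainty(ensemble)  # array([0., 1., 2.])
```

Because the result has one entry per quantile, it can differ across quantile fractions, which is what makes a subsequent penalty or shift non-uniform.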
Regarding claim 6, 
The combination of Kanazawa, Zhao, and Ma teaches, The method of claim 4, wherein updating the plurality of predictors based on the uncertainty of the quantile values further utilizes quantile regression and a quantile-wise penalty based on the uncertainty. (Kanazawa, paragraph [0047], “ρ is an asymmetric Huber loss…  The loss function comprises a sum over all quantiles” note: the quantile regression maps to the asymmetric Huber loss, and the quantile-wise penalty maps to the loss function comprises a sum over all quantiles.)
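For readers unfamiliar with the "asymmetric Huber loss" cited for claim 6, the following Python sketch shows the quantile Huber loss commonly used in quantile-regression-based distributional reinforcement learning. It is an assumption that Kanazawa's ρ has this form, since the action quotes only its name; the function signature and the threshold `kappa` are illustrative. The quantile-wise penalty recited in the claim is left out because its form is not given in this action.

```python
import numpy as np


def quantile_huber_loss(pred_quantiles, targets, taus, kappa=1.0):
    """Quantile Huber loss, summed over quantiles and averaged over targets.

    pred_quantiles: shape (N,), predicted value for each quantile fraction.
    targets:        shape (K,), target samples of the return.
    taus:           shape (N,), quantile fractions in (0, 1).
    """
    pred = np.asarray(pred_quantiles, dtype=float)[:, None]  # (N, 1)
    tgt = np.asarray(targets, dtype=float)[None, :]          # (1, K)
    tau = np.asarray(taus, dtype=float)[:, None]             # (N, 1)

    u = tgt - pred                                           # pairwise errors
    abs_u = np.abs(u)
    # Huber term: quadratic near zero, linear beyond kappa.
    huber = np.where(abs_u <= kappa, 0.5 * u ** 2, kappa * (abs_u - 0.5 * kappa))
    # Asymmetric weight: tau when the target is above the prediction, 1 - tau otherwise.
    weight = np.abs(tau - (u < 0).astype(float))
    return float((weight * huber / kappa).mean(axis=1).sum())


# For tau = 0.9, under-predicting the target costs more than over-predicting it.
loss_under = quantile_huber_loss([0.0], [1.0], [0.9])   # 0.9 * 0.5 = 0.45
loss_over = quantile_huber_loss([0.0], [-1.0], [0.9])   # 0.1 * 0.5 = 0.05
```

The asymmetry in the weight is what pushes each output toward its own quantile of the target distribution instead of toward the mean.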
Regarding claim 7, 
The combination of Kanazawa, Zhao, and Ma teaches The method of claim 4, wherein updating the plurality of predictors based on the uncertainty of the quantile values comprises 
pessimistically estimating a quantile function of a return distribution of the plurality of predictors by shifting the quantile function according to a quantile fraction based on the uncertainty of the quantile values. (Ma, page 5, section 3, “[equation image: media_image5.png]” note: shifting the quantile function maps to moving the quantile function which happens at the subtraction (-).)
Regarding claim 8, 
The combination of Kanazawa, Zhao, and Ma teaches The method of claim 7, wherein the quantile function is represented by the equation F^{-1}(τ|x) = F^{-1}(τ|x) - d(x, τ), where F^{-1} is the quantile function, τ is the quantile fraction, x is a state-action pair (s,a), and d(x, τ) is an uncertainty-based non-uniform amount the quantile function is pushed down. (Ma, page 5, section 3, “[equation image: media_image5.png]” note: F^{-1}(τ|x) maps to F^{-1}_{Z̃^{k+1}(s,a)}(τ), which is the quantile estimate of return; x maps to (s,a), which is the state-action pair; d(x, τ) maps to c(s,a), which is the uncertainty; and “pushed down” maps to the subtraction, which shifts the function down.)
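The shift recited in claims 7 and 8, subtracting an amount d(x, τ) that depends on the quantile fraction τ, can be sketched in Python. Everything beyond the subtraction itself is an assumption made for illustration: the ensemble mean stands in for the unshifted quantile function, the per-quantile standard deviation scaled by a hypothetical coefficient `beta` stands in for d(x, τ), and the function name is invented.

```python
import numpy as np


def pessimistic_quantiles(ensemble_quantiles, beta=1.0):
    """Shift an ensemble's quantile estimates down by a per-quantile amount.

    ensemble_quantiles has shape (M, N): M predictors, N quantile fractions,
    all for one state-action pair x. Returns (shifted, d), each of shape (N,),
    where shifted[i] = mean_quantile[i] - d[i].
    """
    q = np.asarray(ensemble_quantiles, dtype=float)
    mean_q = q.mean(axis=0)     # stand-in for F^{-1}(tau | x)
    d = beta * q.std(axis=0)    # stand-in for d(x, tau); varies with tau
    return mean_q - d, d


ensemble = [[1.0, 2.0, 3.0],
            [1.0, 4.0, 7.0]]
shifted, d = pessimistic_quantiles(ensemble)
# mean = [1, 3, 5], d = [0, 1, 2], shifted = [1, 2, 3]
```

By contrast, a uniform shift such as the state-action-dependent c(s,a) cited from Ma subtracts the same amount at every τ. A per-quantile shift can in principle leave the shifted values out of increasing order, which this sketch does not correct.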
Claims 12 and 14-16 are system claims, A system comprising memory for storing instructions, and a processor configured to execute the instructions to: (Kanazawa, Figure 12, “FIG. 12 illustrates an example computing environment with an example computer device suitable for use in some example implementations… [figure image: media_image2.png]”), that correspond to method claims 4 and 6-8, respectively. Otherwise, they are not patentably distinguishable. Therefore, claims 12 and 14-16 are rejected for the same reasons as claims 4 and 6-8, respectively.
Claims 18-20 are a machine, A computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a processor of a system to cause the system to: (Kanazawa, claim 9, “computer readable medium, storing instructions for reinforcement learning (RL) of continuous actions, the instructions comprising”) that correspond to method claims 4, 7, and 8, respectively. Otherwise, they are not patentably distinguishable. Therefore, claims 18-20 are rejected for the same reasons as claims 4, 7, and 8, respectively.
Claim(s) 5 and 13 is/are rejected under 35 U.S.C. 103 as being unpatentable over Kanazawa in view of Zhao, Ma, and in further view of Yang et al., (Safety-constrained reinforcement learning with a distributional safety critic, herein Yang). 
Regarding claim 5, 
The combination of Kanazawa, Zhao, and Ma teaches The method of claim 4, wherein updating the plurality of predictors based on the uncertainty of the quantile values.
However, they do not teach, utilizes a Bellman backup.
Yang teaches utilizes a Bellman backup, (Yang, page 864 and 865, section 2.3, “Based on the distributional Bellman operator (Morimura et al., 2010; Sobel, 1982; Tamar et al., 2016)… we can get the TD error between the quantile values at quantile fractions” note: utilizes a Bellman backup maps to based on the distributional Bellman operator.)
It would be obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Kanazawa, Zhao, Ma, and Yang because Yang teaches incorporating a Bellman backup in order to update the predictors from Kanazawa, Zhao, and Ma. The Bellman backup allows for an efficient iterative policy determination, which is done by breaking up big problems into smaller sub-problems. This would allow for the updates to be completed in a more efficient manner.
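A Bellman backup applied to quantile values, as discussed for claim 5, can be sketched in Python. This is the standard sample-based form of the distributional Bellman target (reward plus discounted next-state quantiles); whether Yang or the application uses exactly this form is an assumption, and the function name, `gamma`, and the `done` flag are illustrative.

```python
import numpy as np


def distributional_bellman_target(reward, next_quantiles, gamma=0.99, done=False):
    """Target quantiles for (s, a): r + gamma * quantiles of the return at (s', a').

    next_quantiles: shape (N,), quantile values predicted for the next
    state-action pair. If the episode ended, only the reward remains.
    """
    nq = np.asarray(next_quantiles, dtype=float)
    discount = 0.0 if done else gamma
    return reward + discount * nq


# Hypothetical numbers: reward 1, next-state quantiles [0, 2, 4], gamma 0.5.
target = distributional_bellman_target(1.0, [0.0, 2.0, 4.0], gamma=0.5)
# target == [1.0, 2.0, 3.0]
```

The temporal-difference error mentioned in the quoted passage from Yang would then be the difference between these target quantiles and the current predictor's quantiles for the same state-action pair.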
Claim 13 is a system claim, A system comprising memory for storing instructions, and a processor configured to execute the instructions to: (Kanazawa, Figure 12, “FIG. 12 illustrates an example computing environment with an example computer device suitable for use in some example implementations… [figure image: media_image2.png]”), that corresponds to method claim 5. Otherwise, it is not patentably distinguishable. Therefore, claim 13 is rejected for the same reasons as claim 5.


Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to UYEN-NHU PHAM TRAN whose telephone number is (571)272-1559. The examiner can normally be reached Monday - Friday 8am-5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang can be reached at (571) 270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/U.P.T./Examiner, Art Unit 2124                                                                                                                                                                                                        
/MIRANDA M HUANG/Supervisory Patent Examiner, Art Unit 2124