DETAILED ACTION
This action is responsive to the claims filed on 10/14/2025. Claims 23-42 are pending and have been examined.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 04/19/2024 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 23-42 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Statutory Categories
Claims 23-35 are directed to a method.
Claim 36 is directed to a system.
Claims 37-42 are directed to a computer-readable medium.
Independent Claims 23, 36, and 37
Step 2A Prong 1: Does the claim recite an abstract idea, law of nature, or natural phenomenon?
Yes. Independent claims 23, 36, and 37 recite limitations that are abstract ideas in the form of mental processes and mathematical concepts:
Claim 23 recites:
for each experience tuple, determining a state value for the second observation in the experience tuple by importance sampling; (this limitation merely comprises a mathematical analysis of data and is being considered as directed to a mathematical concept, see MPEP 2106.04(a), paragraph 9 of this application’s specification outlines the mathematical procedure for this step)
and determining an update to current values of a set of Q neural network parameters of the Q neural network using the state values for the second observations in the experience tuples; (this limitation merely comprises a mathematical analysis of data and is being considered as directed to a mathematical concept, see MPEP 2106.04(a), paragraphs 9-24 of this application’s specification outline the mathematical procedure for this step)
the method further comprising: determining the state value for the second observation as a linear combination of the Q values for importance sampled actions by: (this limitation merely comprises a mathematical analysis of data and is being considered as directed to a mathematical concept, see MPEP 2106.04(a), paragraphs 9-11 of this application’s specification outline the mathematical procedure for this step)
determining a temperature factor based on the Q values for the sampled actions; (this limitation recites determining a value based on predetermined values, stated at a high level with no further indication as to how the determination should be performed, which can reasonably be performed as a mental process or with aid of pen and paper)
determining a respective modified Q value for each sampled action as a ratio of: (i) the Q value for the sampled action, and (ii) the temperature factor; (this limitation merely comprises a mathematical analysis of data and is being considered as directed to a mathematical concept, see MPEP 2106.04(a), paragraph 24 of this application’s specification outlines the mathematical procedure for this step)
applying a softmax function to the modified Q values to determine a weight factor for each sampled action; (this limitation merely comprises a mathematical analysis of data and is being considered as directed to a mathematical concept, see MPEP 2106.04(a), paragraph 8 of this application’s specification outlines the mathematical procedure for this step)
and determining the state value for the second observation as a linear combination of the Q values for the sampled action, wherein the Q value for each sampled action is scaled by the weight factor for the sampled action. (this limitation merely comprises a mathematical analysis of data and is being considered as directed to a mathematical concept, see MPEP 2106.04(a), paragraph 9 of this application’s specification outlines the mathematical procedure for this step)
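For illustration only, the following is a minimal sketch (not taken from the application or any cited reference; all names and values are hypothetical) of the softmax-weighted state-value computation recited above: modified Q values are formed as ratios of Q values to a temperature factor, a softmax yields a weight factor per sampled action, and the state value is the resulting linear combination.

import numpy as np

def state_value(q_values: np.ndarray, temperature: float) -> float:
    """Return the state value as a softmax-weighted linear combination of Q values."""
    modified_q = q_values / temperature                # ratio of each Q value to the temperature factor
    shifted = modified_q - modified_q.max()            # shift for numerical stability
    weights = np.exp(shifted) / np.exp(shifted).sum()  # softmax weight factor for each sampled action
    return float(np.dot(weights, q_values))            # linear combination of Q values scaled by the weights

q = np.array([1.0, 2.0, 0.5])                          # Q values for M = 3 sampled actions
print(state_value(q, temperature=0.5))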
This claim further recites the following additional elements for the purposes of Step 2A Prong Two analysis:
A method performed by one or more data processing apparatus for training an action selection system that is used to select actions to be performed by an agent interacting with an environment to perform a task, wherein the action selection system comprises a Q neural network and a policy neural network, (this limitation invokes Q neural networks and policy neural networks merely as a tool to perform an existing process and is considered as mere instructions to apply an exception, see MPEP 2106.05(f))
the method comprising, at each of a plurality of iterations: obtaining a batch of experience tuples characterizing previous interactions of a simulated or real-world version of the agent with the environment from a replay buffer, wherein each experience tuple comprises: (i) a first observation characterizing a state of the environment, (ii) an action performed by the agent in response to the first observation, (iii) a second observation characterizing a state of the environment after the agent performs the action in response to the first observation, and (iv) a reward received as a result of the agent performing the action in response to the first observation; (this limitation merely comprises data gathering and is considered insignificant extra-solution activity under MPEP 2106.05(g))
The additional limitations fail Step 2A Prong Two of the § 101 analysis because they do not integrate the abstract idea into a practical application. As set forth above, the neural networks are invoked merely as tools to perform the abstract idea (MPEP 2106.05(f)), and the data gathering is insignificant extra-solution activity (MPEP 2106.05(g)); neither reflects a technical improvement or otherwise applies the judicial exception in a meaningful way. Since the claim as a whole, looking at the additional elements individually and in combination, does not contain any other additional elements that are indicative of integration into a practical application, the claim is directed to an abstract idea.
This claim recites the following additional elements for the purposes of Step 2B analysis:
A method performed by one or more data processing apparatus for training an action selection system that is used to select actions to be performed by an agent interacting with an environment to perform a task, wherein the action selection system comprises a Q neural network and a policy neural network, (this limitation invokes Q neural networks and policy neural networks merely as a tool to perform an existing process and is considered as mere instructions to apply an exception, see MPEP 2106.05(f))
the method comprising, at each of a plurality of iterations: obtaining a batch of experience tuples characterizing previous interactions of a simulated or real-world version of the agent with the environment from a replay buffer, wherein each experience tuple comprises: (i) a first observation characterizing a state of the environment, (ii) an action performed by the agent in response to the first observation, (iii) a second observation characterizing a state of the environment after the agent performs the action in response to the first observation, and (iv) a reward received as a result of the agent performing the action in response to the first observation; (this limitation merely comprises data gathering and is considered insignificant extra-solution activity under MPEP 2106.05(g), for the purposes of step 2B, the insignificant extra-solution activity is receiving or transmitting data and is considered well-understood, routine, and conventional activity under MPEP 2106.05(d)(II) with buySAFE, Inc. v. Google, Inc., 765 F.3d 1350, 1355, 112 USPQ2d 1093, 1096 (Fed. Cir. 2014) (computer receives and sends information over a network))
The claim also fails Step 2B of the analysis because the additional limitations do not amount to significantly more than the abstract idea itself; they merely elaborate on the core concept without adding an inventive concept or technical substance. Considering the additional elements individually and in combination, and the claim as a whole, the additional elements do not provide significantly more than the abstract idea. Therefore, the claim is not patent eligible.
Claims 36 and 37 recite additional limitations for consideration:
A system comprising: one or more computers; and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform (Under step 2A prong II and step 2B, this limitation invokes computers and machinery merely as a tool to perform an existing process and is considered as mere instructions to apply an exception using generic computer, see MPEP 2106.05(f))
One or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform (Under step 2A prong II and step 2B, this limitation invokes computers and machinery merely as a tool to perform an existing process and is considered as mere instructions to apply an exception using generic computer, see MPEP 2106.05(f))
Dependents of Claims 23, 36, and 37
The remaining dependent claims corresponding to independent claims 23, 36, and 37 do not recite additional elements, whether considered individually or in combination, that are sufficient to integrate the judicial exception into a practical application or amount to significantly more than the judicial exception. The analysis is shown below:
The claims below recite additional limitations which fail Step 2A Prong Two of the § 101 analysis because they do not integrate the judicial exception into a practical application; the limitations are either themselves abstract or lack a technical improvement that would apply the exception in a meaningful way within a particular field.
The claims also fail Step 2B of the analysis because the additional limitations do not amount to significantly more than the abstract idea itself; they merely elaborate on the core concept without adding an inventive concept or technical substance. The claims are therefore not patent eligible.
Claim 24 recites the further limitation of:
The method of claim 23, wherein the state value for the second observation is computed as:
V^π(s′) = Σ_{j=1}^{M} w_j · Q(s′, a_j)
wherein V^π(s′) is the state value for the second observation, j indexes the sampled actions, M is a number of sampled actions, w_j is the weight factor for sampled action a_j, and Q(s′, a_j) is the Q value for sampled action a_j (this limitation merely comprises a mathematical analysis of data and is being considered as directed to a mathematical concept, see MPEP 2106.04(a))
and each weight factor w_j is computed as:
w_j = exp(Q(s′, a_j)/η) / Σ_{k=1}^{M} exp(Q(s′, a_k)/η)
wherein k indexes the sampled actions and η is the temperature factor. (this limitation merely comprises a mathematical analysis of data and is being considered as directed to a mathematical concept, see MPEP 2106.04(a))
Since the claim does not recite additional elements that either integrate the judicial exception into a practical application or provide significantly more than the judicial exception, the claim is not patent eligible.
Claim 25 recites the further limitation of:
The method of claim 23, wherein determining the temperature factor based on the Q values for the sampled actions comprises, at each of one or more optimization iterations: determining a gradient of a dual function with respect to the temperature factor, wherein the dual function depends on: (i) the temperature factor, and (ii) the Q values for the sampled actions; (this limitation merely comprises a mathematical analysis of data and is being considered as directed to a mathematical concept, see MPEP 2106.04(a), paragraph 11 of this application’s specification outlines the mathematical procedure for this step)
adjusting a current value of the temperature factor using the gradient of the dual function with respect to the temperature factor. (this limitation merely comprises a mathematical analysis of data and is being considered as directed to a mathematical concept, see MPEP 2106.04(a), paragraph 11 of this application’s specification outlines the mathematical procedure for this step)
Since the claim does not recite additional elements that either integrate the judicial exception into a practical application or provide significantly more than the judicial exception, the claim is not patent eligible.
Claim 26 recites the further limitation of:
The method of claim 25, wherein the dual function is computed as:
g(η) = η·ϵ + η · (1/|B|) · Σ_{s′∈B} log( (1/M) · Σ_{j=1}^{M} exp(Q(s′, a_j)/η) )
wherein g(η) is the dual function evaluated for temperature factor η, |B| denotes a number of experience tuples in the batch of experience tuples, ϵ is a regularization parameter, j indexes the sampled actions, M is a number of sampled actions, and Q(s′, a_j) is the Q value for sampled action a_j. (this limitation merely comprises a mathematical analysis of data and is being considered as directed to a mathematical concept)
Since the claim does not recite additional elements that either integrate the judicial exception into a practical application or provide significantly more than the judicial exception, the claim is not patent eligible.
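For illustration only, a minimal sketch (not from the application or the cited references; the finite-difference gradient, step size, and all names are illustrative assumptions) of the temperature-determination procedure of claims 25-26: the dual function g(η) reconstructed above is evaluated on a batch of Q values, and the temperature factor is adjusted by gradient steps on that dual.

import numpy as np

def dual(eta, q_batch, epsilon):
    # g(eta) = eta*epsilon + eta * mean over the batch of log(mean over actions of exp(Q/eta))
    m = q_batch / eta                                  # q_batch: |B| x M Q values for sampled actions
    mx = m.max(axis=1, keepdims=True)
    log_mean_exp = np.log(np.mean(np.exp(m - mx), axis=1)) + mx[:, 0]
    return eta * epsilon + eta * float(np.mean(log_mean_exp))

def fit_temperature(q_batch, epsilon=0.1, eta=1.0, lr=0.01, steps=100, h=1e-5):
    for _ in range(steps):
        # numerical gradient of the dual function with respect to the temperature factor
        grad = (dual(eta + h, q_batch, epsilon) - dual(eta - h, q_batch, epsilon)) / (2 * h)
        eta = max(eta - lr * grad, 1e-6)               # gradient step; keep the temperature positive
    return eta

q_batch = np.random.default_rng(0).normal(size=(8, 4))
print(fit_temperature(q_batch))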
Claim 27 recites the further limitation of:
The method of claim 23, wherein determining an update to current values of a set of Q neural network parameters of the Q neural network using the state values for the second observations in the experience tuples comprises: for each experience tuple: processing a first observation in the experience tuple using the Q neural network to generate a Q value for the action in the experience tuple; (this limitation merely comprises a mathematical analysis of data and is being considered as directed to a mathematical concept, see MPEP 2106.04(a), paragraphs 16-19 of this application’s specification outline the mathematical procedure for this step)
and determining a target Q value for the action in the experience tuple using the state value for the second observation in the experience tuple; (this limitation merely comprises a mathematical analysis of data and is being considered as directed to a mathematical concept, see MPEP 2106.04(a), paragraphs 16-19 of this application’s specification outline the mathematical procedure for this step)
determining a gradient of a Q objective function that, for each experience tuple, measures an error between: (i) the Q value for the action in the experience tuple, and (ii) the target Q value for the action in the experience tuple; (this limitation merely comprises a mathematical analysis of data and is being considered as directed to a mathematical concept, see MPEP 2106.04(a), paragraphs 16-19 of this application’s specification outline the mathematical procedure for this step)
and determining the update to the current values of the set of Q neural network parameters using the gradient. (this limitation merely comprises a mathematical analysis of data and is being considered as directed to a mathematical concept, see MPEP 2106.04(a), paragraphs 16-19 of this application’s specification outline the mathematical procedure for this step)
Since the claim does not recite additional elements that either integrate the judicial exception into a practical application or provide significantly more than the judicial exception, the claim is not patent eligible.
Claim 28 recites the further limitation of:
The method of claim 27, wherein determining the target Q value for the action in the experience tuple using the state value for the second observation in the experience tuple comprises: determining the target Q value as a sum of: (i) the reward in the experience tuple, and (ii) a product of a discount factor and the state value for the second observation in the experience tuple; and/or wherein the error between: (i) the Q value for the action in the experience tuple, and (ii) the target Q value for the action in the experience tuple, comprises a squared error between: (i) the Q value for the action in the experience tuple, and (ii) the target Q value for the action in the experience tuple. (this limitation merely comprises a mathematical analysis of data and is being considered as directed to a mathematical concept, see MPEP 2106.04(a), paragraph 15 of this application’s specification outlines the mathematical procedure for this step)
Since the claim does not recite additional elements that either integrate the judicial exception into a practical application or provide significantly more than the judicial exception, the claim is not patent eligible.
Claim 29 recites the further limitation of:
The method of claim 28, wherein the Q objective function is computed as:
L_Q(ϕ) = (1/|B|) · Σ_{(s,a,r,s′)∈B} ( r + γ·V^π(s′) − Q_ϕ(a,s) )²
wherein |B| is a number of experience tuples in the batch of experience tuples, each (s, a, r, s′) is an experience tuple in the batch of experience tuples B, wherein s is the first observation, a is the action, r is the reward, and s′ is the second observation, γ is a discount factor, V^π(s′) is the state value for the second observation in the experience tuple, and Q_ϕ(a,s) is the Q value for the action in the experience tuple. (this limitation merely comprises a mathematical analysis of data and is being considered as directed to a mathematical concept, see MPEP 2106.04(a), paragraph 19 of this application’s specification outlines the mathematical procedure for this step)
Since the claim does not recite additional elements that either integrate the judicial exception into a practical application or provide significantly more than the judicial exception, the claim is not patent eligible.
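For illustration only, a minimal sketch (not from the application or the cited references; the linear model standing in for the Q neural network and all names are illustrative assumptions) of the update recited in claims 27-29: a target r + γ·V(s′) is formed from the state value for the second observation, and the parameters are updated along the gradient of the squared error between Q(s,a) and that target.

import numpy as np

def q_update(phi, batch, gamma=0.99, lr=0.1):
    # batch entries: (features for (s, a), reward r, state value V(s') for the second observation)
    grad = np.zeros_like(phi)
    for feats, r, v_next in batch:
        q_sa = phi @ feats                     # Q value for the action in the tuple (linear stand-in)
        target = r + gamma * v_next            # target Q value from the state value
        grad += 2.0 * (q_sa - target) * feats  # gradient of the squared error for this tuple
    return phi - lr * grad / len(batch)        # gradient-based update of the Q parameters

phi = np.zeros(3)
batch = [(np.array([1.0, 0.0, 1.0]), 1.0, 0.5),
         (np.array([0.0, 1.0, 1.0]), 0.0, 0.2)]
print(q_update(phi, batch))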
Claim 30 recites the further limitation of:
The method of claim 23, further comprising, at each of the plurality of iterations, determining an update to current values of a set of policy neural network parameters of the policy neural network, comprising: for each experience tuple: processing the first observation in the experience tuple using the Q neural network to generate a Q value for the action in the experience tuple; determining a state value for the first observation in the experience tuple; (this limitation merely comprises a mathematical analysis of data and is being considered as directed to a mathematical concept, see MPEP 2106.04(a), paragraphs 16-19 of this application’s specification outline the mathematical procedure for this step)
and determining an advantage value for the experience tuple as a difference between: (i) the Q value for the action in the experience tuple, and (ii) the state value for the first observation in the experience tuple; (this limitation merely comprises a mathematical analysis of data and is being considered as directed to a mathematical concept, see MPEP 2106.04(a), paragraphs 16-19 of this application’s specification outline the mathematical procedure for this step)
and determining the update to the current values of the set of policy neural network parameters of the policy neural network based on only the experience tuples having a non-negative advantage value. (this limitation merely comprises a mathematical analysis of data and is being considered as directed to a mathematical concept, see MPEP 2106.04(a), paragraphs 16-19 of this application’s specification outline the mathematical procedure for this step)
Since the claim does not recite additional elements that either integrate the judicial exception into a practical application or provide significantly more than the judicial exception, the claim is not patent eligible.
Claim 31 recites the further limitation of:
The method of claim 30, wherein determining the update to the current values of the set of policy neural network parameters of the policy neural network based on only the experience tuples having a non-negative advantage value comprises: determining a gradient of a policy objective function that depends on only the experience tuples having a non-negative advantage value; and determining the update to the current values of the set of policy neural network parameters using the gradient. (this limitation merely comprises a mathematical analysis of data and is being considered as directed to a mathematical concept, see MPEP 2106.04(a), paragraphs 16-19 of this application’s specification outline the mathematical procedure for this step)
Since the claim does not recite additional elements that either integrate the judicial exception into a practical application or provide significantly more than the judicial exception, the claim is not patent eligible.
Claim 32 recites the further limitation of:
The method of claim 31, wherein for each experience tuple having a non-negative advantage value, the policy objective function depends on an action score for the action in the experience tuple that is generated by processing a first observation in the experience tuple using the policy neural network; in particular wherein the policy objective function is computed as:
L_π(θ) = (1/|B|) · Σ_{(s,a,r)∈B} I[A^π(a,s) ≥ 0] · log π_θ(a|s)
wherein |B| is a number of experience tuples in the batch of experience tuples, each (s, a, r) is an experience tuple in the batch of experience tuples B, wherein s is the first observation, a is the action, and r is the reward, I[·] is an indicator function, A^π(a,s) is an advantage value for the experience tuple, and π_θ(a|s) is the action score for the action in the experience tuple that is generated by processing the first observation in the experience tuple using the policy network. (this limitation merely comprises a mathematical analysis of data and is being considered as directed to a mathematical concept, see MPEP 2106.04(a))
Since the claim does not recite additional elements that either integrate the judicial exception into a practical application or provide significantly more than the judicial exception, the claim is not patent eligible.
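For illustration only, a minimal sketch (not from the application or the cited references; log_pi, q_sa, and v_s are illustrative stand-ins for policy network and Q network outputs) of the advantage-filtered policy objective of claims 30-32: advantages A = Q(s,a) − V(s) are computed per tuple, and only tuples with a non-negative advantage contribute to the objective.

import numpy as np

def filtered_policy_objective(log_pi, q_sa, v_s):
    advantage = q_sa - v_s                  # A(a,s) = Q(s,a) - V(s) for each experience tuple
    keep = (advantage >= 0).astype(float)   # indicator I[A(a,s) >= 0]
    return float(np.mean(keep * log_pi))    # objective over the batch; tuples with A < 0 drop out

log_pi = np.log(np.array([0.2, 0.5, 0.1]))  # log action scores from the policy network
q_sa = np.array([1.0, 0.3, 0.8])            # Q values for the actions in the tuples
v_s = np.array([0.5, 0.4, 1.0])             # state values; the third tuple is filtered out (A < 0)
print(filtered_policy_objective(log_pi, q_sa, v_s))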
Claim 33 recites the further limitation of:
The method of claim 23, further comprising, at each of one or more of the plurality of iterations: generating a plurality of new experience tuples using the action selection system, an expert action selection policy, or both; and adding the new experience tuples to the replay buffer; (generating experience tuples by merely using an action selection system is being considered a mental process of evaluation which can reasonably be performed in the human mind)
wherein generating a plurality of new experience tuples comprises, at each of one or more time steps: receiving a current observation for the time step; (For the purposes of Step 2A Prong II and Step 2B: this limitation merely recites data gathering steps which is considered insignificant extra-solution activity under MPEP 2106.05(g), for the purposes of step 2B, the insignificant extra-solution activity is receiving or transmitting data and is considered well-understood, routine, and conventional activity under MPEP 2106.05(d)(II) with buySAFE, Inc. v. Google, Inc., 765 F.3d 1350, 1355, 112 USPQ2d 1093, 1096 (Fed. Cir. 2014) (computer receives and sends information over a network))
selecting an action to be performed by the agent at the time step using the action selection system or the expert action selection policy; (selecting an action by merely using an action selection system is being considered a mental process of evaluation which can reasonably be performed in the human mind)
receiving a next observation and a reward resulting from the agent performing the selected action; (For the purposes of Step 2A Prong II and Step 2B: this limitation merely recites data gathering steps which is considered insignificant extra-solution activity under MPEP 2106.05(g), for the purposes of step 2B, the insignificant extra-solution activity is receiving or transmitting data and is considered well-understood, routine, and conventional activity under MPEP 2106.05(d)(II) with buySAFE, Inc. v. Google, Inc., 765 F.3d 1350, 1355, 112 USPQ2d 1093, 1096 (Fed. Cir. 2014) (computer receives and sends information over a network))
and generating a new experience tuple comprising the current observation, the selected action, the next observation, and the reward; (generating experience tuples by merely using an action selection system is being considered a mental process of evaluation which can reasonably be performed in the human mind)
in particular wherein selecting the action to be performed by the agent at the time step using the action selection system or the expert action selection policy comprises stochastically selecting between using the action selection system or the expert action selection policy to select the action to be performed by the agent at the time step. (a selection being further limited to a stochastic type is still being considered a mental process of evaluation which can reasonably be performed in the human mind)
Since the claim does not recite additional elements that either integrate the judicial exception into a practical application or provide significantly more than the judicial exception, the claim is not patent eligible.
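For illustration only, a minimal sketch (not from the application or the cited references; env, policy_action, and expert_action are hypothetical stand-ins, and env.step is assumed to return a next observation and a reward) of the experience-collection procedure of claim 33, including the stochastic choice between the action selection system and an expert policy at each time step.

import random

def collect(env, policy_action, expert_action, replay_buffer, steps, p_expert=0.3):
    obs = env.reset()                        # current observation for the first time step
    for _ in range(steps):
        # stochastically select between the action selection system and the expert policy
        if random.random() < p_expert:
            action = expert_action(obs)
        else:
            action = policy_action(obs)
        next_obs, reward = env.step(action)  # next observation and reward from the environment
        replay_buffer.append((obs, action, next_obs, reward))  # new experience tuple
        obs = next_obs
    return replay_buffer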
Claim 34 recites the further limitation of:
The method of claim 33, wherein selecting an action to be performed by the agent at a time step using the action selection system comprises: processing the current observation for the time step using the policy neural network to generate a respective action score for each action in the set of possible actions; (this limitation merely comprises a mathematical analysis of data and is being considered as directed to a mathematical concept, see MPEP 2106.04(a), paragraphs 18-19 of this application’s specification outline the mathematical procedure for this step)
processing the current observation for the time step using the Q neural network to generate a respective Q value for each action in the set of possible actions; (this limitation merely comprises a mathematical analysis of data and is being considered as directed to a mathematical concept, see MPEP 2106.04(a), paragraphs 21-24 of this application’s specification outline the mathematical procedure for this step)
determining a final action score for each action based on: (i) the action score for the action, and (ii) the Q value for the action; (this limitation merely comprises a mathematical analysis of data and is being considered as directed to a mathematical concept, see MPEP 2106.04(a), paragraph 24 of this application’s specification outlines the mathematical procedure for this step)
and selecting the action to be performed by the agent in accordance with the final action scores. (a selection being further limited to be based on predetermined values is still being considered a mental process of evaluation which can reasonably be performed in the human mind)
Since the claim does not recite additional elements that either integrate the judicial exception into a practical application or provide significantly more than the judicial exception, the claim is not patent eligible.
Claim 35 recites the further limitation of:
The method of claim 34, wherein the final action score for an action is computed as:
π(a|s) · exp(Q(s,a)/η)
wherein π(a|s) is the action score for the action, Q(s,a) is the Q value for the action, and η is a temperature parameter. (this limitation merely comprises a mathematical analysis of data and is being considered as directed to a mathematical concept, see MPEP 2106.04(a))
Since the claim does not recite additional elements that either integrate the judicial exception into a practical application or provide significantly more than the judicial exception, the claim is not patent eligible.
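For illustration only, a minimal sketch (not from the application or the cited references; all inputs and names are illustrative assumptions) of the final-action-score computation of claims 34-35: each policy action score is multiplied by exp(Q/η), and the agent's action is selected from the normalized final scores.

import numpy as np

def select_action(action_scores, q_values, eta, rng=None):
    rng = rng or np.random.default_rng()
    final = action_scores * np.exp(q_values / eta)  # final score: pi(a|s) * exp(Q(s,a)/eta)
    probs = final / final.sum()                     # normalize the final scores for sampling
    return int(rng.choice(len(probs), p=probs))     # select an action per the final scores

pi = np.array([0.6, 0.3, 0.1])    # action scores from the policy neural network
q = np.array([0.2, 1.0, -0.5])    # Q values from the Q neural network
print(select_action(pi, q, eta=0.5))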
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 23-24, 27-28, 30-38, and 41-42 are rejected under 35 U.S.C. 103 as being unpatentable over Wang et al. (Wang, Z., Novikov, A., Zolna, K., Merel, J. S., Springenberg, J. T., Reed, S. E., ... & De Freitas, N. (2020). Critic regularized regression. Advances in Neural Information Processing Systems, 33, 7768-7778.), hereafter referred to as Wang, in view of Haarnoja et al. (Haarnoja, T., Tang, H., Abbeel, P., & Levine, S. (2017, July). Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning (pp. 1352-1361). PMLR.), hereafter referred to as Haarnoja, and in further view of Asadi et al. (Asadi, K., & Littman, M. L. (2017, July). An alternative softmax operator for reinforcement learning. In International Conference on Machine Learning (pp. 243-252). PMLR.), hereafter referred to as Asadi.
Claim 23: Wang teaches the following limitations:
A method performed by one or more data processing apparatus for training an action selection system that is used to select actions to be performed by an agent interacting with an environment to perform a task, wherein the action selection system comprises a Q neural network and a policy neural network, (Wang, page 2, figure 1,
Figure 1 of Wang, the main idea behind CRR (Critic Regularized Regression)
“Figure 1: Illustration of the main idea behind CRR. The task is to reach the reward from the starting position as fast as possible. Consider learning a policy from the suboptimal (red/green) trajectory. For every state st, the action proposed by the current (suboptimal) policy π(st) is shown with black arrows. CRR compares the critic prediction of the value Q(st, at) of the action at from the trajectory against the value Q(st, π(st)) of the action from the policy π. If Q(st, at) ≥ Q(st, π(st)), the corresponding action is marked green and the pair (st, at) is used to train the policy”
page 4, paragraph 1, “Algorithm 1: Critic Regularized Regression… Input: Dataset B, critic net Qθ, actor net πφ, target actor and critic nets: πφ0 , Qθ , function f”, Wang explicitly identifies an actor network (the policy network) and a critic network (the Q network).)
the method further comprising: determining the state value for the second observation as a linear combination of the Q values for importance sampled actions by: determining a temperature factor based on the Q values for the sampled actions; (Wang, page 5, paragraph 3, “We first sample actions a1:n from πφ(·|s), weight the different actions by their importance weights exp(Qθ(s, ai)/β)”, Wang explicitly uses a parameter β inside an exponential weighting term exp(Q(s, ai)/β) while weighting sampled actions using their Q values. In the claim, the “temperature factor” is the factor used in combination with the Q values to control the weighting; Wang’s β is that temperature factor because it appears in the denominator with the Q values (i.e., Q/β) in the weighting expression.)
determining a respective modified Q value for each sampled action as a ratio of: (i) the Q value for the sampled action, and (ii) the temperature factor; (Wang, page 5, paragraph 3, “We first sample actions a1:n from πφ(·|s), weight the different actions by their importance weights exp(Qθ(s, ai)/β)”, Wang’s importance-weight expression exp(Q(s, ai)/β) explicitly contains Q(s, ai)/β, which is the claimed “modified Q value” formed as a ratio of the Q value for a sampled action (Q(s, ai)) divided by a temperature factor (β).)
applying a softmax function to the modified Q values to determine a weight factor for each sampled action; (Wang, page 5, paragraph 3, “…and finally choose an action by re-sampling with probabilities
P(a_i) = exp(Qθ(s, a_i)/β) / Σ_j exp(Qθ(s, a_j)/β)
”, Wang explicitly normalizes exponentiated Q(s, ai)/β values into per-action probabilities P(ai) by dividing exp(Q(s, ai)/β) by the sum of exp(Q(s, aj)/β) over sampled actions. It is interpreted that this is the claimed “softmax function” output (a normalized exponential weight) and Wang’s P(ai) is the claimed “weight factor for each sampled action.”)
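For illustration only, a minimal sketch (not Wang's code; policy_sample and q_fn are hypothetical stand-ins) of the sample-weight-resample procedure quoted above: n actions are drawn from the policy, weighted by exp(Q(s, a_i)/β), and one action is re-sampled with the normalized probabilities P(a_i).

import numpy as np

def resample_action(policy_sample, q_fn, s, n=10, beta=1.0, rng=None):
    rng = rng or np.random.default_rng()
    actions = [policy_sample(s) for _ in range(n)]  # a_1:n sampled from pi(.|s)
    q = np.array([q_fn(s, a) for a in actions])
    w = np.exp((q - q.max()) / beta)                # importance weights exp(Q/beta), stabilized
    return actions[rng.choice(n, p=w / w.sum())]    # re-sample one action with probabilities P(a_i)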
Haarnoja, in the same field of Q network implementation, teaches the following limitations, which Wang fails to teach:
the method comprising, at each of a plurality of iterations: obtaining a batch of experience tuples characterizing previous interactions of a simulated or real-world version of the agent with the environment from a replay buffer, wherein each experience tuple comprises: (i) a first observation characterizing a state of the environment, (ii) an action performed by the agent in response to the first observation, (iii) a second observation characterizing a state of the environment after the agent performs the action in response to the first observation, and (iv) a reward received as a result of the agent performing the action in response to the first observation; (Haarnoja, page 5, algorithm 1,
Algorithm 1 of Haarnoja, Soft-Q function
Haarnoja’s Algorithm 1 teaches that each stored experience includes a first observation (state), an action taken in response to that first observation, a resulting next observation, and a reward. Specifically, Algorithm 1 describes collecting experience by selecting an action “a_t” for the current state “s_t,” sampling the next state “s_(t+1)” from the environment after taking the action, and then saving the new experience in the replay memory as: “D ← D ∪ {(s_t, a_t, r(s_t, a_t), s_(t+1))}.” In this explicit tuple, “s_t” corresponds to the claimed “first observation,” “a_t” corresponds to the claimed “action performed … in response to the first observation,” “s_(t+1)” corresponds to the claimed “second observation … after the agent performs the action,” and “r(s_t, a_t)” corresponds to the claimed “reward received as a result of the agent performing the action.”)
for each experience tuple, determining a state value for the second observation in the experience tuple by importance sampling; (Haarnoja, page 4, section 3.2, “To convert Theorem 3 into a stochastic optimization problem, we first express the soft value function in terms of an expectation via importance sampling
V^θ_soft(s_t) = α log E_{a′∼q_{a′}}[ exp( (1/α)·Q^θ_soft(s_t, a′) ) / q_{a′}(a′) ]
”, Haarnoja uses importance sampling to compute a value function (state value) as an expectation using an importance-sampling form.)
and determining an update to current values of a set of Q neural network parameters of the Q neural network using the state values for the second observations in the experience tuples; (Haarnoja, page 5, algorithm 1, as shown in Haarnoja’s algorithm 1 above, Haarnoja explicitly computes a next-state value for s(t+1) (the “second observation”), computes the Q-loss gradient, then updates θ (Q parameters) using that value.)
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to incorporate Haarnoja’s replay-buffer minibatch training into Wang’s critic-regularized actor learning, because both references train neural-network-based reinforcement learning systems using experience data, and Haarnoja’s use of a replay memory and minibatch sampling provides a way to train the networks using batches of stored experience transitions within such a system. A motivation of which would have been to enable Wang’s training to use stored experience transitions and minibatch sampling during iterative updates. Haarnoja supports this motivation by expressly stating that experience is stored in replay memory and that parameters are updated using random minibatches from that memory: “the experience is stored in a replay memory buffer D … the parameters are updated using random minibatches from this memory.” (Haarnoja, page 5, section 3.4)
Asadi, in the same field of reinforcement learning, teaches the following limitations which Wang fails to teach:
and determining the state value for the second observation as a linear combination of the Q values for the sampled action, wherein the Q value for each sampled action is scaled by the weight factor for the sampled action. (Asadi, page 1, col. 2, “
boltz_β(X) = Σ_{i=1}^{n} x_i · e^{β·x_i} / Σ_{i=1}^{n} e^{β·x_i}
”, Asadi defines the Boltzmann softmax operator boltzβ(X), which is a linear combination of the inputs xi, where each input xi is scaled by a corresponding normalized exponential weight e^{βxi}/(∑ e^{βx}). If the vector X is instantiated as the set of Q values for the sampled actions at the second observation (i.e., xi = Q(s′, a_i)), then Asadi’s operator yields the claimed “state value” computed as a weighted sum of those Q values, scaled by the corresponding weight factors.)
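For illustration only, a minimal sketch (not Asadi's code) of the Boltzmann softmax operator quoted above, applied here to a hypothetical vector of Q values.

import numpy as np

def boltz(x, beta):
    w = np.exp(beta * (x - x.max()))      # stabilized exponential weights e^(beta*x_i)
    return float(np.dot(x, w / w.sum()))  # linear combination of x_i scaled by normalized weights

q_next = np.array([1.0, 2.0, 3.0])        # e.g., Q values for sampled actions at s'
print(boltz(q_next, beta=5.0))            # approaches max(x) as beta grows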
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to incorporate Asadi’s explicit normalized exponential weighting operator into the combination of Wang and Haarnoja because Wang applies normalized exponential weighting based on Q values (e.g., exp(Q/β) normalized across actions) and Asadi provides an explicit formula for computing a normalized exponential weighting and its associated weighted combination of values. A motivation of which would have been to implement the normalized exponential weighting and weighted combination in a direct, explicit operator form consistent with the Q-based weighting described in Wang. Asadi supports this motivation by explicitly defining the Boltzmann operator as a normalized exponential weighting over a set of values (Asadi, page 1, col. 2, “
boltz_β(X) = Σ_{i=1}^{n} x_i · e^{β·x_i} / Σ_{i=1}^{n} e^{β·x_i}
”).
[Table 1 of Wang: advantage estimates computed by different methods]
Claim 24: Wang, Haarnoja, and Asadi teach the limitations of claim 23; Wang further teaches the following limitations:
The method of claim 23, wherein the state value for the second observation is computed as:
V^π(s′) = Σ_{j=1}^{M} w_j · Q(s′, a_j)
wherein V^π(s′) is the state value for the second observation, j indexes the sampled actions, M is a number of sampled actions, w_j is the weight factor for sampled action a_j, and Q(s′, a_j) is the Q value for sampled action a_j
(Wang, table 1, Wang’s CRR explicitly uses a linear combination (average) of Q-values over sampled actions aj.)
and each weight factor w_j is computed as:
w_j = exp(Q(s′, a_j)/η) / Σ_{k=1}^{M} exp(Q(s′, a_k)/η)
wherein k indexes the sampled actions and η is the temperature factor. (Wang, page 5, paragraph 3, “We can use this policy instead of π during action selection. To sample from q¯, we use importance sampling. We first sample actions a1:n from πφ(·|s), weight the different actions by their importance weights exp(Qθ(s, ai)/β) and finally choose an action by re-sampling with probabilities
P(a_i) = exp(Qθ(s, a_i)/β) / Σ_j exp(Qθ(s, a_j)/β)
”, Wang explicitly gives the softmax normalization over exp(Q/temperature) (“β”), matching the claim’s weight definition.)
Claim 27: Wang, Haarnoja, and Asadi teach the limitations of claim 23; Haarnoja further teaches the following limitations:
The method of claim 23, wherein determining an update to current values of a set of Q neural network parameters of the Q neural network using the state values for the second observations in the experience tuples comprises: for each experience tuple: processing a first observation in the experience tuple using the Q neural network to generate a Q value for the action in the experience tuple; (Haarnoja, page 5, algorithm 1, “Sample a minibatch from the replay memory {(s_t^(i), a_t^(i), r_t^(i), s_{t+1}^(i))}_{i=0}^{N} ∼ D.”
Page 4, col. 1, section 3.2, paragraph 3, “we can express the soft Q-iteration in an equivalent form as minimizing
J_Q(θ) = E_{s_t∼q_{s_t}, a_t∼q_{a_t}}[ ½·( Q̂^θ̄_soft(s_t, a_t) − Q^θ_soft(s_t, a_t) )² ]
”, Haarnoja’s training is explicitly over minibatched tuples including s_t and a_t; its Q objective J_Q is defined on Q^θ_soft(s_t, a_t), i.e., generating a Q value for the action at that observation.)
and determining a target Q value for the action in the experience tuple using the state value for the second observation in the experience tuple; (Haarnoja, Page 4, col. 1, section 3.2, paragraph 3, “we can express the soft Q-iteration in an equivalent form as minimizing
J_Q(θ) = E_{s_t∼q_{s_t}, a_t∼q_{a_t}}[ ½·( Q̂^θ̄_soft(s_t, a_t) − Q^θ_soft(s_t, a_t) )² ]
where q_{s_t}, q_{a_t} are positive over S and A respectively, Q̂^θ̄_soft(s_t, a_t) = r_t + γ·E_{s_{t+1}∼p_s}[V^θ̄_soft(s_{t+1})] is a target Q-value, with V^θ̄_soft(s_{t+1}) given by (10) and θ being replaced by the target parameters θ̄
”, Haarnoja explicitly defines an error between Qθ(st,at) and a target Q̂. Target Q̂ uses next-state value V(st+1) (the state value for the second observation))
determining a gradient of a Q objective function that, for each experience tuple, measures an error between: (i) the Q value for the action in the experience tuple, and (ii) the target Q value for the action in the experience tuple; (Haarnoja, page 5, algorithm 1, ”Compute empirical soft gradient
∇̂_θ J_Q”
Page 4, col. 1, section 3.2, paragraph 3, “we can express the soft Q-iteration in an equivalent form as minimizing
J_Q(θ) = E_{s_t∼q_{s_t}, a_t∼q_{a_t}}[ ½·( Q̂^θ̄_soft(s_t, a_t) − Q^θ_soft(s_t, a_t) )² ]
, where q_{s_t}, q_{a_t} are positive over S and A respectively, Q̂^θ̄_soft(s_t, a_t) = r_t + γ·E_{s_{t+1}∼p_s}[V^θ̄_soft(s_{t+1})] is a target Q-value, with V^θ̄_soft(s_{t+1}) given by (10) and θ being replaced by the target parameters θ̄.”, Haarnoja’s computed soft gradient measures an error between the Q value (Q^θ_soft) and the target Q value (Q̂^θ̄_soft).)
and determining the update to the current values of the set of Q neural network parameters using the gradient. (Haarnoja, page 5, algorithm 1, [gradient-based parameter update steps of Algorithm 1], the computed gradient of J_Q is used to update the Q neural network parameters.)
Claim 28: Wang, Haarnoja, and Asadi teach the limitations of claim 27; Haarnoja further teaches the following limitations:
The method of claim 27, wherein determining the target Q value for the action in the experience tuple using the state value for the second observation in the experience tuple comprises: determining the target Q value as a sum of: (i) the reward in the experience tuple, and (ii) a product of a discount factor and the state value for the second observation in the experience tuple; and/or wherein the error between: (i) the Q value for the action in the experience tuple, and (ii) the target Q value for the action in the experience tuple, comprises a squared error between: (i) the Q value for the action in the experience tuple, and (ii) the target Q value for the action in the experience tuple. (Haarnoja, Page 4, col. 1, section 3.2, paragraph 3, “we can express the soft Q-iteration in an equivalent form as minimizing
J_Q(θ) = E_{s_t∼q_{s_t}, a_t∼q_{a_t}}[ ½·( Q̂^θ̄_soft(s_t, a_t) − Q^θ_soft(s_t, a_t) )² ]
”, Haarnoja’s error between the Q value and target Q value comprises a squared error between the Q value and the target Q value, matching the claimed “wherein the error between: (i) the Q value for the action in the experience tuple, and (ii) the target Q value for the action in the experience tuple, comprises a squared error between: (i) the Q value for the action in the experience tuple, and (ii) the target Q value for the action in the experience tuple.”)
Claim 30: Wang, Haarnoja, and Asadi teach the limitations of claim 23; Haarnoja further teaches the following limitations:
The method of claim 23, further comprising, at each of the plurality of iterations, determining an update to current values of a set of policy neural network parameters of the policy neural network, comprising: for each experience tuple: processing the first observation in the experience tuple using the Q neural network to generate a Q value for the action in the experience tuple; determining a state value for the first observation in the experience tuple; (Haarnoja, page 4, col. 1, paragraph 1, “
[Haarnoja’s definitions of the soft Q-function Q_soft(s_t, a_t) and the soft value function V_soft(s_t)]”, as used in Algorithm 1, which includes the policy updates above; for the tuples at each iteration t, a Q value (the soft Q-function) and a state value (the soft value function) are computed.)
and determining an advantage value for the experience tuple as a difference between: (i) the Q value for the action in the experience tuple, and (ii) the state value for the first observation in the experience tuple; (Haarnoja, page 4, col. 2, last paragraph, “We denote the induced distribution of the actions as π φ (at|st), and we want to find parameters φ so that the induced distribution approximates the energy-based distribution in terms of the KL divergence
J_π(φ; s_t) = D_KL( π^φ(·|s_t) ∥ exp( (1/α)·( Q^θ_soft(s_t, ·) − V^θ_soft(s_t) ) ) )
”, an advantage value Jπ(φ; st) is computed using the difference between Q value for an experience tuple and the state value for the first observation in the experience tuple)
Wang further teaches:
and determining the update to the current values of the set of policy neural network parameters of the policy neural network based on only the experience tuples having a non-negative advantage value. (Wang, page 4, paragraph 1, “Provided Q is sufficiently accurate for (s, a) ∈ B (e.g. learned using Eq. (1)), we can consider additional choices of f that enable off-policy learning to overcome this problem: f := 1[Aˆ θ(s, a) > 0],”, Wang explicitly defines a filter f as an indicator requiring advantage > 0, which is exactly “only tuples having a non-negative advantage value.”)
Claim 31: Wang, Haarnoja, and Asadi teach the limitations of claim 30; Wang further teaches the following limitations:
The method of claim 30, wherein determining the update to the current values of the set of policy neural network parameters of the policy neural network based on only the experience tuples having a non-negative advantage value comprises: determining a gradient of a policy objective function that depends on only the experience tuples having a non-negative advantage value; and determining the update to the current values of the set of policy neural network parameters using the gradient. (Wang, page 4, paragraph 1, “Update actor (policy) with gradient:
∇_φ (1/m) Σ_{i=1}^{m} log π_φ(a_t^i|s_t^i) · f(Q_θ, π_φ, s_t^i, a_t^i)
”,
Wang, page 4, paragraph 1, “Provided Q is sufficiently accurate for (s, a) ∈ B (e.g. learned using Eq. (1)), we can consider additional choices of f that enable off-policy learning to overcome this problem: f := 1[Aˆ θ(s, a) > 0],”, Wang explicitly uses an actor gradient whose sums are multiplied by f, and f can be the [Â>0] indicator, as disclosed above; thus the objective/gradient depends only on positive-advantage tuples which updates the policy.)
Claim 32: Wang, Haarnoja, and Asadi teach the limitations of claim 31; Wang further teaches the following limitations:
The method of claim 31, wherein for each experience tuple having a non-negative advantage value, the policy objective function depends on an action score for the action in the experience tuple that is generated by processing a first observation in the experience tuple using the policy neural network; in particular wherein the policy objective function is computed as:
L_π(θ) = (1/|B|) · Σ_{(s,a,r)∈B} I[A^π(a,s) ≥ 0] · log π_θ(a|s)
wherein |B| is a number of experience tuples in the batch of experience tuples, each (s, a, r) is an experience tuple in the batch of experience tuples B, wherein s is the first observation, a is the action, and r is the reward, I[·] is an indicator function, A^π(a,s) is an advantage value for the experience tuple, and π_θ(a|s)
is the action score for the action in the experience tuple that is generated by processing the first observation in the experience tuple using the policy network. (Wang, page 4, paragraph 1, “Update actor (policy) with gradient:
∇_φ (1/m) Σ_{i=1}^{m} log π_φ(a_t^i|s_t^i) · f(Q_θ, π_φ, s_t^i, a_t^i)
”,
Wang, page 4, paragraph 1, “Provided Q is sufficiently accurate for (s, a) ∈ B (e.g. learned using Eq. (1)), we can consider additional choices of f that enable off-policy learning to overcome this problem: f := 1[Aˆ θ(s, a) > 0],”, Wang uses an actor gradient whose terms closely track those claimed; f can be the 1[Â(a,s) > 0] indicator, which matches the claimed indicator function.)
Claim 33: Wang, Haarnoja, and Asadi teach the limitations of claim 23; Haarnoja further teaches the following limitations:
The method of claim 23, further comprising, at each of one or more of the plurality of iterations: generating a plurality of new experience tuples using the action selection system, an expert action selection policy, or both; and adding the new experience tuples to the replay buffer; (Haarnoja, page 5, section 3.4, “The algorithm proceeds by alternating between collecting new experience from the environment, and updating the soft Q-function and sampling network parameters. The experience is stored in a replay memory buffer D as standard in deep Q-learning (Mnih et al., 2013)”
Haarnoja, page 5, algorithm 1,
[experience collection and replay-memory steps of Algorithm 1 of Haarnoja]
Haarnoja explicitly has a replay memory D, stores tuples (st, at, r(·), st+1) (matching observation/action/reward/next-observation), and samples a minibatch each iteration.)
wherein generating a plurality of new experience tuples comprises, at each of one or more time steps: receiving a current observation for the time step; selecting an action to be performed by the agent at the time step using the action selection system or the expert action selection policy; receiving a next observation and a reward resulting from the agent performing the selected action; and generating a new experience tuple comprising the current observation, the selected action, the next observation, and the reward; (Haarnoja, page 5, algorithm 1,
[experience collection steps of Algorithm 1 of Haarnoja]
, as shown in the “collect experience” steps for an iteration t of Algorithm 1, an action is sampled to be performed by the agent at the time step, a current observation s_t and a next observation s_t+1 are received along with a reward r, and a new experience tuple comprising all of the aforementioned terms is generated.)
in particular wherein selecting the action to be performed by the agent at the time step using the action selection system or the expert action selection policy comprises stochastically selecting between using the action selection system or the expert action selection policy to select the action to be performed by the agent at the time step. (Haarnoja, page 6, section 5.1, “The stochastic policy samples actions closely following the energy landscape, hence learning diverse trajectories that lead to all four goals”, the action selection system (action sampling) is based on a stochastic policy.)
Claim 34: Wang, Haarnoja, and Asadi teach the limitations of claim 33; Wang further teaches the following limitations:
The method of claim 33, wherein selecting an action to be performed by the agent at a time step using the action selection system comprises: processing the current observation for the time step using the policy neural network to generate a respective action score for each action in the set of possible actions; (Wang, page 4, paragraph 1, “for n_updates do… Update actor (policy) with gradient:
∇_φ (1/m) Σ_{i=1}^{m} log π_φ(a_t^i|s_t^i) · f(Q_θ, π_φ, s_t^i, a_t^i)
”, Wang uses an actor gradient computed over each n_update (iteration over actions). It is interpreted that log π_φ(a_t^i|s_t^i) corresponds to the respective action score for each action)
processing the current observation for the time step using the Q neural network to generate a respective Q value for each action in the set of possible actions; (Wang, page 4, paragraph 1, “for n_updates do… Update actor (policy) with gradient:
∇_φ (1/m) Σ_{i=1}^{m} log π_φ(a_t^i|s_t^i) · f(Q_θ, π_φ, s_t^i, a_t^i)
”, Wang uses an actor gradient computed over each n_update (iteration over actions). It is interpreted that Q_θ corresponds to the respective Q value for each action)
determining a final action score for each action based on: (i) the action score for the action, and (ii) the Q value for the action; (Wang, page 4, paragraph 1, “for n_updates do… Update actor (policy) with gradient:
∇_φ (1/m) Σ_{i=1}^{m} log π_φ(a_t^i|s_t^i) · f(Q_θ, π_φ, s_t^i, a_t^i)
”, Wang uses an actor gradient computed over each n_update (iteration over actions). It is interpreted that the updated policy value is the final action score.)
and selecting the action to be performed by the agent in accordance with the final action scores. (Wang, page 5, paragraph 3, “We can use this policy instead of π during action selection.”, explicit support for showing how the policy (whose update procedure was disclosed above), interpreted as the final action score, is used to select an action performed by the agent.)
Claim 35: Wang, Haarnoja, and Asadi teach the limitations of claim 34; Wang further teaches the following limitations:
The method of claim 34, wherein the final action score for an action is computed as:
π(a|s) · exp(Q(s,a)/η)
wherein π(a|s) is the action score for the action, Q(s,a) is the Q value for the action, and η is a temperature parameter (Wang, page 5, paragraph 3, “The solution is given by… which yields
q̄(a|s) ∝ π_φ(a|s) · exp(Qθ(s, a)/β)
”, the final action score computed in Wang is identical in form to the action score defined in the claim.)
Claims 36 and 37 recite limitations substantially similar to those of claim 23; as such, a similar analysis applies.
Claim 36 recites additional limitations for consideration:
A system comprising: one or more computers; and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform (Haarnoja, page 7, figure 2, “Figure 2. Simulated robots used in our experiments.”, page 5, section 3.4, “The experience is stored in a replay memory buffer D as standard in deep Q-learning (Mnih et al., 2013), and the parameters are updated using random minibatches from this memory.”, a storage device (memory) is used by a computer (the simulated robot) to execute the instructions (Algorithm 1))
Claim 37 recites additional limitations for consideration:
One or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform (Haarnoja, page 7, figure 2, “Figure 2. Simulated robots used in our experiments.”, page 5, section 3.4, “The experience is stored in a replay memory buffer D as standard in deep Q-learning (Mnih et al., 2013), and the parameters are updated using random minibatches from this memory.”, a storage medium (memory) is used by a computer (the simulated robot) to execute the instructions (algorithm 1))
Claim 38 recites limitations substantially similar to those of claim 24; as such, a similar analysis applies.
Claims 41-42 recite limitations substantially similar to those of claims 27-28; as such, a similar analysis applies.
Claims 25-26 and 39 are rejected under 35 U.S.C. 103 as being unpatentable over Wang in view of Haarnoja and Asadi, as applied to the claims above, and further in view of Peters et al. (Peters, J., Mulling, K., & Altun, Y. (2010, July). Relative entropy policy search. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 24, No. 1, pp. 1607-1612)), hereafter referred to as Peters.
[table image: media_image22.png — Table 1 of Peters, Relative Entropy Policy Search]
Claim 25: Wang, Haarnoja, and Asadi teach the limitations of claim 23 but do not teach the following limitations, which Peters teaches:
The method of claim 23, wherein determining the temperature factor based on the Q values for the sampled actions comprises, at each of one or more optimization iterations: determining a gradient of a dual function with respect to the temperature factor, wherein the dual function depends on: (i) the temperature factor, and (ii) the Q values for the sampled actions; (Peters, page 1609, table 1, Peters' REPS (Relative Entropy Policy Search) explicitly computes ∂η g, the gradient/derivative of the dual function g with respect to η, matching the claimed “gradient … with respect to the temperature factor,” where η is interpreted as the temperature factor.)
adjusting a current value of the temperature factor using the gradient of the dual function with respect to the temperature factor. (Peters, page 1609, table 1, REPS explicitly optimizes η using an optimizer that takes ∂g (which includes ∂η g); this is an explicit adjusting of a current value of the temperature factor using the gradient/derivative.)
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to incorporate Peters' dual-function optimization for a temperature-like parameter into the Wang, Haarnoja, and Asadi combination because the combined system uses normalized exponential weighting controlled by a temperature-like parameter (β, η, or its equivalent), and Peters provides an explicit method for computing and optimizing such a parameter through a dual function and its derivative. A motivation would have been to determine or adjust the temperature-like parameter used in the normalized exponential weighting by optimizing a dual function with respect to that parameter. Peters supports this motivation by expressly teaching computing the dual function, computing its derivative with respect to η, and optimizing using both: “Compute Dual Function: g(θ, η) … Compute the Dual Function’s Derivative … ∂η g … Optimize: (θ*, η*) = fmin BFGS(g, ∂g, [θ0, η0]).” (Peters, page 1609, table 1)
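For illustration, a minimal Python sketch of this style of temperature optimization, assuming an MPO/REPS-style dual g(η) = ηε + η·log mean exp(Q/η) and a finite-difference gradient in place of the analytic derivative Peters computes; all names are illustrative and none is drawn from Peters or the claims:

    import numpy as np

    def dual(eta, q_samples, epsilon):
        # Assumed dual form: g(eta) = eta*epsilon + eta * log-mean-exp(Q / eta).
        return eta * epsilon + eta * np.log(np.mean(np.exp(q_samples / eta)))

    def optimize_temperature(q_samples, epsilon=0.1, eta=1.0, lr=0.01, steps=100):
        for _ in range(steps):
            # Finite-difference gradient of the dual with respect to eta.
            h = 1e-5
            grad = (dual(eta + h, q_samples, epsilon) - dual(eta - h, q_samples, epsilon)) / (2 * h)
            eta = max(eta - lr * grad, 1e-6)   # gradient step; keep eta positive
        return eta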
Claim 26: Wang, Haarnoja, Asadi, and Peters teach the limitations of claim 25; Peters further teaches the following limitations:
The method of claim 25, wherein the dual function is computed as: [claimed equation image: media_image4.png] wherein g(n_s) is the dual function evaluated for temperature factor n_s, |B| denotes a number of experience tuples in the batch of experience tuples, ϵ is a regularization parameter, j indexes the sampled actions, M is a number of sampled actions, and [equation image: media_image5.png] is the Q value for sampled action a_j. (Peters, page 1609, table 1, “Compute a Dual Function”, Peters' REPS explicitly defines a dual function which matches the claimed η + log-sum-exp / log-mean-exp structure, although the exponent uses δθ(s_i, a_i) (a Bellman error sample) rather than Q values verbatim.)
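Because the claimed expression survives only as an image, the following LaTeX reconstruction is offered under the assumption of the standard log-mean-exp dual consistent with the symbols defined above; it is not a verbatim quotation of the claim:

    g(n_s) = n_s\,\epsilon + n_s \frac{1}{|B|} \sum_{s' \in B}
             \log\left( \frac{1}{M} \sum_{j=1}^{M}
             \exp\left( \frac{Q(s', a_j)}{n_s} \right) \right)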
Claim 39 recites limitations substantially similar to those of claim 25; as such, a similar analysis applies.
Claim 29 is rejected under 35 U.S.C. 103 as being unpatentable over Wang in view of Haarnoja and Asadi, as applied to the claims above, and further in view of Lillicrap et al. (Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., ... & Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.), hereafter referred to as Lillicrap.
Claim 29: Wang, Haarnoja, and Asadi teach the limitations of claim 28; Lillicrap, in the same field of reinforcement learning, teaches the following limitations, which Wang, Haarnoja, and Asadi fail to teach:
The method of claim 28, wherein the Q objective function is computed as: [claimed equation image: media_image6.png] wherein |B| is a number of experience tuples in the batch of experience tuples, each (s, a, r, s') is an experience tuple in the batch of experience tuples B, wherein s is the first observation, a is the action, r is the reward, and s' is the second observation, γ is a discount factor, V^π(s') is the state value for the second observation in the experience tuple, and Qϕ(a,s) is the Q value for the action in the experience tuple. (Lillicrap, page 5, algorithm 1, [equation image: media_image23.png], Lillicrap samples a random minibatch of N transitions (s_i, a_i, r_i, s_{i+1}) from replay buffer R, which corresponds to the claim's batch B of experience tuples (s, a, r, s'). Lillicrap defines the target y_i as shown in Eq. (1), y_i = r_i + γ Q'(s_{i+1}, μ'(s_{i+1})), which matches the claim's target form r + γ V^π(s') because both include a reward term and a discounted next-state value term evaluated at the next state. Lillicrap then updates the critic by minimizing the loss in Eq. (2), L = (1/N) Σ_{i=1}^{N} (y_i − Q(s_i, a_i | θ^Q))², which corresponds to the claim's batch-averaged squared error expression [claimed equation image: media_image6.png] because both compute a squared difference between a target value and a Q value over a batch of sampled experience tuples.)
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to incorporate Lillicrap’s explicit batch-averaged critic loss form into the Wang, Haarnoja, and Asadi system because the combined system trains a critic/Q function using batches of sampled transitions, and Lillicrap provides an explicit formulation that computes the critic update by minimizing a batch-averaged squared loss over a sampled minibatch of transitions. A motivation would have been to express and implement the critic/Q training objective as an explicit minibatch-average loss computed over sampled transitions. Lillicrap supports this motivation by expressly teaching sampling a minibatch of transitions and minimizing a loss written with a 1/N summation: “Sample a random minibatch of N transitions … Update critic by minimizing the loss: L = 1/N ∑ (y_i − Q(s_i, a_i|θQ))^2.” (Lillicrap, page 5, algorithm 1)
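For completeness, a minimal NumPy sketch of the batch-averaged squared Q objective discussed above, with the state values V^π(s') supplied as an array (their computation, e.g. via a target network as in Lillicrap, is abstracted away; all names are illustrative):

    import numpy as np

    def q_objective(q_values, rewards, next_state_values, gamma=0.99):
        # Target for each tuple: r + gamma * V(s'), per the claimed target form.
        targets = rewards + gamma * next_state_values
        # Batch-averaged squared error between targets and Q values,
        # mirroring Lillicrap's L = (1/N) * sum_i (y_i - Q(s_i, a_i))^2.
        return np.mean((targets - q_values) ** 2)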
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
Abdolmaleki, A., Springenberg, J. T., Tassa, Y., Munos, R., Heess, N., & Riedmiller, M. (2018). Maximum a posteriori policy optimisation. arXiv preprint arXiv:1806.06920.
Peng, X. B., Kumar, A., Zhang, G., & Levine, S. (2019). Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177.
Ross, S., Gordon, G., & Bagnell, D. (2011, June). A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics (pp. 627-635). JMLR Workshop and Conference Proceedings.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to HYUNGJUN B YI whose telephone number is (703)756-4799. The examiner can normally be reached M-F 9-5.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Usmaan Saeed can be reached at (571) 272-4046. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/H.B.Y./Examiner, Art Unit 2146
/USMAAN SAEED/Supervisory Patent Examiner, Art Unit 2146