DETAILED ACTION
This Office Action is sent in response to the Applicant’s Communication received on 10/16/2025 for application number 17/843,288. The Office hereby acknowledges receipt of the following items, which have been placed of record in the file: Specification, Drawings, Abstract, Oath/Declaration, IDS, and Claims.
Claims 1, 12, and 13 are amended.
Claims 10 and 11 are canceled.
Claims 1-9 and 12-20 are pending.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
35 USC 103
On page 11 of the remarks section, the Applicant argues that Kalashnikov's use of a "scripted policy for the first 120K gradient update steps, then switching to" a "noisy policy" is a single "switch[]" after "120K gradient update steps". In contrast, amended claim 1 sets forth "selecting ... a selected exploration strategy" "for each step of multiple steps of the robotic episode" and that "performing each of the robotic episodes" includes such "selecting". Put another way, the single "switch[]" after "120K ... steps" set forth in Kalashnikov fails to render obvious "selecting ... a selected exploration strategy" "for each step of multiple steps of the robotic episode".
Examiner respectfully disagrees. Applicant’s argument is not persuasive because the broadest reasonable interpretation (BRI) is broader than what is argued. Under BRI, Kalashnikov does indeed teach "selecting ... a selected exploration strategy" "for each step of multiple steps of the robotic episode" in Appx. C, pg. 16: “In these experiments, we use 60 simulated robots, using πscripted policy for the first 120K gradient update steps, then switching to the πnoisy policy with ε = 0.2. These exploration policies are explained in Appendix B. The grasp performance is evaluated continuously and concurrently as training proceeds by running the πeval policy on 100 separate simulated robots and aggregating 700 grasps per policy for each model checkpoint.” To further clarify, each of the plurality of “simulated robots” – 60 to be exact (“each step of the multiple steps of the robotic episode”) – performing a switch ("selecting ... a selected exploration strategy") meets the claimed limitations.
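For illustration only, and not as code from the cited references or the claims, the following minimal sketch shows per-step selection between a scripted policy and an epsilon-greedy "noisy" policy, using the switch criterion quoted above (scripted for the first 120K gradient update steps, then πnoisy with ε = 0.2). The helper functions scripted_action and noisy_action are hypothetical placeholders for πscripted and πnoisy.

import random

EPSILON = 0.2
SWITCH_STEP = 120_000

def scripted_action(state):
    # Placeholder for pi_scripted (random (x, y) above the table, descend, grasp, ascend).
    return ("scripted", random.random(), random.random())

def noisy_action(state, greedy_action):
    # Placeholder for pi_noisy: epsilon-greedy around the greedy (Q-maximizing) action.
    if random.random() < EPSILON:
        return ("random", random.random(), random.random())
    return greedy_action

def select_exploration_strategy(gradient_update_steps):
    # Selection made at each step of the episode; the criterion here is the
    # global training progress, as in the cited passage.
    return "scripted" if gradient_update_steps < SWITCH_STEP else "noisy"

def run_episode(gradient_update_steps, episode_len=10):
    state = None
    for _ in range(episode_len):
        strategy = select_exploration_strategy(gradient_update_steps)  # per-step selection
        if strategy == "scripted":
            action = scripted_action(state)
        else:
            action = noisy_action(state, greedy_action=("greedy", 0.0, 0.0))
        # ... apply action, observe next state/reward (omitted) ...

run_episode(gradient_update_steps=150_000)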
On page 11, the Applicant further argues that the "scripted" policy in cited aspects of Kalashnikov is not, like the "robotic episodes" of claim 1, "performed based on the actor network and/or the critic network". Rather, it is described as "randomly choosing an (x, y) coordinate above the table, lowering the open gripper to table level in a few random descent steps, closing the gripper, then returning to the original height in a few ascent steps".
Examiner respectfully disagrees. The Office Action did not cite Kalashnikov as teaching the claimed limitations related to robotic episodes “performed based on the actor network and/or the critic network”. Rather, the analogous reference Vecerik was brought in as an obvious combination to teach the limitations of robotic episodes being performed based on the actor and/or critic network. Specifically, Vecerik teaches in the abstract: “We present results of simulation experiments on a set of robot insertion problems involving rigid and flexible objects” (performing robotic episodes); and in Sect 2, para 1: “Deep Deterministic Policy Gradient (DDPG) [7] is an actor-critic algorithm which directly uses the gradient of the Q-function w.r.t. the action to train the policy. DDPG maintains a parameterized policy network π(.|θπ) (actor function) and a parameterized action-value function network (critic function) Q(.|θQ). It produces new transitions”. In response to applicant's arguments against the references individually, one cannot show nonobviousness by attacking references individually where the rejections are based on combinations of references. See In re Keller, 642 F.2d 413, 208 USPQ 871 (CCPA 1981); In re Merck & Co., 800 F.2d 1091, 231 USPQ 375 (Fed. Cir. 1986). Therefore, the rejection of claim 1 is maintained.
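For illustration only, a minimal sketch of the DDPG-style actor-critic relationship Vecerik describes, in which the policy (actor) is trained using the gradient of the critic's Q-function with respect to the action. The linear actor and quadratic critic used here are toy placeholders, not the cited implementation.

import numpy as np

theta_pi = np.array([0.5])          # actor parameters: a = theta_pi * s
w_q = np.array([1.0, -0.5])         # toy critic parameters: Q(s, a) = w_q[0]*s*a + w_q[1]*a**2
LR = 0.1

def actor(s):
    return theta_pi[0] * s

def dq_da(s, a):
    # Gradient of the critic's Q with respect to the action, used to train the policy.
    return w_q[0] * s + 2.0 * w_q[1] * a

s = 1.0
a = actor(s)
# Deterministic policy gradient step: dQ/da * da/dtheta_pi (here da/dtheta_pi = s).
theta_pi = theta_pi + LR * dq_da(s, a) * s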
Applicant presents arguments for claim 18 that are similar to those presented for claim 1; the rejection of claim 18 is therefore maintained for similar reasons.
Applicant further argues that paragraph [0004] of Zhang discloses "at a critic neural network, receiving the current state of the environment and the continuous action output by each respective actor neural network and outputting a state-action value" but the paragraph fails to disclose “processing the state data and each of multiple candidate actions sampled using CEM, using the critic network, to generate a corresponding candidate action measure for each of the candidate actions” as set forth in claim 20. It follows that cited para. [0004] of Zhang(3) fails to disclose "determining, from amongst the actor action measure and the corresponding candidate action measures, a maximum measure" as it fails to disclose "each of multiple candidate actions sampled using CEM". Further, although cited aspects of Kalashnikov are relied upon for "CEM", they are not relied upon as allegedly disclosing "determining, from amongst the actor action measure and the corresponding candidate action measures, a maximum measure".
Examiner respectfully disagrees. The Office Action does not rely on Zhang(3) alone to teach the claimed limitation “processing the state data and each of multiple candidate actions sampled using CEM, using the critic network, to generate a corresponding candidate action measure for each of the candidate actions”. Similarly, the Office Action does not rely on Kalashnikov alone to teach the claimed limitation "determining, from amongst the actor action measure and the corresponding candidate action measures, a maximum measure". Specifically, Zhang(3) teaches “processing state data of the instance, using the actor network, to generate actor network output” in paragraph 0004: “at each actor neural network among a plurality of actor neural networks, receiving a current state of the environment for a time step and outputting a continuous action for the current state based on a deterministic policy approximated by the actor neural network, thereby outputting a plurality of continuous actions”. Kalashnikov teaches “sampling using CEM” in the abstract: “QT-Opt… can leverage over 580k real-world grasp attempts to train a deep neural network”. Specifically, to further clarify, Kalashnikov teaches in Sect 4.2, para 2: “CEM is a simple derivative-free optimization algorithm that samples a batch of N values at each iteration, fits a Gaussian distribution to the best M < N of these samples, and then samples the next batch of N from”. Zhang(3) teaches “determining, from amongst the actor action measure and the corresponding candidate action measures, a maximum measure” in paragraph 0004: “selecting (determining), at an action selector (from amongst the actor action measure), from among the plurality of continuous actions, a continuous action, wherein the selected continuous action is associated with a state-action (corresponding candidate action measures) value that is maximum (a maximum measure) among the plurality of state-action values”. In response to applicant's arguments against the references individually, one cannot show nonobviousness by attacking references individually where the rejections are based on combinations of references. See In re Keller, 642 F.2d 413, 208 USPQ 871 (CCPA 1981); In re Merck & Co., 800 F.2d 1091, 231 USPQ 375 (Fed. Cir. 1986). Therefore, the rejection of claim 20 is maintained.
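For illustration only, a minimal sketch of the CEM procedure summarized in the Kalashnikov passage quoted above (sample a batch of N candidates, keep the best M < N under the critic's value, refit a Gaussian, repeat), followed by selecting a maximum measure from amongst an actor-proposed action and the CEM-sampled candidates. The critic_q function, the actor_action values, and the action space are hypothetical toy stand-ins.

import numpy as np

def critic_q(state, action):
    # Toy critic Q(s, a) with its optimum at action = 0.3 in each dimension.
    return -np.sum((action - 0.3) ** 2, axis=-1)

def cem_candidates(state, action_dim=2, n=64, m=6, iters=3, seed=0):
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(action_dim), np.ones(action_dim)
    for _ in range(iters):
        samples = rng.normal(mean, std, size=(n, action_dim))   # sample a batch of N values
        scores = critic_q(state, samples)                       # candidate action measures
        elite = samples[np.argsort(scores)[-m:]]                # fit to the best M < N samples
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return samples

state = None
actor_action = np.array([0.1, 0.1])                             # action from the actor network
candidates = cem_candidates(state)
all_actions = np.vstack([actor_action, candidates])
measures = critic_q(state, all_actions)
best_action = all_actions[np.argmax(measures)]                  # the maximum measure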
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claim(s) 1-9, 12-14 and 17-19 are rejected under 35 U.S.C. 103 as being unpatentable over Zhang (CN 112668235 A, see attached translation), hereinafter Zhang, in view of Haarnoja et al. (Soft Actor-Critic Algorithms and Applications, published 2019), hereinafter Haarnoja, Nair et al. (AWAC: Accelerating Online Reinforcement Learning with Offline Datasets, published June 16, 2020), hereinafter Nair, Zhang et al. (Pretraining Deep Actor-Critic Reinforcement Learning Algorithms With Expert Demonstrations, published 2018), hereinafter Zhang(2), Kalashnikov et al. (QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation, published 2018), hereinafter Kalashnikov, Vecerik et al. (Leveraging Demonstrations for Deep Reinforcement Learning on Robotics Problems with Sparse Rewards, published 2018), hereinafter Vecerik, and Yan et al. (Learning Probabilistic Multi-Modal Actor Models for Vision-Based Robotic Grasping, published 2019), hereinafter Yan.
Regarding claim 1, Zhang teaches,
A method implemented by one or more processors [Fig. 1], the method comprising: pre-training (Para 0133, pre-trains) an actor network and a critic network (Para 0133, the DDPG network offline; Para 0078, Generally speaking, the DDPG network application is based on the Actor-Critic method) using reinforcement learning (Para 0133, reinforcement learning) and offline (Para 0070, offline environment) robotic demonstration data (Para 0071, number of training rounds) from demonstrated robotic episodes (Para 0071, the 2D dummy, reaching the end point from the starting point) [Para 0070, Step 1: Collect the training data of 2D dummies in an offline environment and preprocess the training data to obtain a training data set; Para 0071, The DDPG algorithm and the improved DDPG algorithm based on the offline model were used to train for 4000 rounds respectively, and the relationship between the feedback reward value of the robot, i.e. the 2D dummy, reaching the end point from the starting point and the number of training rounds was analyzed; Para 0133, The present invention firstly utilizes a large amount of offline data to train the object state model and reward model, and then pre-trains the DDPG network offline through a model-based reinforcement learning method to improve the decision-making ability of the network offline, thereby accelerating the subsequent online learning efficiency and performance; Para 0078, Generally speaking, the DDPG network application is based on the Actor-Critic method, so it has a policy neural network and a value-based neural network. It includes a policy network for generating actions and a value network for judging the quality of actions]
Zhang teaches the limitations of claim 1, but does not teach wherein the actor network is a first neural network model that represents a policy, wherein the critic network is a second neural network model that represents a Q-function, and wherein pre-training the actor network and the critic network comprises: pre-training the actor network using an advantage-weighted regression training objective, the advantage-weighted regression training objective utilizing values generated using the second neural network model, pre-training the critic network based on the robotic demonstration data, using Q-learning and a cross-entropy method (CEM), subsequent to pre-training the actor network and pre-training the critic network: performing robotic episodes, each performed based on the actor network and/or the critic network, wherein performing each of the robotic episodes comprises: for each step of multiple steps of the robotic episode: selecting, from at least a first exploration strategy and a second exploration strategy, a selected exploration strategy for the step; and determining a robotic action to perform, for the step, according to the selected exploration strategy; further training the actor network and the critic network using reinforcement learning and online episode data from robotic episodes each performed based on the actor network and/or the critic network, wherein further training the actor network and the critic network comprises: further training the actor network based on a first set of the episode data, and using the advantage-weighted regression training objective, and further training the critic network based on a second set of the episode data, and using Q-learning and the CEM, wherein the second set includes a given quantity of unsuccessful episode data, that is from unsuccessful episodes of the robotic episodes, and wherein the given quantity is greater than an alternate quantity, of the unsuccessful episode data, that is included in the first set.
Haarnoja teaches,
wherein the actor network is a first neural network model (Sect 4.1, para 1, neural networks) that represents (Sect 4.1, para 1, given by) a policy (Sect 4.1, para 1, the policy) [Sect 4.1, para 1, the policy as a Gaussian with mean and covariance given by neural networks],
wherein the critic network is a second neural network model that represents a Q-function [Sect 4.1, para 1, the soft Q-function can be modeled as expressive neural networks],
Haarnoja is analogous to the claimed invention as they both relate to Reinforcement Learning methodologies. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang’s teachings to incorporate the teachings of Haarnoja and provide the actor network representing a policy and the critic network representing a Q-function in order to attain optimal results from the combined methodologies of policy usage and exploration.
Zhang-Haarnoja teach the limitations of claim 1, but do not teach wherein pre-training the actor network and the critic network comprises: pre-training the actor network using an advantage-weighted regression training objective, the advantage-weighted regression training objective utilizing values generated using the second neural network model, pre-training the critic network based on the robotic demonstration data, using Q-learning and a cross-entropy method (CEM), subsequent to pre-training the actor network and pre-training the critic network: further training the actor network and the critic network using reinforcement learning and online episode data from robotic episodes each performed based on the actor network and/or the critic network, wherein further training the actor network and the critic network comprises: performing robotic episodes, each performed based on the actor network and/or the critic network, wherein performing each of the robotic episodes comprises: for each step of multiple steps of the robotic episode: selecting, from at least a first exploration strategy and a second exploration strategy, a selected exploration strategy for the step: and determining a robotic action to perform, for the step, according to the selected exploration strategy: further training the actor network based on a first set of the episode data, and using the advantage-weighted regression training objective, and further training the critic network based on a second set of the episode data, and using Q-learning and the CEM, wherein the second set includes a given quantity of unsuccessful episode data, that is from unsuccessful episodes of the robotic episodes, and wherein the given quantity is greater than an alternate quantity, of the unsuccessful episode data, that is included in the first set.
Nair teaches,
and wherein pre-training the actor network and the critic network comprises: pre-training (Sect I, pg. 2, col 1, para 2, pre-training) the actor network (Sect IV, pg. 5, col 2, para 1, we can parameterize the actor… by neural networks) using an advantage-weighted regression training objective (Sect I, pg. 2, col 1, para 2, AWAC), the advantage-weighted regression training objective utilizing (Sect II, col 2, para 2, updated based on) values (Sect II, col 2, para 2, current estimate of Qπ) generated (Sect II, col 2, para 2, estimated) using the second neural network model (Sect II, col 2, para 2, the critic Qπ (s, a)) [Sect I, pg. 1, col 2, para 4, In this work, we study how to build RL algorithms that are effective for pre-training from off-policy datasets; Sect I, pg. 2, col 1, para 2, The contribution of this work is not just another RL algorithm, but a systematic study of what makes offline pre-training with online fine-tuning unique compared to the standard RL paradigm, which then directly motivates a simple algorithm, AWAC, to address these challenges; Sect II, col 2, para 2, During the policy evaluation phase, the critic Qπ (s, a) is estimated for the current policy π… During policy improvement, the actor π is typically updated based on the current estimate of Qπ; Sect IV, pg. 5, col 2, para 1, we can parameterize the actor and the critic by neural networks],
Nair is analogous to the claimed invention as they both relate to Reinforcement Learning methodologies. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang and Haarnoja’s teachings to incorporate the teachings of Nair and provide an advantage-weighted regression training objective [Nair, Abstract] in order to enable rapid learning of skills with a combination of prior demonstration data and online experience.
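For illustration only, a minimal sketch of an advantage-weighted regression style actor objective of the kind Nair's AWAC describes, in which the actor's log-likelihood on stored (state, action) pairs is weighted by exponentiated advantages estimated from the critic's Q values. The numeric batch values and the temperature LAMBDA are hypothetical, and the critic-derived quantities are passed in as plain arrays rather than computed by a network.

import numpy as np

LAMBDA = 1.0

def advantage_weighted_actor_loss(log_prob, q_value, value):
    # Advantage A(s, a) = Q(s, a) - V(s), with Q and V estimated by the (second) critic model.
    advantage = q_value - value
    weights = np.exp(advantage / LAMBDA)
    # Minimizing this loss maximizes the advantage-weighted log-likelihood of the actions.
    return -np.mean(weights * log_prob)

# Toy batch: actor log-probabilities of the stored actions and critic-derived estimates.
log_prob = np.array([-1.2, -0.4, -2.0])
q_value = np.array([1.0, 0.5, -0.3])
value = np.array([0.6, 0.6, 0.6])
loss = advantage_weighted_actor_loss(log_prob, q_value, value)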
Zhang-Haarnoja-Nair teach the limitations of claim 1 but do not teach pre-training the critic network based on the robotic demonstration data, using Q-learning and a cross-entropy method (CEM), subsequent to pre-training the actor network and pre-training the critic network: performing robotic episodes, each performed based on the actor network and/or the critic network, wherein performing each of the robotic episodes comprises: for each step of multiple steps of the robotic episode: selecting, from at least a first exploration strategy and a second exploration strategy, a selected exploration strategy for the step; and determining a robotic action to perform, for the step, according to the selected exploration strategy; further training the actor network and the critic network using reinforcement learning and online episode data from robotic episodes each performed based on the actor network and/or the critic network, wherein further training the actor network and the critic network comprises: further training the actor network based on a first set of the episode data, and using the advantage-weighted regression training objective, and further training the critic network based on a second set of the episode data, and using Q-learning and the CEM, wherein the second set includes a given quantity of unsuccessful episode data, that is from unsuccessful episodes of the robotic episodes, and wherein the given quantity is greater than an alternate quantity, of the unsuccessful episode data, that is included in the first set.
Zhang(2) teaches,
pre-training (Sect 4, pg. 4, para 6, To pretrain) the critic network (Sect 4, pg. 4, para 6, actor-critic RL algorithms like DDPG and ACER) based on the robotic demonstration data (Sect 4, pg. 4, col 1, para 2, HalfCheetah (6D), Hopper (3D), and Walker2d (6D)) [Sect 4, pg. 4, col 1, para 2, Therefore if there exist some expert demonstrations that perform better than initial policies, we can introduce the data using constraint (3), in order to obtain a more accurate estimator Qw(s, a); Sect 5, para 4, With DDPG as baseline, we apply our algorithm to low dimensional simulation environments using the MuJoCo physics engine [Todorov et al., 2012], and test on tasks with action dimensionality are: HalfCheetah (6D), Hopper (3D), and Walker2d (6D); Sect 4, pg. 4, para 6, To pretrain actor-critic RL algorithms like DDPG and ACER, we add gradients g*_Q and g*_π to the original gradients of the algorithms; Sect 4.1, para 2, Two neural networks are used in DDPG at the same time. One is named critic network]
Zhang(2) is analogous to the claimed invention as they both relate to Reinforcement Learning methodologies. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang, Haarnoja, and Nair’s teachings to incorporate the teachings of Zhang(2) and provide pre-training the critic network based on demonstration data [Zhang(2), Abstract] in order to speed up the training process.
Zhang-Haarnoja-Nair-Zhang(2) teach the limitations of claim 1 but do not teach using Q-learning and a cross-entropy method (CEM), subsequent to pre-training the actor network and pre-training the critic network: performing robotic episodes, each performed based on the actor network and/or the critic network, wherein performing each of the robotic episodes comprises: for each step of multiple steps of the robotic episode: selecting, from at least a first exploration strategy and a second exploration strategy, a selected exploration strategy for the step: and determining a robotic action to perform, for the step, according to the selected exploration strategy: further training the actor network and the critic network using reinforcement learning and online episode data from robotic episodes each performed based on the actor network and/or the critic network, wherein further training the actor network and the critic network comprises: further training the actor network based on a first set of the episode data, and using the advantage-weighted regression training objective, and further training the critic network based on a second set of the episode data, and using Q-learning and the CEM, wherein the second set includes a given quantity of unsuccessful episode data, that is from unsuccessful episodes of the robotic episodes, and wherein the given quantity is greater than an alternate quantity, of the unsuccessful episode data, that is included in the first set.
Kalashnikov teaches,
for each step of multiple steps of the robotic episode (Appx. C, pg. 16, grasps per policy for each model checkpoint) [Appx. C, pg. 16, In these experiments, we use 60 simulated robots, using πscripted policy for the first 120K gradient update steps, then switching to the πnoisy policy with ε = 0.2. These exploration policies are explained in Appendix B. The grasp performance is evaluated continuously and concurrently as training proceeds by running the πeval policy on 100 separate simulated robots and aggregating 700 grasps per policy for each model checkpoint.]: selecting [Appx. B, para 2-4, During the early stages of training… we collect our initial data for training using a scripted policy πscripted… During the later stages of training, we switch to data collection with πnoisy], from at least a first exploration strategy and a second exploration strategy, a selected exploration strategy [Appx. B, para 1, For data collection, we used two different exploration policies πscripted, πnoisy at different stages of training] for the step;
determining a robotic action to perform, for the step, according to the selected exploration strategy [Appx. B, The πscripted simplifies the multi-step exploration of the problem by randomly choosing an (x, y) coordinate above the table, lowering the open gripper to table level in a few random descent steps, closing the gripper, then returning to the original height in a few ascent steps]
using Q-learning (Sect 1, pg. 2, para 2, Q-learning) and a cross-entropy method (CEM) (Sect 1, pg. 2, para 2, QT-Opt), [Sect 1, pg. 2, para 2, To make maximal use of this diverse dataset, we propose an off-policy training method based on a continuous-action generalization of Q-learning, which we call QT-Opt… QT-Opt dispenses with the need to train an explicit actor; Abstract, we introduce QT-Opt, a scalable self-supervised vision-based reinforcement learning framework that can leverage over 580k real-world grasp attempts to train a deep neural network]
Kalashnikov is analogous to the claimed invention as they both relate to Reinforcement Learning methodologies. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang, Haarnoja, Nair, and Zhang(2)’s teachings to incorporate the teachings of Kalashnikov and provide using Q-learning and a CEM [Kalashnikov, sect 1, pg. 2, para 2] in order to improve instability issues.
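For illustration only, a minimal sketch of a Q-learning critic target in which the maximization over next actions is approximated with CEM, in the spirit of the QT-Opt approach cited above. The critic_q function, the CEM settings, and the reward values are hypothetical toy stand-ins; the regression of the critic parameters toward the target is omitted.

import numpy as np

GAMMA = 0.9

def critic_q(state, action):
    # Toy critic Q(s, a) with its optimum at action = 0.3 in each dimension.
    return -np.sum((action - 0.3) ** 2, axis=-1)

def cem_argmax(state, action_dim=2, n=64, m=6, iters=2, seed=1):
    # Approximate argmax_a' Q(s', a') by iteratively refitting a Gaussian to the elite samples.
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(action_dim), np.ones(action_dim)
    for _ in range(iters):
        samples = rng.normal(mean, std, size=(n, action_dim))
        elite = samples[np.argsort(critic_q(state, samples))[-m:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean

def q_learning_target(reward, next_state, done):
    if done:
        return reward
    best_next_action = cem_argmax(next_state)
    return reward + GAMMA * critic_q(next_state, best_next_action)

target = q_learning_target(reward=1.0, next_state=None, done=False)
# The critic parameters would then be regressed toward `target` (update omitted).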
Zhang-Haarnoja-Nair-Zhang(2)-Kalashnikov teach the limitations of claim 1 but do not teach subsequent to pre-training the actor network and pre-training the critic network: performing robotic episodes, each performed based on the actor network and/or the critic network, further training the actor network and the critic network using reinforcement learning and online episode data from robotic episodes each performed based on the actor network and/or the critic network, wherein further training the actor network and the critic network comprises: further training the actor network based on a first set of the episode data, and using the advantage-weighted regression training objective, and further training the critic network based on a second set of the episode data, and using Q-learning and the CEM, wherein the second set includes a given quantity of unsuccessful episode data, that is from unsuccessful episodes of the robotic episodes, and wherein the given quantity is greater than an alternate quantity, of the unsuccessful episode data, that is included in the first set.
Vecerik teaches,
subsequent to pre-training the actor network and pre-training the critic network: performing robotic episodes, each performed based on the actor network and/or the critic network [Abstract, We present results of simulation experiments on a set of robot insertion problems involving rigid and flexible objects; Sect 2, para 1, Deep Deterministic Policy Gradient (DDPG) [7] is an actor-critic algorithm which directly uses the gradient of the Q-function w.r.t. the action to train the policy. DDPG maintains a parameterized policy network π(.|θπ) (actor function) and a parameterized action-value function network (critic function) Q(.|θQ). It produces new transitions],
further training (Algorithm 1, Update) the actor network (Algorithm 1, Update the actor) and the critic network (Algorithm 1, Update the critic) using reinforcement learning (Sect 1, para 1, RL paradigms) and online episode data (Algorithm 1, Learning via interaction with the environment) from robotic episodes (Sect 1, para 1, kinesthetic demonstrations to guide a deep-RL algorithm) [Sect 1, para 1, In this paper we address this challenge by combining the demonstration and RL paradigms into a single framework which uses kinesthetic demonstrations to guide a deep-RL algorithm]
[media_image1.png: greyscale image reproducing the cited Vecerik algorithm]
each performed (produces new transitions) based on the actor network and/or the critic network (actor-critic algorithm which directly uses the gradient of the Q-function w.r.t. the action to train the policy), [Sect 2, para 1, Deep Deterministic Policy Gradient (DDPG) [7] is an actor-critic algorithm which directly uses the gradient of the Q-function w.r.t. the action to train the policy. DDPG maintains a parameterized policy network π(.|θπ) (actor function) and a parameterized action-value function network (critic function) Q(.|θQ). It produces new transitions]
Vecerik is analogous to the claimed invention as they both relate to Reinforcement Learning methodologies. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang, Haarnoja, Nair, Zhang(2), and Kalashnikov’s teachings to incorporate the teachings of Vecerik and provide training the actor and critic network using online episode data from robotic episodes [Vecerik, Abstract] in order to reduce common exploration issues.
Zhang-Haarnoja-Nair-Zhang(2)-Kalashnikov-Vecerik teach the limitations of claim 1 including using the advantage-weighted regression training objective, using Q-learning and the CEM (see above).
However, Zhang-Haarnoja-Nair-Zhang(2)-Kalashnikov-Vecerik do not teach wherein further training the actor network and the critic network comprises: further training the actor network based on a first set of the episode data, and further training the critic network based on a second set of the episode data, wherein the second set includes a given quantity of unsuccessful episode data, that is from unsuccessful episodes of the robotic episodes, and wherein the given quantity is greater than an alternate quantity, of the unsuccessful episode data, that is included in the first set.
Yan teaches,
wherein further training the actor network and the critic network comprises: further training the actor network based on a first set of the episode data [Sect VI (B), para 1, for training the actor model only successful grasps are used]
further training the critic network based on a second set of the episode data [Sect VI (B), para 1, For training the critic model, both successful and failed grasps are used]
wherein the second set includes a given quantity of unsuccessful episode data, that is from unsuccessful episodes of the robotic episodes [Sect VII, para 2, the density model normalizes over the action space, thus assumes every action that is not included in the dataset of successful grasps is failure], and
wherein the given quantity is greater than an alternate quantity, of the unsuccessful episode data, that is included in the first set [Sect VI (D), para 1, Both models are trained on the same dataset of real robot grasps, but for the actor model only successful grasps are used.]
Yan is analogous to the claimed invention as they both relate to reinforcement learning methodologies. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang, Haarnoja, Nair, Zhang(2), Kalashnikov, and Vecerik’s teachings to incorporate the teachings of Yan and provide training the actor network based on a first set of the episode data, and training the critic network based on a second set of the episode data, wherein the second set includes a given quantity of unsuccessful episode data, that is from unsuccessful episodes of the robotic episodes, and wherein the given quantity is greater than an alternate quantity, of the unsuccessful episode data, that is included in the first set in order to improve learning outcomes by drawing from diversified datasets.
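For illustration only, a minimal sketch of the data split Yan describes: the actor is further trained on a first set containing only successful episodes, while the critic is further trained on a second set containing both successful and unsuccessful episodes, so the second set holds a greater quantity of unsuccessful episode data than the first. The episode records below are hypothetical.

episodes = [
    {"id": 0, "success": True},
    {"id": 1, "success": False},
    {"id": 2, "success": True},
    {"id": 3, "success": False},
]

first_set = [e for e in episodes if e["success"]]                   # actor training data (successes only)
second_set = list(episodes)                                         # critic training data (all episodes)

unsuccessful_in_first = sum(not e["success"] for e in first_set)    # 0
unsuccessful_in_second = sum(not e["success"] for e in second_set)  # 2, greater than the first set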
Regarding claim 2, Zhang-Haarnoja-Nair-Zhang(2)-Kalashnikov-Vecerik-Yan teach the limitations of claim 1.
Yan further teaches,
wherein the alternate quantity is zero and wherein the first set includes only successful episode data that is from successful episodes of the robotic episodes [Sect VI (D), para 1, Both models are trained on the same dataset of real robot grasps, but for the actor model only successful grasps are used.].
Yan is analogous to the claimed invention as they both relate to reinforcement learning methodologies. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang, Haarnoja, Nair, Zhang(2), Kalashnikov, and Vecerik’s teachings to incorporate the teachings of Yan and provide a first set that only include successful episode data [Yan, Sect VI (D), para 1] in order to train exploration methods with data containing significantly higher rates of success.
Regarding claim 3, Zhang-Haarnoja-Nair-Zhang(2)-Kalashnikov-Vecerik-Yan teach the limitations of claims 1 and 2.
Yan further teaches,
wherein the second set includes the successful episode data that is also included in the first set and includes the unsuccessful episode data [Sect VI (D), para 1, Both models are trained on the same dataset of real robot grasps, but for the actor model only successful grasps are used.].
Yan is analogous to the claimed invention as they both relate to reinforcement learning methodologies. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang, Haarnoja, Nair, Zhang(2), Kalashnikov, and Vecerik’s teachings to incorporate the teachings of Yan and provide a second set including successful episode data that is also in the first set in order for policy updates to enhance decision making by drawing from exploration results and failure results.
Regarding claim 4, Zhang-Haarnoja-Nair-Zhang(2)-Kalashnikov-Vecerik-Yan teach the limitations of claim 1.
Yan further teaches,
wherein the alternate quantity, of the unsuccessful episode data, of the first set, is greater than zero, and wherein the unsuccessful episode data of the first set is a subset of the unsuccessful episode data that is included in the second set [Sect VI (B), para 1, For training the critic model, both successful and failed grasps are used, while for training the actor model only successful grasps are used. However, we report dataset size as the number of actions tried, including successful and failed ones, even for the actor model.].
Yan is analogous to the claimed invention as they both relate to reinforcement learning methodologies. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang, Haarnoja, Nair, Zhang(2), Kalashnikov, and Vecerik’s teachings to incorporate the teachings of Yan and provide the unsuccessful episode data of the first set being a subset of the unsuccessful episode data that is included in the second set [Yan, Sect VI (B), para 1] to allow for CEM optimization which enhances learning methods of the model.
Regarding claim 5, Zhang-Haarnoja-Nair-Zhang(2)-Kalashnikov-Vecerik-Yan teach the limitations of claims 1 and 4.
Yan further teaches,
wherein the ratio of the successful episode data to the unsuccessful episode data, included in the first set, is greater than three to one.
[Sect VI (D), para 2, The average success rate of each run using the three presented methods are summarized in the table below (Note: showing greater than three to one).]
[media_image2.png: greyscale image of the referenced table of average success rates]
Yan is analogous to the claimed invention as they both relate to reinforcement learning methodologies. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang, Haarnoja, Nair, Zhang(2), Kalashnikov, and Vecerik’s teachings to incorporate the teachings of Yan and provide a calculation of a success ratio in order to qualify the performance of the learning model.
Regarding claim 6, Zhang-Haarnoja-Nair-Zhang(2)-Kalashnikov-Vecerik-Yan teach the limitations of claims 1, 4, and 5.
Kalashnikov further teaches,
wherein the ratio of the successful episode data to the unsuccessful episode data, included in the first set, is greater than ten to one [Abstract, we introduce QT-Opt, a scalable self-supervised vision-based reinforcement learning framework that can leverage over 580k real-world grasp attempts to train a deep neural network Q-function with over 1.2M parameters to perform closed-loop, real-world grasping that generalizes to 96% grasp success on unseen objects.].
Kalashnikov is analogous to the claimed invention as they both relate to reinforcement learning methodologies. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang, Haarnoja, Nair, Zhang(2), Vecerik, and Yan’s teachings to incorporate the teachings of Kalashnikov and provide a calculation of a success ratio in order to qualify the performance of the learning model.
Regarding claim 7, Zhang-Haarnoja-Nair-Zhang(2)-Kalashnikov-Vecerik-Yan teach the limitations of claim 1.
Yan further teaches,
generating the first set based on data from the robotic episodes [Sect IV (B), para 2, collecting a dataset of successful grasps];
generating the second set based on filtering, from the first set, at least a majority of the unsuccessful episode data [Sect VI (D), para 1, We trained and evaluated the actor model and the critic model on real KUKA robots. Both models are trained on the same dataset of real robot grasps, but for the actor model only successful grasps are used.].
Yan is analogous to the claimed invention as they both relate to reinforcement learning methodologies. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang, Haarnoja, Nair, Zhang(2), Kalashnikov, and Vecerik’s teachings to incorporate the teachings of Yan and provide generating datasets for inputting into a machine learning model to perform its intended functionality.
Regarding claim 8, Zhang-Haarnoja-Nair-Zhang(2)-Kalashnikov-Vecerik-Yan teach the limitations of claims 1 and 7.
Vecerik further teaches,
populating, over time, a replay buffer with the first set [Sect 2, para 1, DDPG maintains a parameterized policy network π(.|θπ) (actor function) and a parameterized action-value function network (critic function) Q(.|θπ). It produces new transitions e = (s, a, r = R(s, a), s’ ~ P(.|s, a)) by acting according to a = π (s|θπ) + N where N is a random process allowing action exploration. Those transitions are added to a replay buffer B.];
and wherein further training the actor network based on the first set comprises sampling the episode data of the first set from the replay buffer [Sect 3, para 1, The demonstrations are of the form of RL transitions: (s, a, s’, r). DDPGfD loads the demonstration transitions into the replay buffer before the training begins and keeps all transitions forever… Prioritized experience replay [13] modifies the agent to sample more important transitions from its replay buffer more frequently.].
Vecerik is analogous to the claimed invention as they both relate to reinforcement learning methodologies. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang, Haarnoja, Nair, Zhang(2), Kalashnikov, and Yan’s teachings to incorporate the teachings of Vecerik and provide the population and utilization of a replay buffer [Vecerik, Sect 3, para 1] in order to enable efficient propagation of the reward information (which is essential in problems with sparse rewards) and modify an agent to sample more important transitions using the replay buffer.
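For illustration only, a minimal sketch of a replay buffer that is populated over time and sampled from for further training, with priority-weighted sampling of the kind Vecerik's DDPGfD describes (more important transitions sampled more frequently). The transition fields and the priority values are hypothetical.

import random

class ReplayBuffer:
    def __init__(self):
        self.transitions = []
        self.priorities = []

    def add(self, transition, priority=1.0):
        # Demonstration transitions can be loaded before training and kept;
        # online transitions are added as episodes are performed.
        self.transitions.append(transition)
        self.priorities.append(priority)

    def sample(self, batch_size):
        # Sampling probability proportional to each transition's priority.
        return random.choices(self.transitions, weights=self.priorities, k=batch_size)

buffer = ReplayBuffer()
buffer.add({"s": 0, "a": 1, "r": 1.0, "s_next": 1}, priority=2.0)  # demonstration transition
buffer.add({"s": 1, "a": 0, "r": 0.0, "s_next": 2}, priority=1.0)  # online transition
batch = buffer.sample(batch_size=2)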
Regarding claim 9, Zhang-Haarnoja-Nair-Zhang(2)-Kalashnikov-Vecerik-Yan teach the limitations of claims 1, 7, and 8.
Vecerik further teaches,
Populating (Sect 3, para 1, loads) the replay buffer (Sect 3, para 1, replay buffer) with a goal to maintain a particular ratio of data (Sect 3, para 1, controlling the ratio), that is in the replay buffer (Sect 3, para 1, from its replay buffer) [Sect 3, para 1, DDPGfD loads the demonstration transitions into the replay buffer before the training begins and keeps all transitions forever. DDPGfD uses prioritized replay to enable efficient propagation of the reward information, which is essential in problems with sparse rewards. Prioritized experience replay [13] modifies the agent to sample more important transitions from its replay buffer more frequently. The probability of sampling a particular transition i is proportional to its priority… where pi is the priority of the transition… To account for the change in the distribution, updates to the network are weighted with importance sampling weights… DDPGfD uses a = 0.3 and B = 1 as we want to learn about the correct distribution from the very beginning. In addition, the prioritized replay is used to prioritize samples between the demonstration and agent data, controlling the ratio of data between the two in a natural way.]
Vecerik is analogous to the claimed invention as they both relate to reinforcement learning methodologies. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang, Haarnoja, Nair, Zhang(2), Kalashnikov, and Yan’s teachings to incorporate the teachings of Vecerik and provide maintaining a particular ratio of a replay buffer [Vecerik, Sect 3, para 1] in order to enable efficient propagation of the reward information (which is essential in problems with sparse rewards) and modify an agent to sample more important transitions.
Zhang-Haarnoja-Nair-Zhang(2)-Kalashnikov-Vecerik do not teach data being successful episode data to the unsuccessful episode data.
Yan further teaches,
data being successful episode data to the unsuccessful episode data [Sect VI (B), para 1, For training the critic model, both successful and failed grasps are used, while for training the actor model only successful grasps are used. However, we report dataset size as the number of actions tried, including successful and failed ones, even for the actor model]
Yan is analogous to the claimed invention as they both relate to reinforcement learning methodologies. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang, Haarnoja, Nair, Zhang(2), Kalashnikov, and Vecerik’s teachings to incorporate the teachings of Yan and provide the data being successful data to unsuccessful data in order to qualify the performance of the learning model.
Regarding claim 12, Zhang-Haarnoja-Nair-Zhang(2)-Kalashnikov-Vecerik-Yan teach the limitations of claim 1.
Kalashnikov further teaches,
wherein the first exploration strategy (Sect 5, para 5, scripted policy… described in… Appendix B) is a CEM policy (Sect 4.2, cross-entropy method (CEM)) in which CEM is performed [Sect 4.2, In our algorithm, which we call QT-Opt… We use the cross-entropy method (CEM) to perform this optimization; Sect 5, para 5, We switched to using the learned QT-Opt policy once it reached a success rate of 50%. The scripted policy is described in the supplementary material, in Appendix B.],
and wherein the second exploration (Appx. B, para 4, πnoisy) strategy is a greedy (Appx. B, para 4, This exploration policy uses epsilon-greedy exploration) Gaussian policy in which a Gaussian probability distribution (Appx. B, para 4, Gaussian with probability) [Appx. B, para 4, During the later stages of training, we switch to data collection with πnoisy. This exploration policy uses epsilon-greedy exploration to trade off between choosing exploration actions or actions that maximize the Q-function estimate. The policy πnoisy chooses a random action with probability ε = 20%, otherwise the greedy action is chosen. To choose a random action, πnoisy samples a pose change t, r from a Gaussian with probability],
Kalashnikov is analogous to the claimed invention as they both relate to reinforcement learning methodologies. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang, Haarnoja, Nair, Zhang(2), Vecerik, and Yan’s teachings to incorporate the teachings of Kalashnikov and provide multiple exploration strategies based on varying calculations [Kalashnikov, Appx. C.1] in order to generate diverse datasets that result in improved policies and outcomes.
Zhang-Haarnoja-Nair-Zhang(2)-Kalashnikov-Vecerik do not teach, using the critic network and sampled actions, and results from the CEM are utilized in selecting an action; and exploration generated using the actor network based on a corresponding state and corresponding to candidate actions, is utilized in selecting an action.
Yan further teaches,
using the critic network (Sect VI (D), using the critic model) and sampled actions (Sect VI (D), initial Gaussian), and results from the CEM (Sect VI (D), evaluating the CEM) are utilized in selecting an action (Sect VI (D), action with highest predicted value is selected) [Sect VI (D), For evaluating the CEM method using the critic model, we set the initial Gaussian to have a standard deviation of 15cm in horizontal direction, 6cm in vertical direction and 90◦ in rotation. This distribution is chosen to cover the space of the tray. The CEM is run for 3 iterations (see [15]) and the action with highest predicted value is selected.];
and exploration (Sect VI (D), a policy) generated (Sect VI (D), evaluated) using the actor network (Sect VI (D), actor model) based on a corresponding state (Sect VI (D), at each time step) and corresponding to candidate actions (Sect VI (D), and taking the action), is utilized in selecting an action (Sect VI (D), action scored highest by the critic model is selected) [Sect VI (D), We evaluate the actor model by predicting 64 samples at each time step and taking the action with the highest probability density… We also evaluated a policy that combines both the actor model and the critic model, where the actor model predicts 64 samples, and the samples are evaluated by the critic model. Finally the action scored highest by the critic model is selected.].
Yan is analogous to the claimed invention as they both relate to reinforcement learning methodologies. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang, Haarnoja, Nair, Zhang(2), Vecerik, and Kalashnikov’s teachings to incorporate the teachings of Yan and provide the various exploration strategies incorporated with the actor-critic approach in order to efficiently adjust policy parameters and improve sample efficiency of reinforcement learning methodologies.
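For illustration only, a minimal sketch of the combined policy Yan Sect VI (D) describes: the actor model proposes a batch of candidate actions (64 samples), the critic model scores each candidate, and the highest-scoring action is selected. The actor_sample and critic_q functions are hypothetical toy stand-ins for the actor and critic models.

import numpy as np

def actor_sample(state, n=64, action_dim=2, seed=2):
    # Candidate actions proposed by the actor model (toy Gaussian samples here).
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, 1.0, size=(n, action_dim))

def critic_q(state, actions):
    # Toy critic scores; higher is better.
    return -np.sum((actions - 0.3) ** 2, axis=-1)

state = None
candidates = actor_sample(state)
scores = critic_q(state, candidates)
selected_action = candidates[np.argmax(scores)]   # the action scored highest by the critic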
Regarding claim 13, Zhang-Haarnoja-Nair-Zhang(2)-Kalashnikov-Vecerik-Yan teach the limitations of claim 1.
Kalashnikov further teaches,
selecting the first strategy at a first rate [Sect 5, para 5, This policy is randomized, but biased toward reasonable grasps, and achieves a success rate around 15-30%. We switched to using the learned QT-Opt policy once it reached a success rate of 50%. The scripted policy is described in the supplementary material, in Appendix B.] and selecting the second strategy at a second rate that is less than the first rate [Appx. B, para 4, πnoisy samples a pose change t; r from a Gaussian… a toggle gripper action gopen, gclose with probability 17%].
Kalashnikov is analogous to the claimed invention as they both relate to reinforcement learning methodologies. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang, Haarnoja, Nair, Zhang(2), Vecerik, and Yan’s teachings to incorporate the teachings of Kalashnikov and provide selecting the different exploration strategies at different rates [Kalashnikov, Appx. C.1] in order to optimize successful episode data by utilizing the various strategies with varying success rates at differing stages of training.
Regarding claim 14, Zhang-Haarnoja-Nair-Zhang(2)-Kalashnikov-Vecerik-Yan teach the limitations of claim 13.
Yan further teaches,
adjusting, the first rate (Sect VI (D), para 1, 3 iterations) and the second rate (Sect VI (D), para 1, predicting 64 samples at each time step) after performing at least a threshold quantity of the robotic episodes (Sect VI (D), para 1, highest predicted value; the highest probability density), wherein adjusting the first rate and the second rate comprises making the first rate and the second rate closer to one another (Sect VI (D), para 1, the actor model predicts 64 samples, and the samples are evaluated by the critic model) [Sect VI (D), para 1, We trained and evaluated the actor model and the critic model on real KUKA robots. Both models are trained on the same dataset of real robot grasps, but for the actor model only successful grasps are used. We evaluate the actor model by predicting 64 samples at each time step and taking the action with the highest probability density. For evaluating the CEM method using the critic model, we set the initial Gaussian to have a standard deviation of 15cm in horizontal direction, 6cm in vertical direction and 90◦ in rotation. This distribution is chosen to cover the space of the tray. The CEM is run for 3 iterations (see [15]) and the action with highest predicted value is selected. We also evaluated a policy that combines both the actor model and the critic model, where the actor model predicts 64 samples, and the samples are evaluated by the critic model. Finally the action scored highest by the critic model is selected].
Yan is analogous to the claimed invention as they both relate to reinforcement learning methodologies. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang, Haarnoja, Nair, Zhang(2), Vecerik, and Kalashnikov’s teachings to incorporate the teachings of Yan and provide a convergence adjustment of the rates [Kalashnikov, Appx. C.1] in order to collect data from varying behaviors, which improves episodic outcomes.
Regarding claim 18, Zhang teaches,
A method implemented by one or more processors [Fig. 1], the method comprising: pre-training (Para 0133, pre-trains) an actor network and a critic network (Para 0133, the DDPG network offline; Para 0078, Generally speaking, the DDPG network application is based on the Actor-Critic method) using reinforcement learning (Para 0133, reinforcement learning) and robotic demonstration data (Para 0071, number of training rounds) from demonstrated robotic episodes (Para 0071, the 2D dummy, reaching the end point from the starting point) [Para 0070, Step 1: Collect the training data of 2D dummies in an offline environment and preprocess the training data to obtain a training data set; Para 0071, The DDPG algorithm and the improved DDPG algorithm based on the offline model were used to train for 4000 rounds respectively, and the relationship between the feedback reward value of the robot, i.e. the 2D dummy, reaching the end point from the starting point and the number of training rounds was analyzed; Para 0133, The present invention firstly utilizes a large amount of offline data to train the object state model and reward model, and then pre-trains the DDPG network offline through a model-based reinforcement learning method to improve the decision-making ability of the network offline, thereby accelerating the subsequent online learning efficiency and performance; Para 0078, Generally speaking, the DDPG network application is based on the Actor-Critic method, so it has a policy neural network and a value-based neural network. It includes a policy network for generating actions and a value network for judging the quality of actions].
Zhang does not teach wherein the actor network is a first neural network model that represents a policy, wherein the critic network is a second neural network model that represents a Q-function, and wherein pre-training the actor network and the critic network comprises; pre-training the actor network using an advantage-weighted regression training objective, the advantage-weighted regression training objective utilizing values generated using the second neural network model, and pre-training the critic network based on the robotic demonstration data and using Q-learning and a cross-entropy method (CEM), subsequent to pre-training the actor network and pre-training the critic network: performing online robotic episodes based on the actor network and/or the critic network, wherein performing each of the robotic episodes comprises: selecting, from at least a first exploration strategy and a second exploration strategy, a selected exploration strategy for: the robotic episode as a whole, or each of multiple steps of the robotic episode; determining robotic actions to perform, in the robotic episode, according to the selecting; further training the actor network and the critic network using reinforcement learning and online episode data from the robotic episodes, wherein further training the actor network and the critic network comprises: further training the actor network based on a first set of the episode data, and using the advantage-weighted regression training objective, and further training the critic network based on a second set of the episode data, and using Q-learning and CEM.
Haarnoja teaches,
wherein the actor network is a first neural network model (Sect 4.1, para 1, neural networks) that represents (Sect 4.1, para 1, given by) a policy (Sect 4.1, para 1, the policy) [Sect 4.1, para 1, the policy as a Gaussian with mean and covariance given by neural networks],
wherein the critic network is a second neural network model that represents a Q-function [Sect 4.1, para 1, the soft Q-function can be modeled as expressive neural networks],
Haarnoja is analogous to the claimed invention as they both relate to Reinforcement Learning methodologies. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang’s teachings to incorporate the teachings of Haarnoja and provide the actor network representing a policy and the critic network representing a Q-function in order to attain optimal results from the combined methodologies of policy usage and exploration.
Zhang-Haarnoja do not teach wherein pre-training the actor network and the critic network comprises; pre-training the actor network using an advantage-weighted regression training objective, the advantage-weighted regression training objective utilizing values generated using the second neural network model, and pre-training the critic network based on the robotic demonstration data and using Q-learning and a cross-entropy method (CEM), subsequent to pre-training the actor network and pre-training the critic network: performing online robotic episodes based on the actor network and/or the critic network, wherein performing each of the robotic episodes comprises: selecting, from at least a first exploration strategy and a second exploration strategy, a selected exploration strategy for: the robotic episode as a whole, or each of multiple steps of the robotic episode; determining robotic actions to perform, in the robotic episode, according to the selecting; further training the actor network and the critic network using reinforcement learning and online episode data from the robotic episodes, wherein further training the actor network and the critic network comprises: further training the actor network based on a first set of the episode data, and using the advantage-weighted regression training objective, and further training the critic network based on a second set of the episode data, and using Q-learning and CEM.
Nair teaches,
and wherein pre-training the actor network and the critic network comprises: pre-training (Sect I, pg. 2, col 1, para 2, pre-training) the actor network (Sect IV, pg. 5, col 2, para 1, we can parameterize the actor… by neural networks) using an advantage-weighted regression training objective (Sect I, pg. 2, col 1, para 2, AWAC), the advantage-weighted regression training objective utilizing (Sect II, col 2, para 2, updated based on) values (Sect II, col 2, para 2, current estimate of Qπ) generated (Sect II, col 2, para 2, estimated) using the second neural network model (Sect II, col 2, para 2, the critic Qπ (s, a)) [Sect I, pg. 1, col 2, para 4, In this work, we study how to build RL algorithms that are effective for pre-training from off-policy datasets; Sect I, pg. 2, col 1, para 2, The contribution of this work is not just another RL algorithm, but a systematic study of what makes offline pre-training with online fine-tuning unique compared to the standard RL paradigm, which then directly motivates a simple algorithm, AWAC, to address these challenges; Sect II, col 2, para 2, During the policy evaluation phase, the critic Qπ (s, a) is estimated for the current policy π… During policy improvement, the actor π is typically updated based on the current estimate of Qπ; Sect IV, pg. 5, col 2, para 1, we can parameterize the actor and the critic by neural networks],
Nair is analogous to the claimed invention as they both relate to Reinforcement Learning methodologies. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang and Haarnoja’s teachings to incorporate the teachings of Nair and provide an advantage-weighted regression training objective [Nair, Abstract] in order to enable rapid learning of skills with a combination of prior demonstration data and online experience.
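For illustration only, a hedged sketch of an advantage-weighted regression (AWAC-style) actor objective in which the weights are derived from values produced by the critic network. It reuses the illustrative GaussianPolicyActor and QCritic classes sketched above; the baseline estimate and the temperature beta are assumptions, not the references' exact formulation.

```python
import torch

def awr_actor_loss(actor, critic, states, actions, beta=1.0):
    """Advantage-weighted regression sketch: weight the log-likelihood of
    demonstrated actions by exp(advantage / beta), where the advantage is
    computed from critic (Q-function) values."""
    mean, std = actor(states)
    dist = torch.distributions.Normal(mean, std)
    log_prob = dist.log_prob(actions).sum(dim=-1)
    with torch.no_grad():
        q = critic(states, actions).squeeze(-1)
        # Illustrative baseline: Q evaluated at an action sampled from the current policy.
        v = critic(states, dist.sample()).squeeze(-1)
        weights = torch.exp((q - v) / beta)
    return -(weights * log_prob).mean()
```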
Zhang-Haarnoja-Nair do not teach pre-training the critic network based on the robotic demonstration data and using Q-learning and a cross-entropy method (CEM), subsequent to pre-training the actor network and pre-training the critic network: performing online robotic episodes based on the actor network and/or the critic network, wherein performing each of the robotic episodes comprises: selecting, from at least a first exploration strategy and a second exploration strategy, a selected exploration strategy for: the robotic episode as a whole, or each of multiple steps of the robotic episode; determining robotic actions to perform, in the robotic episode, according to the selecting; further training the actor network and the critic network using reinforcement learning and online episode data from the robotic episodes, wherein further training the actor network and the critic network comprises: further training the actor network based on a first set of the episode data, and using the advantage-weighted regression training objective, and further training the critic network based on a second set of the episode data, and using Q-learning and CEM.
Zhang(2) teaches,
pre-training (Sect 4, pg. 4, para 6, To pretrain) the critic network (Sect 4, pg. 4, para 6, actor-critic RL algorithms like DDPG and ACER) based on the robotic demonstration data (Sect 4, pg. 4, col 1, para 2, HalfCheetah (6D), Hopper (3D), and Walker2d (6D)) [Sect 4, pg. 4, col 1, para 2, Therefore if there exist some expert demonstrations that perform better than initial policies, we can introduce the data using constraint (3), in order to obtain a more accurate estimator Qw(s, a); Sect 5, para 4, With DDPG as baseline, we apply our algorithm to low dimensional simulation environments using the MuJoCo physics engine [Todorov et al., 2012], and test on tasks with action dimensionality are: HalfCheetah (6D), Hopper (3D), and Walker2d (6D); Sect 4, pg. 4, para 6, To pretrain actor-critic RL algorithms like DDPG and ACER, we add gradients g*_Q and g*_π to the original gradients of the algorithms; Sect 4.1, para 2, Two neural networks are used in DDPG at the same time. One is named critic network]
Zhang(2) is analogous to the claimed invention as they both relate to Reinforcement Learning methodologies. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang, Haarnoja, and Nair’s teachings to incorporate the teachings of Zhang(2) and provide pre-training the critic network based on demonstration data [Zhang(2), Abstract] in order to speed up the training process.
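For illustration only, a minimal sketch of pre-training a critic on demonstration transitions with a Q-learning style target. The function and variable names (demo_batch, target_critic) and the use of a target network are assumptions of the sketch, not the references' exact procedure.

```python
import torch
import torch.nn.functional as F

def pretrain_critic_on_demos(critic, target_critic, demo_batch, optimizer, gamma=0.99):
    """One Q-learning style pre-training step on robotic demonstration data:
    regress Q(s, a) toward r + gamma * Q_target(s', a') taken from the demonstration."""
    s, a, r, s_next, a_next = demo_batch  # tensors drawn from demonstrated episodes
    with torch.no_grad():
        target = r + gamma * target_critic(s_next, a_next).squeeze(-1)
    loss = F.mse_loss(critic(s, a).squeeze(-1), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```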
Zhang-Haarnoja-Nair-Zhang(2) do not teach using Q-learning and a cross-entropy method (CEM), subsequent to pre-training the actor network and pre-training the critic network: performing online robotic episodes based on the actor network and/or the critic network, wherein performing each of the robotic episodes comprises: selecting, from at least a first exploration strategy and a second exploration strategy, a selected exploration strategy for: the robotic episode as a whole, or each of multiple steps of the robotic episode; determining robotic actions to perform, in the robotic episode, according to the selecting; further training the actor network and the critic network using reinforcement learning and online episode data from the robotic episodes, wherein further training the actor network and the critic network comprises: further training the actor network based on a first set of the episode data, and using the advantage-weighted regression training objective, and further training the critic network based on a second set of the episode data, and using Q-learning and CEM.
Kalashnikov teaches,
using Q-learning (Sect 1, pg. 2, para 2, Q-learning) and a cross-entropy method (CEM) (Sect 1, pg. 2, para 2, QTOpt), [Sect 1, pg. 2, para 2, To make maximal use of this diverse dataset, we propose an off-policy training method based on a continuous-action generalization of Q-learning, which we call QTOpt… QT-Opt dispenses with the need to train an explicit actor; Abstract, we introduce QT-Opt, a scalable self-supervised vision-based reinforcement learning framework that can leverage over 580k real-world grasp attempts to train a deep neural network]
wherein performing each of the robotic episodes comprises: selecting [Appx. B, para 2-4, During the early stages of training… we collect our initial data for training using a scripted policy πscripted… During the later stages of training, we switch to data collection with πnoisy], from at least a first exploration strategy (Appx. B, πscripted) and a second exploration strategy (Appx. B, πnoisy), a selected exploration strategy for: the robotic episode as a whole, or each of multiple steps of the robotic episode [Appx. B, para 1, For data collection, we used two different exploration policies πscripted, πnoisy at different stages of training];
determining robotic actions to perform, in the robotic episode, according to the selecting [Appx. B, The πscripted simplifies the multi-step exploration of the problem by randomly choosing an (x, y) coordinate above the table, lowering the open gripper to table level in a few random descent steps, closing the gripper, then returning to the original height in a few ascent steps.];
Kalashnikov is analogous to the claimed invention as they both relate to Reinforcement Learning methodologies. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang, Haarnoja, Nair, and Zhang(2)’s teachings to incorporate the teachings of Kalashnikov and provide using Q-learning and a CEM [Kalashnikov, sect 1, pg. 2, para 2] in order to address instability issues, as well as providing multiple exploration strategies [Kalashnikov, appx. C.1], as diverse data distributions result in improved policies.
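For illustration only, a hedged sketch of the cross-entropy method (CEM) being used to maximize a Q-function over continuous actions, in the general manner QT-Opt is described as doing. The population size, elite count, and iteration count are illustrative assumptions.

```python
import torch

def cem_maximize_q(critic, state, action_dim=4, iters=3, pop=64, elite=6):
    """CEM over actions: repeatedly sample candidate actions, score them with the
    critic's Q-estimate, refit a Gaussian to the elite candidates, and return the mean."""
    mean = torch.zeros(action_dim)
    std = torch.ones(action_dim)
    for _ in range(iters):
        candidates = mean + std * torch.randn(pop, action_dim)
        q = critic(state.expand(pop, -1), candidates).squeeze(-1)  # state: (1, state_dim)
        elites = candidates[q.topk(elite).indices]
        mean, std = elites.mean(dim=0), elites.std(dim=0) + 1e-6
    return mean
```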
Zhang-Haarnoja-Nair-Zhang(2)-Kalashnikov do not teach subsequent to pre-training the actor network and pre-training the critic network: performing online robotic episodes based on the actor network and/or the critic network, further training the actor network and the critic network using reinforcement learning and online episode data from the robotic episodes, wherein further training the actor network and the critic network comprises: further training the actor network based on a first set of the episode data, and using the advantage-weighted regression training objective, and further training the critic network based on a second set of the episode data, and using Q-learning and CEM.
Vecerik teaches,
subsequent to pre-training the actor network and pre-training the critic network: further training (Algorithm 1, Update) the actor network (Algorithm 1, Update the actor) and the critic network (Algorithm 1, Update the critic) using reinforcement learning (Sect 1, para 1, RL paradigms) and online episode data (Algorithm 1, Learning via interaction with the environment) from robotic episodes (Sect 1, para 1, kinesthetic demonstrations to guide a deep-RL algorithm) [Sect 1, para 1, In this paper we address this challenge by combining the demonstration and RL paradigms into a single framework which uses kinesthetic demonstrations to guide a deep-RL algorithm]
performing (Algorithm 1, learning) online robotic episodes (Algorithm 1, interaction with the environment) based on the actor network and/or the critic network (Algorithm 1, update the critic; update the actor)
[Vecerik, Algorithm 1 is reproduced as an image (media_image1.png) in the original record.]
Vecerik is analogous to the claimed invention as they both relate to Reinforcement Learning methodologies. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang, Haarnoja, Nair, Zhang(2), and Kalashnikov’s teachings to incorporate the teachings of Vecerik and provide training the actor and critic network using online episode data from robotic episodes [Vecerik, Abstract] in order to reduce common exploration issues.
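For illustration only, a minimal sketch of the claimed ordering as mapped: pre-training is completed first, then online robotic episodes are performed based on the actor and/or critic, and the online episode data is used to further train both networks. The environment interface, policy_fn, and update_fn are hypothetical placeholders.

```python
def online_fine_tuning(env, actor, critic, replay_buffer, policy_fn, update_fn, num_episodes=100):
    """After pre-training: perform online episodes, store the online episode data,
    and further train the actor and critic from that data."""
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            action = policy_fn(actor, critic, state)      # e.g. a selected exploration strategy
            next_state, reward, done = env.step(action)
            replay_buffer.add(state, action, reward, next_state, done)
            state = next_state
        update_fn(actor, critic, replay_buffer)           # actor and critic updates
```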
Zhang-Haarnoja-Nair-Zhang(2)-Kalashnikov-Vecerik teach the limitations of claim 18 including using the advantage-weighted regression training objective, using Q-learning and the CEM (see above).
However, Zhang-Haarnoja-Nair-Zhang(2)-Kalashnikov-Vecerik do not teach wherein further training the actor network and the critic network comprises: further training the actor network based on a first set of the episode data, and using the advantage-weighted regression training objective, and further training the critic network based on a second set of the episode data, and using Q-learning and CEM.
Yan teaches,
wherein further training the actor network and the critic network comprises: further training the actor network based on a first set of the episode data [Sect VI (B), para 1, for training the actor model only successful grasps are used]
further training the critic network based on a second set of the episode data [Sect VI (B), para 1, For training the critic model, both successful and failed grasps are used]
Yan is analogous to the claimed invention as they both relate to reinforcement learning methodologies. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang, Haarnoja, Nair, Zhang(2), Kalashnikov, and Vecerik’s teachings to incorporate the teachings of Yan and provide training the actor network based on a first set of the episode data and training the critic network based on a second set of the episode data, wherein the second set includes a given quantity of unsuccessful episode data that is from unsuccessful episodes of the robotic episodes, and wherein the given quantity is greater than an alternate quantity of the unsuccessful episode data that is included in the first set, in order to improve learning outcomes by drawing from diversified datasets.
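For illustration only, a sketch of the Yan-style data split relied on above: the actor's further training uses only successful episodes (the first set), while the critic's further training uses successful and failed episodes alike (the second set). The dictionary key "success" is an assumption of the sketch.

```python
def split_episode_data(episodes):
    """Return (first_set, second_set): actor data = successful episodes only;
    critic data = successful and failed episodes."""
    first_set = [ep for ep in episodes if ep["success"]]
    second_set = list(episodes)
    return first_set, second_set
```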
Regarding claim 19, Zhang-Haarnoja-Nair-Zhang(2)-Kalashnikov-Vecerik-Yan teach all the limitations of claim 18.
Kalashnikov further teaches,
wherein the first exploration strategy (Sect 5, para 5, scripted policy… described in… Appendix B) is a CEM policy (Sect 4.2, cross-entropy method (CEM)) in which CEM is performed [Sect 4.2, In our algorithm, which we call QT-Opt… We use the cross-entropy method (CEM) to perform this optimization; Sect 5, para 5, We switched to using the learned QT-Opt policy once it reached a success rate of 50%. The scripted policy is described in the supplementary material, in Appendix B.],
and wherein the second exploration (Appx. B, para 4, πnoisy) strategy is a greedy (Appx. B, para 4, This exploration policy uses epsilon-greedy exploration) Gaussian policy in which a Gaussian probability distribution (Appx. B, para 4, Gaussian with probability) [Appx. B, para 4, During the later stages of training, we switch to data collection with πnoisy. This exploration policy uses epsilon-greedy exploration to trade off between choosing exploration actions or actions that maximize the Q-function estimate. The policy πnoisy chooses a random action with probability ∈ = 20%, otherwise the greedy action is chosen. To choose a random action, πnoisy samples a pose change t, r from a Gaussian with probability],
Kalashnikov is analogous to the claimed invention as they both relate to reinforcement learning methodologies. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang, Haarnoja, Nair, Zhang(2), Vecerik, and Yan’s teachings to incorporate the teachings of Kalashnikov and provide multiple exploration strategies based on varying calculations [Kalashnikov, appx. C.1] in order to generate diverse datasets that result in improved policies and outcomes.
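For illustration only, a hedged sketch of the second (epsilon-greedy Gaussian) exploration strategy described in Kalashnikov's Appendix B: with probability epsilon a random action is sampled from a Gaussian, otherwise the greedy action (for example, the CEM maximizer of the Q-function) is taken. The noise scale is an illustrative assumption.

```python
import torch

def noisy_policy_action(greedy_action, epsilon=0.2, noise_std=0.1):
    """Epsilon-greedy Gaussian exploration: random Gaussian action with
    probability epsilon, greedy Q-maximizing action otherwise."""
    if torch.rand(()) < epsilon:
        return noise_std * torch.randn_like(greedy_action)
    return greedy_action
```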
Zhang-Haarnoja-Nair-Zhang(2)-Kalashnikov-Vecerik do not teach using the critic network and sampled actions, and results from the CEM are utilized in selecting an action; and exploration generated using the actor network based on a corresponding state and corresponding to candidate actions, is utilized in selecting an action.
Yan further teaches,
using the critic network (Sect VI (D), using the critic model) and sampled actions (Sect VI (D), initial Gaussian), and results from the CEM (Sect VI (D), evaluating the CEM) are utilized in selecting an action (Sect VI (D), action with highest predicted value is selected) [Sect VI (D), For evaluating the CEM method using the critic model, we set the initial Gaussian to have a standard deviation of 15cm in horizontal direction, 6cm in vertical direction and 90◦ in rotation. This distribution is chosen to cover the space of the tray. The CEM is run for 3 iterations (see [15]) and the action with highest predicted value is selected.];
and exploration (Sect VI (D), a policy) generated (Sect VI (D), evaluated) using the actor network (Sect VI (D), actor model) based on a corresponding state (Sect VI (D), at each time step) and corresponding to candidate actions (Sect VI (D), and taking the action), is utilized in selecting an action (Sect VI (D), action scored highest by the critic model is selected) [Sect VI (D), We evaluate the actor model by predicting 64 samples at each time step and taking the action with the highest probability density… We also evaluated a policy that combines both the actor model and the critic model, where the actor model predicts 64 samples, and the samples are evaluated by the critic model. Finally the action scored highest by the critic model is selected.].
Yan is analogous to the claimed invention as they both relate to reinforcement learning methodologies. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang, Haarnoja, Nair, Zhang(2), Vecerik, and Kalashnikov’s teachings to incorporate the teachings of Yan and provide the various exploration strategies incorporated with actor-critic in order to efficiently adjust policy parameters and improve sample efficiency of reinforcement learning methodologies.
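For illustration only, a sketch of the combined selection relied on from Yan: the actor proposes candidate actions for the current state, the critic scores each candidate, and the highest-scoring action is selected. It reuses the illustrative actor/critic sketches above; the sample count is an assumption.

```python
import torch

def actor_proposed_critic_ranked_action(actor, critic, state, num_samples=64):
    """Actor proposes candidates; critic scores them; select the best."""
    mean, std = actor(state)                                  # state: (1, state_dim)
    candidates = mean + std * torch.randn(num_samples, mean.shape[-1])
    scores = critic(state.expand(num_samples, -1), candidates).squeeze(-1)
    return candidates[scores.argmax()]
```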
Claim(s) 15 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Zhang in view of Haarnoja, Nair, Zhang(2), Kalashnikov, Vecerik, and Yan, and in further view of Zhang et al. (WO 2020062911 A1, see attached document), hereinafter Zhang(3).
Regarding claim 15, Zhang-Haarnoja-Nair-Zhang(2)-Kalashnikov-Vecerik-Yan teach the limitations of claim 1, including sampling using CEM (see claim 1).
Zhang-Haarnoja-Nair-Zhang(2)-Kalashnikov-Vecerik-Yan do not teach processing state data of the instance, using the actor network, to generate actor network output; selecting an actor action based on the actor network output; processing the state data and the actor action, using the critic network, to generate an actor action measure for the actor action; processing the state data and each of multiple candidate actions, using the critic network, to generate a corresponding candidate action measure for each of the candidate actions; determining, from amongst the actor action measure and the corresponding candidate action measures, a maximum measure; and using the maximum measure in training of the critic network.
Zhang(3) teaches,
processing state data of the instance, using the actor network, to generate actor network output [Para 0004, at each actor neural network among a plurality of actor neural networks, receiving a current state of the environment for a time step and outputting a continuous action for the current state based on a deterministic policy approximated by the actor neural network, thereby outputting a plurality of continuous actions];
selecting an actor action based on the actor network output [Para 0004, receiving… the continuous action output by each respective actor neural network];
processing the state data (Para 0004, current state of the environment) and the actor action (Para 0004, continuous action), using the critic network (Para 0004, approximated by the critic neural network), to generate (Para 0004, outputting) an actor action measure for the actor action (Para 0004, a state-action value for… the respective continuous action) [Para 0004, at a critic neural network, receiving the current state of the environment and the continuous action output by each respective actor neural network and outputting a state-action value for the state and the respective continuous action based on a state-action value function approximated by the critic neural network, thereby outputting a plurality of states action values, each state-action value, among the plurality of state-action values, associated with a continuous action among the plurality of continuous actions];
processing the state data (Para 0004, the current state of the environment) and each of multiple candidate actions (Para 0004, continuous action output by each respective actor), using the critic network (Para 0004, by the critic neural network), to generate (Para 0004, outputting) a corresponding candidate action measure for each of the candidate actions (Para 0004, a state-action value for… the respective continuous action) [Para 0004, at a critic neural network, receiving the current state of the environment and the continuous action output by each respective actor neural network and outputting a state-action value for the state and the respective continuous action based on a state-action value function approximated by the critic neural network, thereby outputting a plurality of states action values, each state-action value, among the plurality of state-action values, associated with a continuous action among the plurality of continuous actions];
determining (Para 0004, selecting), from amongst the actor action measure (Para 0004, at an action selector) and the corresponding candidate action measures (Para 0004, with a state-action value), a maximum measure (Para 0004, value that is maximum) [Para 0004, selecting, at an action selector, from among the plurality of continuous actions, a continuous action, wherein the selected continuous action is associated with a state-action value that is maximum among the plurality of state-action values];
and using the maximum measure in training of the critic network [Para 0004, selecting, at an action selector, from among the plurality of continuous actions, a continuous action, wherein the selected continuous action is associated with a state-action value that is maximum among the plurality of state-action values; Para 0007, In another aspect of the present disclosure, the respective update for parameters of each respective actor neural network of the plurality of actor neural; Para 0008, In another aspect of the present disclosure, the method includes determining, based on the batch of tuples, an update for parameters of the critic neural network].
Zhang(3) is analogous to the claimed invention as they both relate to reinforcement learning methodologies. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang, Haarnoja, Nair, Zhang(2), Vecerik, Kalashnikov, and Yan’s teachings to incorporate the teachings of Zhang(3) and provide generating a maximum measure using a critic network and CEM in order to produce high quality training data for a critic network to utilize optimal policies when learning actions for an agent.
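For illustration only, a hedged sketch of the claim-15 style target construction discussed above: the critic scores the actor's action and each CEM-sampled candidate, the maximum measure is taken, and that maximum is used in the critic's training target. The bootstrap form r + gamma * max is an assumption of the sketch.

```python
import torch

def max_measure_target(actor, critic, state, cem_candidates, reward, gamma=0.99):
    """Score the actor action and CEM candidates with the critic; use the maximum
    measure as the bootstrap value for training the critic."""
    mean, _ = actor(state)                                    # state: (1, state_dim)
    actor_measure = critic(state, mean).squeeze(-1)
    n = cem_candidates.shape[0]
    candidate_measures = critic(state.expand(n, -1), cem_candidates).squeeze(-1)
    max_measure = torch.max(actor_measure.max(), candidate_measures.max())
    return reward + gamma * max_measure
```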
Regarding claim 16, Zhang-Haarnoja-Nair-Zhang(2)-Kalashnikov-Vecerik-Yan teach the limitations of claim 1, including sampling using CEM (see claim 1).
Zhang-Haarnoja-Nair-Zhang(2)-Kalashnikov-Vecerik-Yan do not teach processing state data of the instance, using the actor network, to generate actor network output; selecting an actor action based on the actor network output; using the actor action, as an initial mean for CEM in sampling candidate actions; processing the state data and each of the candidate actions, using the critic network, to generate a corresponding candidate action measure for each of the candidate actions; determining, from amongst the corresponding candidate action measures, a maximum measure; and using the maximum measure in training of the critic network.
Zhang(3) teaches,
processing state data of the instance, using the actor network, to generate actor network output [Para 0004, at each actor neural network among a plurality of actor neural networks, receiving a current state of the environment for a time step and outputting a continuous action for the current state based on a deterministic policy approximated by the actor neural network, thereby outputting a plurality of continuous actions];
selecting an actor action based on the actor network output [Para 0004, receiving… the continuous action output by each respective actor neural network];
processing the state data (Para 0004, the current state of the environment) and each of multiple candidate actions (Para 0004, continuous action output by each respective actor) using the critic network (Para 0004, by the critic neural network), to generate (Para 0004, outputting) a corresponding candidate action measure for each of the candidate actions (Para 0004, a state-action value for… the respective continuous action) [Para 0004, at a critic neural network, receiving the current state of the environment and the continuous action output by each respective actor neural network and outputting a state-action value for the state and the respective continuous action based on a state-action value function approximated by the critic neural network, thereby outputting a plurality of states action values, each state-action value, among the plurality of state-action values, associated with a continuous action among the plurality of continuous actions];
determining (Para 0004, selecting), from amongst the corresponding candidate action measures (Para 0004, with a state-action value), a maximum measure (Para 0004, value that is maximum) [Para 0004, selecting, at an action selector, from among the plurality of continuous actions, a continuous action, wherein the selected continuous action is associated with a state-action value that is maximum among the plurality of state-action values];
and using the maximum measure in training of the critic network [Para 0004, selecting, at an action selector, from among the plurality of continuous actions, a continuous action, wherein the selected continuous action is associated with a state-action value that is maximum among the plurality of state-action values; Para 0007, In another aspect of the present disclosure, the respective update for parameters of each respective actor neural network of the plurality of actor neural; Para 0008, In another aspect of the present disclosure, the method includes determining, based on the batch of tuples, an update for parameters of the critic neural network].
Zhang(3) is analogous to the claimed invention as they both relate to reinforcement learning methodologies. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang, Haarnoja, Nair, Zhang(2), Vecerik, Kalashnikov, and Yan’s teachings to incorporate the teachings of Zhang(3) and provide generating a maximum measure using a critic network and CEM in order to produce high quality training data for a critic network to utilize optimal policies when learning actions for an agent.
Haarnoja further teaches,
using the actor action (Sect 5, pg. 7, para 4, the policy), as an initial mean (Sect 5, pg. 7, para 4, with respect to) for CEM (Sect 5, pg. 7, para 4, maximum entropy objective) in sampling candidate actions (Sect 5, pg. 7, para 4, optimal dual variable) [Sect 5, pg. 7, para 4, This dual objective is closely related to the maximum entropy objective with respect to the policy, and the optimal policy is the maximum entropy policy corresponding to temperature… We can solve for the optimal dual variable];
Haarnoja is analogous to the claimed invention as they both relate to reinforcement learning methodologies. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang, Nair, Zhang(2), Vecerik, Kalashnikov, Yan, and Zhang(3)’s teachings to incorporate the teachings of Haarnoja and provide using the actor action as an initial mean for CEM in sampling candidate actions in order to select high quality actions that enhance policies.
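For illustration only, a sketch of the claim-16 style arrangement discussed above: the actor's output is used as the initial mean (and, here, the initial scale) of the CEM sampling distribution, which is then refined by scoring candidates with the critic. Iteration and population counts are illustrative assumptions.

```python
import torch

def cem_with_actor_init(actor, critic, state, iters=3, pop=64, elite=6):
    """CEM over actions initialized from the actor's Gaussian output."""
    mean, std = actor(state)                                  # state: (1, state_dim)
    mean, std = mean.squeeze(0), std.squeeze(0)
    for _ in range(iters):
        candidates = mean + std * torch.randn(pop, mean.shape[-1])
        q = critic(state.expand(pop, -1), candidates).squeeze(-1)
        elites = candidates[q.topk(elite).indices]
        mean, std = elites.mean(dim=0), elites.std(dim=0) + 1e-6
    return mean
```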
Claim(s) 17 is rejected under 35 U.S.C. 103 as being unpatentable over Zhang in view of Haarnoja, Nair, Zhang(2), Kalashnikov, Vecerik, and Yan, and in further view of Kumar Karn et al. (US 20220293267 A1, see attached document), hereinafter Kumar Karn.
Regarding claim 17, Zhang-Haarnoja-Nair-Zhang(2)-Kalashnikov-Vecerik-Yan teach the limitations of claim 1.
Zhang further teaches,
subsequent to the further training: using the actor network in autonomous control of a robot [Para 0071, The DDPG algorithm and the improved DDPG algorithm based on the offline model were used to train for 4000 rounds respectively, and the relationship between the feedback reward value of the robot, i.e. the 2D dummy, reaching the end point from the starting point and the number of training rounds was analyzed.].
Zhang-Haarnoja-Nair-Zhang(2)-Kalashnikov-Vecerik-Yan do not teach the actor network being independent of the critic network.
Kumar Karn teaches,
the actor network being independent of the critic network [Para 0046, the actor critic technique is a temporal difference (TD) method that has a separate memory structure (i.e., actor) to explicitly represent the policy independent of the value function (i.e., critic)].
Kumar Karn is analogous to the claimed invention as they both relate to the utilization of actor-critic methodologies. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang, Haarnoja, Nair, Zhang(2), Vecerik, Kalashnikov, and Yan’s teachings to incorporate the teachings of Kumar Karn and provide the actor network being independent of the critic network in order to employ diversified datasets by separating exploration and policy methodologies.
Claim(s) 20 is rejected under 35 U.S.C. 103 as being unpatentable over Zhang in view of Haarnoja, Nair, Zhang(2), Kalashnikov, and Vecerik, and in further view of Zhang(3).
Regarding Claim 20, Zhang teaches,
A method implemented by one or more processors [Fig. 1], the method comprising: pre-training (Para 0133, pre-trains) an actor network and a critic network (Para 0133, the DDPG network offline; Para 0078, Generally speaking, the DDPG network application is based on the Actor-Critic method) using reinforcement learning (Para 0133, reinforcement learning) and robotic demonstration data (Para 0071, number of training rounds) from demonstrated robotic episodes (Para 0071, the 2D dummy, reaching the end point from the starting point) [Para 0070, Step 1: Collect the training data of 2D dummies in an offline environment and preprocess the training data to obtain a training data set; Para 0071, The DDPG algorithm and the improved DDPG algorithm based on the offline model were used to train for 4000 rounds respectively, and the relationship between the feedback reward value of the robot, i.e. the 2D dummy, reaching the end point from the starting point and the number of training rounds was analyzed; Para 0133, The present invention firstly utilizes a large amount of offline data to train the object state model and reward model, and then pre-trains the DDPG network offline through a model-based reinforcement learning method to improve the decision-making ability of the network offline, thereby accelerating the subsequent online learning efficiency and performance; Para 0078, Generally speaking, the DDPG network application is based on the Actor-Critic method, so it has a policy neural network and a value-based neural network. It includes a policy network for generating actions and a value network for judging the quality of actions],
Zhang does not teach wherein the actor network is a first neural network model that represents a policy, wherein the critic network is a second neural network model that represents a Q-function, and wherein pre-training the actor network and the critic network comprises: pre-training the actor network using an advantage-weighted regression training objective, the advantage-weighted regression training objective utilizing values generated using the second neural network model, pre-training the critic network based on the robotic demonstration data, using Q-learning and a cross-entropy method (CEM), subsequent to pre-training the actor network and pre-training the critic network: performing online robotic episodes wherein performing a given robotic episode: further training the actor network and the critic network using reinforcement learning and episode data from robotic episodes, wherein further training the critic network based on a second set of the episode data comprises, for an instance of the second set of episode data: processing state data of the instance, using the actor network, to generate actor network output; selecting an actor action based on the actor network output; processing the state data and the actor action, using the critic network, to generate an actor action measure for the actor action; processing the state data and each of multiple candidate actions sampled using CEM, using the critic network, to generate a corresponding candidate action measure for each of the candidate actions; determining, from amongst the actor action measure and the corresponding candidate action measures, a maximum measure; and using the maximum measure in training of the critic network.
Haarnoja teaches,
wherein the actor network is a first neural network model (Sect 4.1, para 1, neural networks) that represents (Sect 4.1, para 1, given by) a policy (Sect 4.1, para 1, the policy) [Sect 4.1, para 1, the policy as a Gaussian with mean and covariance given by neural networks],
wherein the critic network is a second neural network model that represents a Q-function [Sect 4.1, para 1, the soft Q-function can be modeled as expressive neural networks],
Haarnoja is analogous to the claimed invention as they both relate to Reinforcement Learning methodologies. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang’s teachings to incorporate the teachings of Haarnoja and provide the actor network representing a policy and the critic network representing a Q-function in order to attain optimal results from the combined methodologies of policy usage and exploration.
Zhang-Haarnoja do not teach wherein pre-training the actor network and the critic network comprises: pre-training the actor network using an advantage-weighted regression training objective, the advantage-weighted regression training objective utilizing values generated using the second neural network model, pre-training the critic network based on the robotic demonstration data, using Q-learning and a cross-entropy method (CEM), subsequent to pre-training the actor network and pre-training the critic network: performing online robotic episodes wherein performing a given robotic episode: further training the actor network and the critic network using reinforcement learning and episode data from robotic episodes, wherein further training the critic network based on a second set of the episode data comprises, for an instance of the second set of episode data: processing state data of the instance, using the actor network, to generate actor network output; selecting an actor action based on the actor network output; processing the state data and the actor action, using the critic network, to generate an actor action measure for the actor action; processing the state data and each of multiple candidate actions sampled using CEM, using the critic network, to generate a corresponding candidate action measure for each of the candidate actions; determining, from amongst the actor action measure and the corresponding candidate action measures, a maximum measure; and using the maximum measure in training of the critic network.
Nair teaches,
and wherein pre-training the actor network and the critic network comprises: pre-training (Sect I, pg. 2, col 1, para 2, pre-training) the actor network (Sect IV, pg. 5, col 2, para 1, we can parameterize the actor… by neural networks) using an advantage-weighted regression training objective (Sect I, pg. 2, col 1, para 2, AWAC), the advantage-weighted regression training objective utilizing (Sect II, col 2, para 2, updated based on) values (Sect II, col 2, para 2, current estimate of Qπ) generated (Sect II, col 2, para 2, estimated) using the second neural network model (Sect II, col 2, para 2, the critic Qπ (s, a)) [Sect I, pg. 1, col 2, para 4, In this work, we study how to build RL algorithms that are effective for pre-training from off-policy datasets; Sect I, pg. 2, col 1, para 2, The contribution of this work is not just another RL algorithm, but a systematic study of what makes offline pre-training with online fine-tuning unique compared to the standard RL paradigm, which then directly motivates a simple algorithm, AWAC, to address these challenges; Sect II, col 2, para 2, During the policy evaluation phase, the critic Qπ (s, a) is estimated for the current policy π… During policy improvement, the actor π is typically updated based on the current estimate of Qπ; Sect IV, pg. 5, col 2, para 1, we can parameterize the actor and the critic by neural networks],
Nair is analogous to the claimed invention as they both relate to Reinforcement Learning methodologies. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang and Haarnoja’s teachings to incorporate the teachings of Nair and provide an advantage-weighted regression training objective [Nair, Abstract] in order to enable rapid learning of skills with a combination of prior demonstration data and online experience.
Zhang-Haarnoja-Nair do not teach pre-training the critic network based on the robotic demonstration data, using Q-learning and a cross-entropy method (CEM), subsequent to pre-training the actor network and pre-training the critic network: performing online robotic episodes wherein performing a given robotic episode: further training the actor network and the critic network using reinforcement learning and episode data from robotic episodes, wherein further training the critic network based on a second set of the episode data comprises, for an instance of the second set of episode data: processing state data of the instance, using the actor network, to generate actor network output; selecting an actor action based on the actor network output; processing the state data and the actor action, using the critic network, to generate an actor action measure for the actor action; processing the state data and each of multiple candidate actions sampled using CEM, using the critic network, to generate a corresponding candidate action measure for each of the candidate actions; determining, from amongst the actor action measure and the corresponding candidate action measures, a maximum measure; and using the maximum measure in training of the critic network.
Zhang(2) teaches,
pre-training (Sect 4, pg. 4, para 6, To pretrain) the critic network (Sect 4, pg. 4, para 6, actor-critic RL algorithms like DDPG and ACER) based on the robotic demonstration data (Sect 4, pg. 4, col 1, para 2, HalfCheetah (6D), Hopper (3D), and Walker2d (6D)) [Sect 4, pg. 4, col 1, para 2, Therefore if there exist some expert demonstrations that perform better than initial policies, we can introduce the data using constraint (3), in order to obtain a more accurate estimator Qw(s, a); Sect 5, para 4, With DDPG as baseline, we apply our algorithm to low dimensional simulation environments using the MuJoCo physics engine [Todorov et al., 2012], and test on tasks with action dimensionality are: HalfCheetah (6D), Hopper (3D), and Walker2d (6D); Sect 4, pg. 4, para 6, To pretrain actor-critic RL algorithms like DDPG and ACER, we add gradients g*_Q and g*_π to the original gradients of the algorithms; Sect 4.1, para 2, Two neural networks are used in DDPG at the same time. One is named critic network]
Zhang(2) is analogous to the claimed invention as they both relate to Reinforcement Learning methodologies. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang, Haarnoja, and Nair’s teachings to incorporate the teachings of Zhang(2) and provide pre-training the critic network based on demonstration data [Zhang(2), Abstract] in order to speed up the training process.
Zhang-Haarnoja-Nair-Zhang(2) do not teach using Q-learning and a cross-entropy method (CEM), subsequent to pre-training the actor network and pre-training the critic network: performing online robotic episodes wherein performing a given robotic episode: further training the actor network and the critic network using reinforcement learning and episode data from robotic episodes, wherein further training the critic network based on a second set of the episode data comprises, for an instance of the second set of episode data: processing state data of the instance, using the actor network, to generate actor network output; selecting an actor action based on the actor network output; processing the state data and the actor action, using the critic network, to generate an actor action measure for the actor action; processing the state data and each of multiple candidate actions sampled using CEM, using the critic network, to generate a corresponding candidate action measure for each of the candidate actions; determining, from amongst the actor action measure and the corresponding candidate action measures, a maximum measure; and using the maximum measure in training of the critic network.
Kalashnikov teaches,
using Q-learning (Sect 1, pg. 2, para 2, Q-learning) and a cross-entropy method (CEM) (Sect 1, pg. 2, para 2, QTOpt), [Sect 1, pg. 2, para 2, To make maximal use of this diverse dataset, we propose an off-policy training method based on a continuous-action generalization of Q-learning, which we call QTOpt… QT-Opt dispenses with the need to train an explicit actor; Abstract, we introduce QT-Opt, a scalable self-supervised vision-based reinforcement learning framework that can leverage over 580k real-world grasp attempts to train a deep neural network]
Kalashnikov is analogous to the claimed invention as they both relate to Reinforcement Learning methodologies. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang, Haarnoja, Nair, and Zhang(2)’s teachings to incorporate the teachings of Kalashnikov and provide using Q-learning and a CEM [Kalashnikov, sect 1, pg. 2, para 2] in order to address instability issues.
Zhang-Haarnoja-Nair-Zhang(2)-Kalashnikov do not teach subsequent to pre-training the actor network and pre-training the critic network: performing online robotic episodes wherein performing a given robotic episode: further training the actor network and the critic network using reinforcement learning and episode data from robotic episodes, wherein further training the critic network based on a second set of the episode data comprises, for an instance of the second set of episode data: processing state data of the instance, using the actor network, to generate actor network output; selecting an actor action based on the actor network output; processing the state data and the actor action, using the critic network, to generate an actor action measure for the actor action; processing the state data and each of multiple candidate actions sampled using CEM, using the critic network, to generate a corresponding candidate action measure for each of the candidate actions; determining, from amongst the actor action measure and the corresponding candidate action measures, a maximum measure; and using the maximum measure in training of the critic network.
Vecerik teaches,
subsequent to pre-training the actor network and pre-training the critic network: performing (Algorithm 1, learning) online robotic episodes (Algorithm 1, interaction with the environment)
wherein performing a given robotic episode: further training (Algorithm 1, Update) the actor network (Algorithm 1, Update the actor) and the critic network (Algorithm 1, Update the critic) using reinforcement learning (Sect 1, para 1, RL paradigms) and episode data (Algorithm 1, Learning via interaction with the environment) from robotic episodes (Sect 1, para 1, kinesthetic demonstrations to guide a deep-RL algorithm) [Sect 1, para 1, In this paper we address this challenge by combining the demonstration and RL paradigms into a single framework which uses kinesthetic demonstrations to guide a deep-RL algorithm]
[Vecerik, Algorithm 1 is reproduced as an image (media_image1.png) in the original record.]
Vecerik is analogous to the claimed invention as they both relate to Reinforcement Learning methodologies. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang, Haarnoja, Nair, Zhang(2), and Kalashnikov’s teachings to incorporate the teachings of Vecerik and provide training the actor and critic network using online episode data from robotic episodes [Vecerik, Abstract] in order to reduce common exploration issues.
Zhang-Haarnoja-Nair-Zhang(2)-Kalashnikov-Vecerik teach the limitations above, including sampling using CEM (see above).
Zhang-Haarnoja-Nair-Zhang(2)-Kalashnikov-Vecerik do not teach wherein further training the critic network based on a second set of the episode data comprises, for an instance of the second set of episode data: processing state data of the instance, using the actor network, to generate actor network output; selecting an actor action based on the actor network output; processing the state data and the actor action, using the critic network, to generate an actor action measure for the actor action; processing the state data and each of multiple candidate actions using the critic network, to generate a corresponding candidate action measure for each of the candidate actions; determining, from amongst the actor action measure and the corresponding candidate action measures, a maximum measure; and using the maximum measure in training of the critic network.
Zhang(3) teaches
wherein further training the critic network based on a second set of the episode data comprises, for an instance of the second set of episode data: processing state data of the instance, using the actor network, to generate actor network output [Para 0004, at each actor neural network among a plurality of actor neural networks, receiving a current state of the environment for a time step and outputting a continuous action for the current state based on a deterministic policy approximated by the actor neural network, thereby outputting a plurality of continuous actions];
selecting an actor action based on the actor network output [Para 0004, receiving… the continuous action output by each respective actor neural network];
processing the state data (Para 0004, current state of the environment) and the actor action (Para 0004, continuous action), using the critic network (Para 0004, approximated by the critic neural network), to generate (Para 0004, outputting) an actor action measure for the actor action (Para 0004, a state-action value for… the respective continuous action) [Para 0004, at a critic neural network, receiving the current state of the environment and the continuous action output by each respective actor neural network and outputting a state-action value for the state and the respective continuous action based on a state-action value function approximated by the critic neural network, thereby outputting a plurality of states action values, each state-action value, among the plurality of state-action values, associated with a continuous action among the plurality of continuous actions];
processing the state data (Para 0004, the current state of the environment) and each of multiple candidate actions (Para 0004, continuous action output by each respective actor) using the critic network (Para 0004, by the critic neural network), to generate (Para 0004, outputting) a corresponding candidate action measure for each of the candidate actions (Para 0004, a state-action value for… the respective continuous action) [Para 0004, at a critic neural network, receiving the current state of the environment and the continuous action output by each respective actor neural network and outputting a state-action value for the state and the respective continuous action based on a state-action value function approximated by the critic neural network, thereby outputting a plurality of states action values, each state-action value, among the plurality of state-action values, associated with a continuous action among the plurality of continuous actions];
determining (Para 0004, selecting), from amongst the actor action measure (Para 0004, at an action selector) and the corresponding candidate action measures (Para 0004, with a state-action value), a maximum measure (Para 0004, value that is maximum) [Para 0004, selecting, at an action selector, from among the plurality of continuous actions, a continuous action, wherein the selected continuous action is associated with a state-action value that is maximum among the plurality of state-action values];
and using the maximum measure in training of the critic network [Para 0004, selecting, at an action selector, from among the plurality of continuous actions, a continuous action, wherein the selected continuous action is associated with a state-action value that is maximum among the plurality of state-action values; Para 0007, In another aspect of the present disclosure, the respective update for parameters of each respective actor neural network of the plurality of actor neural; Para 0008, In another aspect of the present disclosure, the method includes determining, based on the batch of tuples, an update for parameters of the critic neural network].
Zhang(3) is analogous to the claimed invention as they both relate to reinforcement learning methodologies. Therefore, it would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang, Haarnoja, Nair, Zhang(2), Kalashnikov, and Vecerik’s teachings to incorporate the teachings of Zhang(3) and provide generating a maximum measure using a critic network and CEM in order to produce high quality training data for a critic network to utilize optimal policies when learning actions for an agent.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SYED RAYHAN AHMED whose telephone number is (571)270-0286. The examiner can normally be reached Mon-Fri ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, David Yi can be reached at (571) 270-7519. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/SYED RAYHAN AHMED/Examiner, Art Unit 2126
/VAN C MANG/Primary Examiner, Art Unit 2126