Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Regarding the rejection of claims under 35 U.S.C. 103, Applicant's arguments are directed to amendments presenting claim limitations that have not been previously examined, for which new grounds of rejection are set forth below.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-15 are rejected under 35 U.S.C. 103 over Kalakrishnan et al., US Pre-Grant Publication No. 2022/0105624 (hereafter Kalakrishnan) in view of Pascanu et al., US Pre-Grant Publication No. 2020/0090048 (hereafter Pascanu).
Regarding claim 1 and analogous claims 6 and 11:
Kalakrishnan teaches:
“A computer program product for learning a policy model, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computer to cause the computer to”: Kalakrishnan, paragraph 0155, “In addition, some implementations include one or more processors (e.g., central processing unit(s) (CPU(s)), graphics processing unit(s) (GPU(s)), and/or tensor processing unit(s) (TPU(s))) of one or more computing devices, where the one or more processors are operable to execute instructions stored in associated memory, and where the instructions are configured to cause performance of any of the methods described herein. Some implementations also include one or more non-transitory computer readable storage media storing computer instructions executable by one or more processors to perform any of the methods described herein [a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computer to cause the computer].”
“approximate an expert policy for imitation learning to obtain a determined imitation learning expert policy”: Kalakrishnan, paragraph 0003, “In some implementations, the meta-learning model can include a trial policy used to gather information about a new task through imitation learning. One or more demonstrations of the new task [an expert policy] can be used to generate the trial policy for the new task [approximate an expert policy for imitation learning to obtain a determined imitation learning expert policy]. This trial policy can then be used to generate one or more trial and error attempts of performing the task. In some implementations, the trial policy can be used to constrain potential actions made by the agent when attempting to perform the task in the trial and error attempts. Additionally or alternatively, the meta-learning model can include an adapted trial policy which can be used to extract and integrate information from the trial(s) with the demonstration(s) to learn how to perform the new task. The adapted trial policy can be trained using reinforcement learning.”
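As an illustrative aside (not code from either cited disclosure): approximating an expert policy from demonstrations, as quoted above, corresponds to behavior cloning, i.e., a supervised fit to demonstration data. The linear policy form, array shapes, and function name below are assumptions made only for this sketch.

```python
import numpy as np

def behavior_cloning(demo_states, demo_actions, ridge=1e-3):
    """Fit a linear policy a = s @ W to expert demonstrations by
    ridge-regularized least squares -- a toy stand-in for the
    imitation-learning step that yields the determined expert policy."""
    S = np.asarray(demo_states)
    A = np.asarray(demo_actions)
    # Solve (S^T S + ridge*I) W = S^T A for the policy weights W.
    W = np.linalg.solve(S.T @ S + ridge * np.eye(S.shape[1]), S.T @ A)
    return lambda s: np.asarray(s) @ W

rng = np.random.default_rng(0)
states = rng.normal(size=(100, 4))
expert_W = rng.normal(size=(4, 2))
actions = states @ expert_W              # noiseless expert demonstrations
policy = behavior_cloning(states, actions)
```

On this toy data the cloned policy reproduces the expert's actions almost exactly; with a neural-network policy, the least-squares solve would be replaced by gradient descent on the same supervised objective.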
“iteratively train a policy model neural network using the determined imitation learning expert policy to perform batch mode reinforcement learning that is based on the determined imitation learning expert policy by”: Kalakrishnan, paragraphs 0039-0040: “To leverage both demonstration and trial episode information, a Q-value function Q(s, a; ΘQ) can be maintained that is decomposed into the value and normalized advantage function, as in equation (2). Note that, to incorporate both demonstrations from an expert and trials taken by the agent, an imitation learning objective is not necessarily needed since reinforcement learning [reinforcement learning] objectives can also learn from successful trajectories. Thus, in some implementations, the Bellman error LRL can be used in the inner adaptation step. The adapted Q-value function Q(s, a; ΦQ') can be obtained by taking gradient steps with respect to LRL evaluated on a batch [batch mode] of demonstration and trial episodes {τ1, . . . , τK} corresponding to task Τi:
[Equation image: media_image1.png]
where the first episode τ1 is a demonstration and where k ∈ {1, . . . , K} are the trials taken by the agent [iteratively train a policy model neural network using the determined imitation learning expert policy].”
(bold only) “determining a penalty value based on a product of a logarithm of an output of the policy model neural network using the determined imitation learning expert policy for an action and a state at a time, and a hyperparameter for regularization”: Kalakrishnan, paragraph 0039, “The adapted Q-value function Q(s, a; ΦQ') can be obtained by taking gradient steps with respect to LRL evaluated on a batch of demonstration and trial episodes {τ1, . . . , τK} corresponding to task Τi:
[Equation image: media_image1.png]
where the first episode τ1 is a demonstration and where k ∈ {1, . . . , K} are the trials taken by the agent [determining … an output of the policy model neural network using the determined imitation learning expert policy for an action and a state at a time].”
“modifying the policy model neural network at the iteration to decrease a difference between an output of the policy model neural network and the target signal”: Kalakrishnan, paragraph 0036, “This Q-function can be learned through standard Q-learning techniques, for example by minimizing [modifying the policy model neural network at the iteration to decrease] Bellman error:
[Equation image: media_image2.png]
[showing a difference calculation between Q(st, at; ΦQ' ) (an output of the policy model neural network) and rt + γV(st+1;Φ t) (target signal)]”
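For illustration only (a toy tabular sketch, not code from Kalakrishnan): minimizing the Bellman error nudges Q(st, at) toward the target signal rt + γV(st+1), decreasing the difference between the network output and the target at each iteration. The grid size, learning rate, and transition values below are assumptions.

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, gamma=0.99, lr=0.1):
    """One gradient step on the squared Bellman error
    (Q(s,a) - (r + gamma * max_a' Q(s',a')))**2 for a tabular Q."""
    target = r + gamma * Q[s_next].max()   # target signal r_t + gamma * V(s_{t+1})
    td_error = Q[s, a] - target            # difference between output and target
    Q[s, a] -= lr * td_error               # modify Q to decrease that difference
    return td_error

Q = np.zeros((3, 2))                       # 3 states, 2 actions
err0 = q_learning_step(Q, s=0, a=1, r=1.0, s_next=2)
err1 = q_learning_step(Q, s=0, a=1, r=1.0, s_next=2)
```

Repeating the update shrinks the error toward zero, which is the sense in which the model is modified at each iteration to decrease the difference between its output and the target.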
Kalakrishnan does not explicitly teach:
(bold only) “determining a penalty value based on a product of a logarithm of an output of the policy model neural network using the determined imitation learning expert policy for an action and a state at a time, and a hyperparameter for regularization”
“determining a soft Bellman backup value that utilizes a reward and the product of a proportion of a discount factor and the hyperparameter, and a logarithm of the hyperparameter and an estimate of an action value for data at the time at an iteration”
“generating a target signal by combining the penalty value and - soft Bellman backup value”
Pascanu teaches:
(bold only) “determining a penalty value based on a product of a logarithm of an output of the policy model neural network using the determined imitation learning expert policy for an action and a state at a time, and a hyperparameter for regularization”: Pascanu, paragraph 0060, “The Bellman updates are softened in the sense that the usual max operator over actions for the state values Vi is replaced by a soft-max at inverse temperature, which hardens into a max operator as ß → ∞. The optimal policy πi is then a Boltzmann policy at inverse temperature ß [showing that ß is a temperature value used in training, hence a kind of hyperparameter for regularization]”; Pascanu, paragraphs 0055-0056, “The task policies and the multitask policy are generated together in a coordinated training process by optimizing an objective function which comprises a term indicative of expected returns and one or more regularization terms which provide policy regularization [a penalty value]. A first regularization term ensures that each task policy πi is regularized towards the multitask policy, and may be defined using discounted KL divergences
[Equation image: media_image3.png]
An additional regularization term is based on discounted entropy to further encourage exploration. Specifically, the objective function to be maximized is:
[Equation image: media_image4.png]
[includes hyperparameter for regularization ß] where cKL and cEnt are scalar factors greater than zero which determine the strengths of the KL and entropy regularizations, α = cKL/(cKL+cEnt) and ß = 1/(cKL+cEnt). The log π0 (at | st) term [a logarithm of an output of the policy model neural network using the … policy for an action and a state at a time] can be thought of as a reward shaping term which encourages actions which have high probability under the multitask policy, while the entropy term -log πt (at | st) encourages exploration. In the above we used the same regularization costs cKL and cEnt for all tasks. However, it is straightforward to generalize this to task-specific costs; this can be important if tasks differ substantially in their reward scales and amounts of exploration needed, although it does introduce additional hyper-parameters.”
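As a sketch of the quoted objective (illustrative only; the function name and probability values are assumptions, and the formula follows the α and ß definitions quoted above): the KL term contributes a reward-shaping penalty proportional to log π0(at | st), while the entropy term contributes -log πi(at | st).

```python
import math

def regularized_reward(r, pi0_prob, pi_i_prob, c_kl=0.5, c_ent=0.5):
    """Per-step regularized reward in the style of the quoted objective:
    r + (alpha/beta) * log(pi0(a|s)) - (1/beta) * log(pi_i(a|s)),
    with alpha = c_kl/(c_kl + c_ent) and beta = 1/(c_kl + c_ent)."""
    alpha = c_kl / (c_kl + c_ent)
    beta = 1.0 / (c_kl + c_ent)
    kl_shaping = (alpha / beta) * math.log(pi0_prob)     # log pi0 times a hyperparameter
    entropy_bonus = -(1.0 / beta) * math.log(pi_i_prob)  # exploration term
    return r + kl_shaping + entropy_bonus

r_shaped = regularized_reward(1.0, pi0_prob=0.9, pi_i_prob=0.5)
```

Actions likely under the multitask policy (large π0) are penalized less, matching the "reward shaping" reading in the quoted passage.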
“determining a soft Bellman backup value that utilizes a reward and the product of a proportion of a discount factor and the hyperparameter, and a logarithm of the hyperparameter and an estimate of an action value for data at the time at an iteration”: Pascanu, paragraph 0053, “For simplicity we assume that each task has an infinite horizon, and each has the same discount factor γ [showing that γ is a discount factor]”; Pascanu, paragraph 0059, “In step 401, πi is modified with π0 fixed. With π0 fixed, (1) decomposes into separate maximization problems for each task, and is an entropy regularized expected return with a redefined (regularized) reward
[Equation image: media_image5.png]
[that utilizes a reward]. It can be optimized using soft Q learning (also known as G learning) which is based on deriving the following ‘softened’ Bellman updates [determining a soft Bellman backup value] for the state and action values (see for example, J. Schulman, P. Abbeel, and X. Chen. Equivalence between policy gradients and soft Q-Learning, arXiv: 1704.06440, 2017):
[Equation image: media_image6.png]
[showing that in (2), the ratio 1/ß is included in Vi, which in (3) is a part of a product which includes γ. Hence, the term in (3) includes γ/ß, a proportion of a discount factor and the hyperparameter. Further, (2) includes ß and π0α(at | st), inside a summation which is itself inside “log,” hence, a logarithm of the hyperparameter and an estimate of an action value for data at the time at an iteration. Further, (3) includes reward function R].”
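The softened backup described in the quoted passage can be sketched numerically (a toy example; the Q values and probabilities are made up, and the log-sum-exp form follows the reading of formula (2) given above).

```python
import numpy as np

def soft_value(q_values, pi0, alpha=0.5, beta=1.0):
    """Soft Bellman backup in the style described above:
    V(s) = (1/beta) * log sum_a pi0(a|s)**alpha * exp(beta * Q(s,a)),
    which hardens into max_a Q(s,a) as beta -> infinity."""
    q = np.asarray(q_values)
    p = np.asarray(pi0)
    return float(np.log(np.sum(p**alpha * np.exp(beta * q))) / beta)

q = [1.0, 2.0, 0.5]
p = [0.2, 0.5, 0.3]
v_soft = soft_value(q, p, beta=1.0)     # soft-max over actions
v_hard = soft_value(q, p, beta=200.0)   # approaches max(q) = 2.0
```

As ß grows, the softened value approaches the hard max over actions, matching the quoted "hardens into a max operator" behavior; the 1/ß factor and the logarithm over a π0-weighted sum are the features the rejection maps to the claimed terms.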
“generating a target signal by combining the penalty value and - soft Bellman backup value; and”: Pascanu, paragraph 0055-0056, “The task policies and the multitask policy are generated together in a coordinated training process by optimizing an objective function which comprises a term indicative of expected returns and one or more regularization terms which provide policy regularization [the penalty value ]. A first regularization term ensures that each task policy it, is regularized towards the multitask policy, and may be defined using discounted KL divergences
[Equation image: media_image3.png]
An additional regularization term is based on discounted entropy to further encourage exploration. Specifically, the objective function to be maximized is:
[Equation image: media_image4.png]
[showing that formula (1) includes the regularization term acting as a penalty value]”; Pascanu, paragraph 0059, “In step 401, πi is modified with π0 fixed. With π0 fixed, (1) decomposes into separate maximization problems for each task, and is an entropy regularized expected return with a redefined (regularized) reward
[Equation image: media_image5.png]
It can be optimized using soft Q learning (also known as G learning) which is based on deriving the following ‘softened’ Bellman updates [soft Bellman backup value] for the state and action values (see for example, J. Schulman, P. Abbeel, and X. Chen. Equivalence between policy gradients and soft Q-Learning, arXiv: 1704.06440, 2017):
[Equation image: media_image6.png]
[hence formula (2) for the target value Vi includes the soft Bellman backup value, and optimizing it incorporates the penalty value from formula (1)].
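For illustration (a toy combination written for this discussion, not an equation taken from either reference): a target signal built by combining a log-probability penalty value with a discounted soft Bellman backup value might look like the following.

```python
import math

def target_signal(reward, action_prob, soft_backup, beta=1.0, gamma=0.99):
    """Toy regression target: reward plus a penalty value
    (hyperparameter beta times the log of the policy's action probability)
    plus the discounted soft Bellman backup value."""
    penalty = beta * math.log(action_prob)   # penalty value
    return reward + penalty + gamma * soft_backup

y = target_signal(reward=1.0, action_prob=0.8, soft_backup=0.5)
```

A Q-network would then be regressed toward y, as in the Bellman-error minimization quoted from Kalakrishnan.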
Pascanu and Kalakrishnan are analogous arts, as both relate to reinforcement learning methods. It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have combined the soft Bellman updates, temperature hyperparameter, and logarithmic regularization terms used for training in Pascanu with the teachings of Kalakrishnan to arrive at the present invention, in order to improve model training, as stated in Pascanu, paragraph 0026, “The subject matter described in this specification can be implemented in particular embodiments so as to realize one or more of the following advantages. The methods can be used to more efficiently train a neural network.”
Regarding claim 2 and analogous claims 7 and 12:
Kalakrishnan as modified by Pascanu teaches the computer program product of claim 1.
Kalakrishnan further teaches “wherein the program instructions further cause the computer to obtain the determined imitation learning expert policy using an expert demonstration”: Kalakrishnan, paragraph 0003, “In some implementations, the meta-learning model can include a trial policy used to gather information about a new task through imitation learning. One or more demonstrations of the new task can be used to generate the trial policy for the new task [obtain the determined imitation learning expert policy using an expert demonstration].”
Regarding claim 3 and analogous claims 8 and 13:
Kalakrishnan as modified by Pascanu teaches the computer program product of claim 2.
Kalakrishnan further teaches “wherein the program instructions further cause the computer to use a supervised machine learning process to obtain the determined imitation learning expert policy”: Kalakrishnan, paragraph 0037, “In some implementations, a distribution of tasks p(Τ) can be assumed, from which the meta-training tasks {Τi} and held-out meta-test {Τi} tasks are drawn. During meta-training, supervision in the form of expert demonstration trajectories {Τi+} and a binary reward function ri, that can be queried for each of the meta-training tasks Τi can be used [use a supervised machine learning process to obtain the determined imitation learning expert policy].”
Regarding claim 4 and analogous claims 9 and 14:
Kalakrishnan as modified by Pascanu teaches the computer program product of claim 2.
Kalakrishnan further teaches “further comprising generating the expert demonstration from a pre-trained model”: Kalakrishnan, paragraph 0004, “One or more trials of the robot attempting to perform the new task can be generated using the trial policy for the new task. The one or more trials can be used to train the adapted policy using reinforcement learning to perform the new task. Since the trial policy and the adapted trial policy share parameters, training the adapted policy via reinforcement learning will also update one or more portions of the trial policy. This updated trial policy can then be used to generate additional trial(s) of the new task [generating the expert demonstration from a pre-trained model], which in turn can be used to further train the adapted policy network. In other words, trials for the new task may be continuously generated based on the current trial policy parameters when training the meta-learning model.”
Regarding claim 5 and analogous claims 10 and 15:
Kalakrishnan as modified by Pascanu teaches the computer program product of claim 1.
Kalakrishnan further teaches “wherein the program instructions further cause the computer to perform fitted Q iteration with the determined imitation learning expert policy, where Q is a function that is defined on a state-action space”: Kalakrishnan, paragraphs 0039-0040: “To leverage both demonstration and trial episode information, a Q-value function Q(s, a; ΘQ) can be maintained that is decomposed into the value and normalized advantage function, as in equation (2). Note that, to incorporate both demonstrations from an expert and trials taken by the agent, an imitation learning objective is not necessarily needed since reinforcement learning objectives can also learn from successful trajectories. Thus, in some implementations, the Bellman error LRL can be used in the inner adaptation step. The adapted Q-value function Q(s, a; ΦQ') [Q is a function that is defined on a state-action space] can be obtained by taking gradient steps with respect to LRL evaluated on a batch of demonstration and trial episodes {τ1, . . . , τK} corresponding to task Τi:
[Equation image: media_image1.png]
where the first episode τ1 is a demonstration and where k ∈ {1, . . . , K} are the trials taken by the agent [perform fitted Q iteration with the determined imitation learning expert policy].”
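Fitted Q iteration, as recited in the claim, repeatedly fits a Q-function to Bellman targets computed over a fixed batch. A minimal tabular sketch (the two-state chain and all values below are assumptions for illustration):

```python
import numpy as np

def fitted_q_iteration(batch, n_states, n_actions, gamma=0.9, iters=50):
    """Fitted Q iteration over a fixed batch of (s, a, r, s') tuples:
    repeatedly regress Q onto the targets r + gamma * max_a' Q(s', a').
    Here the 'regressor' is an exact tabular fit for illustration."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        Q_new = Q.copy()
        for s, a, r, s_next in batch:
            Q_new[s, a] = r + gamma * Q[s_next].max()   # Bellman target
        Q = Q_new
    return Q

# Two-state chain: action 1 in state 0 reaches the rewarding state 1.
batch = [(0, 0, 0.0, 0), (0, 1, 0.0, 1), (1, 0, 1.0, 1)]
Q = fitted_q_iteration(batch, n_states=2, n_actions=2)
```

With a neural-network Q, the exact tabular assignment would be replaced by a supervised regression onto the same targets at each iteration.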
Claims 16-25 are rejected under 35 U.S.C. 103 over Kalakrishnan in view of Pascanu and Chao et al., US Pre-Grant Publication No. 2021/0081752 (hereafter Chao).
Regarding claim 16:
Kalakrishnan teaches:
“a hardware processor; and a memory that stores a computer program product, which, when executed by the hardware processor, causes the hardware processor to”: Kalakrishnan, paragraph 0155, “In addition, some implementations include one or more processors (e.g., central processing unit(s) (CPU(s)), graphics processing unit(s) (GPU(s)), and/or tensor processing unit(s) (TPU(s))) [a hardware processor] of one or more computing devices, where the one or more processors are operable to execute instructions stored in associated memory, and where the instructions are configured to cause performance of any of the methods described herein. Some implementations also include one or more non-transitory computer readable storage media storing computer instructions executable by one or more processors to perform any of the methods described herein [and a memory that stores a computer program product, which, when executed by the hardware processor, causes the hardware processor].”
“approximate an expert policy for imitation learning to obtain a determined imitation learning expert policy”: Kalakrishnan, paragraph 0003, “In some implementations, the meta-learning model can include a trial policy used to gather information about a new task through imitation learning. One or more demonstrations of the new task [an expert policy] can be used to generate the trial policy for the new task [approximate an expert policy for imitation learning to obtain a determined imitation learning expert policy]. This trial policy can then be used to generate one or more trial and error attempts of performing the task. In some implementations, the trial policy can be used to constrain potential actions made by the agent when attempting to perform the task in the trial and error attempts. Additionally or alternatively, the meta-learning model can include an adapted trial policy which can be used to extract and integrate information from the trial(s) with the demonstration(s) to learn how to perform the new task. The adapted trial policy can be trained using reinforcement learning.”
“iteratively train a policy model neural network using the determined imitation learning expert policy to perform batch mode reinforcement learning that is based on the determined imitation learning expert policy by”: Kalakrishnan, paragraphs 0039-0040: “To leverage both demonstration and trial episode information, a Q-value function Q(s, a; ΘQ) can be maintained that is decomposed into the value and normalized advantage function, as in equation (2). Note that, to incorporate both demonstrations from an expert and trials taken by the agent, an imitation learning objective is not necessarily needed since reinforcement learning [reinforcement learning] objectives can also learn from successful trajectories. Thus, in some implementations, the Bellman error LRL can be used in the inner adaptation step. The adapted Q-value function Q(s, a; ΦQ') can be obtained by taking gradient steps with respect to LRL evaluated on a batch [batch mode] of demonstration and trial episodes {τ1, . . . , τK} corresponding to task Τi:
[Equation image: media_image1.png]
where the first episode τ1 is a demonstration and where k ∈ {1, . . . , K} are the trials taken by the agent [iteratively train a policy model neural network using the determined imitation learning expert policy].”
(bold only) “determining a penalty value based on a product of a logarithm of an output of the policy model neural network using the determined imitation learning expert policy for an action and a state at a time, and a hyperparameter for regularization”: Kalakrishnan, paragraph 0039, “The adapted Q-value function Q(s, a; ΦQ') can be obtained by taking gradient steps with respect to LRL evaluated on a batch of demonstration and trial episodes {τ1, . . . , τK} corresponding to task Τi:
[Equation image: media_image1.png]
where the first episode τ1 is a demonstration and where k ∈ {1, . . . , K} are the trials taken by the agent [determining … an output of the policy model neural network using the determined imitation learning expert policy for an action and a state at a time].”
“modifying the policy model neural network at the iteration to decrease a difference between an output of the policy model neural network and the target signal”: Kalakrishnan, paragraph 0036, “This Q-function can be learned through standard Q-learning techniques, for example by minimizing [modifying the policy model neural network at the iteration to decrease] Bellman error:
[Equation image: media_image2.png]
[showing a difference calculation between Q(st, at; ΦQ' ) (an output of the policy model neural network) and rt + γV(st+1;Φ t) (target signal)]”
Kalakrishnan does not explicitly teach:
“An autonomous vehicle training system, comprising:”
“a network interface that communicates with an autonomous vehicle”
“and transmit parameters of the trained policy model neural network to the autonomous vehicle”
(bold only) “determining a penalty value based on a product of a logarithm of an output of the policy model neural network using the determined imitation learning expert policy for an action and a state at a time, and a hyperparameter for regularization”
“determining a soft Bellman backup value that utilizes a reward and the product of a proportion of a discount factor and the hyperparameter, and a logarithm of the hyperparameter and an estimate of an action value for data at the time at an iteration”
“generating a target signal by combining the penalty value and - soft Bellman backup value”
Chao teaches:
“An autonomous vehicle training system, comprising:”: Chao, paragraph 0058, “In at least one embodiment, a robotic device learns to perform a new task based on a video demonstration. The system is capable, in some cases, of learning the new task in as few as one demonstration, even if the demonstration includes steps that are accidental or non-essential to the goal of the demonstration. For example, in at least one embodiment, a robot is trained to perform a task based on a demonstration of the task by a human user [training system]”; Chao, paragraph 0144, “FIG. 11A illustrates an example of an autonomous vehicle 1100 [autonomous vehicle], according to at least one embodiment. In at least one embodiment, autonomous vehicle 1100 (alternatively referred to herein as ‘vehicle 1100’) may be, without limitation, a passenger vehicle, such as a car, a truck, a bus, and/or another type of vehicle that accommodates one or more passengers. In at least one embodiment, vehicle 1100 may be a semi-tractor-trailer truck used for hauling cargo. In at least one embodiment, vehicle 1100 may be an airplane, robotic vehicle, or other kind of vehicle.”
“a network interface that communicates with an autonomous vehicle”: Chao, paragraph 0345, “In at least one embodiment, an I/O switch 2016 can be used to provide an interface mechanism to enable connections between I/O hub 2007 and other components, such as a network adapter 2018 and/or a wireless network adapter 2019 that may be integrated into platform [a network interface that communicates with an autonomous vehicle], and various other devices that can be added via one or more add-in device(s) 2020. In at least one embodiment, network adapter 2018 can be an Ethernet adapter or another wired network adapter. In at least one embodiment, wireless network adapter 2019 can include one or more of a Wi-Fi, Bluetooth, near field communication (NFC), or other network device that includes one or more wireless radios.”
“and transmit parameters of the trained policy model neural network to the autonomous vehicle”: Chao, paragraph 0152, “Inference and/or training logic 815 are used to perform inferencing and/or training operations associated with one or more embodiments. Details regarding inference and/or training logic 815 are provided herein in conjunction with FIGS. 8A and/or 8B. In at least one embodiment, inference and/or training logic 815 may be used in system FIG. 11A for inferencing or predicting operations based, at least in part, on weight parameters calculated using neural network training operations, neural network functions and/or architectures, or neural network use cases described herein”; Chao, paragraph 0148, “For instance, in at least one embodiment, controller(s) 1136 may send signals to operate vehicle brakes via brake actuator(s) 1148, to operate steering system 1154 via steering actuator(s) 1156, to operate propulsion system 1150 via throttle/accelerator(s) 1152 [transmit parameters of the trained policy model neural network to the autonomous vehicle]. In at least one embodiment, controller(s) 1136 may include one or more onboard (e.g., integrated) computing devices that process sensor signals, and output operation commands (e.g., signals representing commands) to enable autonomous driving and/or to assist a human driver in driving vehicle 1100.”
Chao and Kalakrishnan are both related to the same field of endeavor (imitation learning). It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have combined the autonomous vehicle application of Chao with the teachings of Kalakrishnan to arrive at the present invention, in order to improve driving safety through automation, as stated in Chao, paragraph 0158, “In at least one embodiment, front-facing cameras may be used to perform many similar ADAS functions as LIDAR, including, without limitation, emergency braking, pedestrian detection, and collision avoidance.”
Pascanu teaches:
(bold only) “determining a penalty value based on a product of a logarithm of an output of the policy model neural network using the determined imitation learning expert policy for an action and a state at a time, and a hyperparameter for regularization”: Pascanu, paragraph 0060, “The Bellman updates are softened in the sense that the usual max operator over actions for the state values Vi is replaced by a soft-max at inverse temperature, which hardens into a max operator as ß → ∞. The optimal policy πi is then a Boltzmann policy at inverse temperature ß [showing that ß is a temperature value used in training, hence a kind of hyperparameter for regularization]”; Pascanu, paragraphs 0055-0056, “The task policies and the multitask policy are generated together in a coordinated training process by optimizing an objective function which comprises a term indicative of expected returns and one or more regularization terms which provide policy regularization [a penalty value]. A first regularization term ensures that each task policy πi is regularized towards the multitask policy, and may be defined using discounted KL divergences
[Equation image: media_image3.png]
An additional regularization term is based on discounted entropy to further encourage exploration. Specifically, the objective function to be maximized is:
[Equation image: media_image4.png]
[includes hyperparameter for regularization ß] where cKL and cEnt are scalar factors greater than zero which determine the strengths of the KL and entropy regularizations, α = cKL/(cKL+cEnt) and ß = 1/(cKL+cEnt). The log π0 (at | st) term [a logarithm of an output of the policy model neural network using the … policy for an action and a state at a time] can be thought of as a reward shaping term which encourages actions which have high probability under the multitask policy, while the entropy term -log πt (at | st) encourages exploration. In the above we used the same regularization costs cKL and cEnt for all tasks. However, it is straightforward to generalize this to task-specific costs; this can be important if tasks differ substantially in their reward scales and amounts of exploration needed, although it does introduce additional hyper-parameters.”
“determining a soft Bellman backup value that utilizes a reward and the product of a proportion of a discount factor and the hyperparameter, and a logarithm of the hyperparameter and an estimate of an action value for data at the time at an iteration”: Pascanu, paragraph 0053, “For simplicity we assume that each task has an infinite horizon, and each has the same discount factor γ [showing that γ is a discount factor]”; Pascanu, paragraph 0059, “In step 401, πi is modified with π0 fixed. With π0 fixed, (1) decomposes into separate maximization problems for each task, and is an entropy regularized expected return with a redefined (regularized) reward
[Equation image: media_image5.png]
[that utilizes a reward]. It can be optimized using soft Q learning (also known as G learning) which is based on deriving the following ‘softened’ Bellman updates [determining a soft Bellman backup value] for the state and action values (see for example, J. Schulman, P. Abbeel, and X. Chen. Equivalence between policy gradients and soft Q-Learning, arXiv: 1704.06440, 2017):
[Equation image: media_image6.png]
[showing that in (2), the ratio 1/ß is included in Vi, which in (3) is a part of a product which includes γ. Hence, the term in (3) includes γ/ß, a proportion of a discount factor and the hyperparameter. Further, (2) includes ß and π0α(at | st), inside a summation which is itself inside “log,” hence, a logarithm of the hyperparameter and an estimate of an action value for data at the time at an iteration. Further, (3) includes reward function R].”
“generating a target signal by combining the penalty value and - soft Bellman backup value; and”: Pascanu, paragraph 0055-0056, “The task policies and the multitask policy are generated together in a coordinated training process by optimizing an objective function which comprises a term indicative of expected returns and one or more regularization terms which provide policy regularization [the penalty value ]. A first regularization term ensures that each task policy it, is regularized towards the multitask policy, and may be defined using discounted KL divergences
[Equation image: media_image3.png]
An additional regularization term is based on discounted entropy to further encourage exploration. Specifically, the objective function to be maximized is:
[Equation image: media_image4.png]
[showing that formula (1) includes the regularization term acting as a penalty value]”; Pascanu, paragraph 0059, “In step 401, πi is modified with π0 fixed. With π0 fixed, (1) decomposes into separate maximization problems for each task, and is an entropy regularized expected return with a redefined (regularized) reward
[Equation image: media_image5.png]
. It can be optimized using soft Q learning (also known as G learning) which is based on deriving the following ‘softened’ Bellman updates [soft Bellman backup value] for the state and action values (see for example, J. Schulman, P. Abbeel, and X. Chen. Equivalence between policy gradients and soft Q-Learning, arXiv: 1704.06440, 2017):
[Equation image: media_image6.png]
[hence the formula (2) for the target value Vi includes the soft Bellman backup value and is an optimization of the penalty value from formula (1)].
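For clarity of the record, the “softened” Bellman updates reproduced above as media_image6.png are conventionally written as follows. This is a reconstruction from the cited Schulman et al. paper and the related multitask reinforcement learning literature, not a verbatim copy of Pascanu's figures; the published equations control:

```latex
% Softened Bellman updates (illustrative reconstruction;
% the equations published in Pascanu control)
V_i(s_t) = \frac{1}{\beta} \log \sum_{a_t} \pi_0(a_t \mid s_t)^{\alpha}
           \exp\!\bigl[\beta\, Q_i(s_t, a_t)\bigr]
\qquad \text{(2)}

Q_i(s_t, a_t) = R_i(a_t \mid s_t)
              + \gamma \sum_{s_{t+1}} p(s_{t+1} \mid s_t, a_t)\, V_i(s_{t+1})
\qquad \text{(3)}
```

Consistent with the bracketed mapping above, the 1/ß factor inside Vi yields a γ/ß product when Vi(st+1) is multiplied by γ in (3), and (2) places ß and π0^α(at | st) inside a summation under the logarithm.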
Pascanu and Kalakrishnan are analogous arts as they are both related to reinforcement learning methods. It would have been obvious to a person having ordinary skill in the art prior to the effective filing date of the claimed invention to have combined the use of soft Bellman, temperature, and logarithmic functions for training in Pascanu with the teachings of Kalakrishnan to arrive at the present invention, in order to improve model training, as stated in Pascanu, paragraph 0026, “The subject matter described in this specification can be implemented in particular embodiments so as to realize one or more of the following advantages. The methods can be used to more efficiently train a neural network.”
Regarding claim 17:
Kalakrishnan as modified by Pascanu and Chao teaches the autonomous vehicle training system of claim 16.
Kalakrishnan further teaches “wherein the program instructions further cause the computer to obtain the determined imitation learning expert policy using an expert demonstration”: Kalakrishnan, paragraph 0003, “In some implementations, the meta-learning model can include a trial policy used to gather information about a new task through imitation learning. One or more demonstrations of the new task can be used to generate the trial policy for the new task [obtain the determined imitation learning expert policy using an expert demonstration].”
Regarding claim 18:
Kalakrishnan as modified by Pascanu and Chao teaches the autonomous vehicle training system of claim 17.
Kalakrishnan further teaches “wherein the program instructions further cause the computer to use a supervised machine learning process to obtain the determined imitation learning expert policy”: Kalakrishnan, paragraph 0037, “In some implementations, a distribution of tasks p(Τ) can be assumed, from which the meta-training tasks {Τi} and held-out meta-test {Τi} tasks are drawn. During meta-training, supervision in the form of expert demonstration trajectories {Τi+} and a binary reward function ri, that can be queried for each of the meta-training tasks Τi can be used [use a supervised machine learning process to obtain the determined imitation learning expert policy].”
Regarding claim 19:
Kalakrishnan as modified by Pascanu and Chao teaches the autonomous vehicle training system of claim 17.
Kalakrishnan further teaches “further comprising generating the expert demonstration from a pre-trained model”: Kalakrishnan, paragraph 0004, “One or more trials of the robot attempting to perform the new task can be generated using the trial policy for the new task. The one or more trials can be used to train the adapted policy using reinforcement learning to perform the new task. Since the trial policy and the adapted trial policy share parameters, training the adapted policy via reinforcement learning will also update one or more portions of the trial policy. This updated trial policy can then be used to generate additional trial(s) of the new task [generating the expert demonstration from a pre-trained model], which in turn can be used to further train the adapted policy network. In other words, trials for the new task may be continuously generated based on the current trial policy parameters when training the meta-learning model.”
Regarding claim 20:
Kalakrishnan as modified by Pascanu and Chao teaches the autonomous vehicle training system of claim 16.
Kalakrishnan further teaches “wherein the program instructions further cause the computer to perform fitted Q iteration with the determined imitation learning expert policy, where Q is a function that is defined on a state-action space”: Kalakrishnan, paragraphs 0039-0040: “To leverage both demonstration and trial episode information, a Q-value function Q(s, a; ΘQ) can be maintained that is decomposed into the value and normalized advantage function, as in equation (2). Note that, to incorporate both demonstrations from an expert and trials taken by the agent, an imitation learning objective is not necessarily needed since reinforcement learning objectives can also learn from successful trajectories. Thus, in some implementations, the Bellman error LRL can be used in the inner adaptation step. The adapted Q-value function Q(s, a; ΦQ') [Q is a function that is defined on a state-action space] can be obtained by taking gradient steps with respect to LRL evaluated on a batch of demonstration and trial episodes {τ1, . . . , τK} corresponding to task Τi:
[Equation image: media_image1.png]
where the first episode τ1 is a demonstration and where k ∈ {1, . . . , K} are the trials taken by the agent [perform fitted Q iteration with the determined imitation learning expert policy].”
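The gradient-step adaptation quoted above (media_image1.png) may be sketched as a generic inner-loop update of the following form; the step size η and the exact arrangement are assumptions for illustration only, and the equation published in Kalakrishnan controls:

```latex
% Inner adaptation step (illustrative sketch; \eta is an assumed step size)
\phi_{Q}' \;=\; \theta_{Q} \;-\; \eta\, \nabla_{\theta_{Q}}
  \mathcal{L}_{RL}\bigl(\theta_{Q};\, \{\tau_1, \ldots, \tau_K\}\bigr)
```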
Regarding claim 21:
Kalakrishnan teaches:
“a hardware processor; and a memory that stores a computer program product, which, when executed by the hardware processor, causes the hardware processor to”: Kalakrishnan, paragraph 0155, “In addition, some implementations include one or more processors (e.g., central processing unit(s) (CPU(s)), graphics processing unit(s) (GPU(s)), and/or tensor processing unit(s) (TPU(s)) [a hardware processor] of one or more computing devices, where the one or more processors are operable to execute instructions stored in associated memory, and where the instructions are configured to cause performance of any of the methods described herein. Some implementations also include one or more non-transitory computer readable storage media storing computer instructions executable by one or more processors to perform any of the methods described herein [and a memory that stores a computer program product, which, when executed by the hardware processor, causes the hardware processor].”
“approximate an expert policy for imitation learning to obtain a determined imitation learning expert policy”: Kalakrishnan, paragraph 0003, “In some implementations, the meta-learning model can include a trial policy used to gather information about a new task through imitation learning. One or more demonstrations of the new task [an expert policy] can be used to generate the trial policy for the new task [approximate an expert policy for imitation learning to obtain a determined imitation learning expert policy]. This trial policy can then be used to generate one or more trial and error attempts of performing the task. In some implementations, the trial policy can be used to constrain potential actions made by the agent when attempting to perform the task in the trial and error attempts. Additionally or alternatively, the meta-learning model can include an adapted trial policy which can be used to extract and integrate information from the trial(s) with the demonstration(s) to learn how to perform the new task. The adapted trial policy can be trained using reinforcement learning.”
“iteratively train a policy model neural network using the determined imitation learning expert policy to perform batch mode reinforcement learning that is based on the determined imitation learning expert policy by”: Kalakrishnan, paragraphs 0039-0040: “To leverage both demonstration and trial episode information, a Q-value function Q(s, a; ΘQ) can be maintained that is decomposed into the value and normalized advantage function, as in equation (2). Note that, to incorporate both demonstrations from an expert and trials taken by the agent, an imitation learning objective is not necessarily needed since reinforcement learning [reinforcement learning] objectives can also learn from successful trajectories. Thus, in some implementations, the Bellman error LRL can be used in the inner adaptation step. The adapted Q-value function Q(s, a; ΦQ') can be obtained by taking gradient steps with respect to LRL evaluated on a batch [batch mode] of demonstration and trial episodes {τ1, . . . , τK} corresponding to task Τi:
[Equation image: media_image1.png]
where the first episode τ1 is a demonstration and where k ∈ {1, . . . , K} are the trials taken by the agent [iteratively train a policy model neural network using the determined imitation learning expert policy].”
(bold only) “determining a penalty value based on a product of a logarithm of an output of the policy model neural network using the determined imitation learning expert policy for an action and a state at a time, and a hyperparameter for regularization”: Kalakrishnan, paragraph 0039, “The adapted Q-value function Q(s, a; ΦQ') can be obtained by taking gradient steps with respect to LRL evaluated on a batch of demonstration and trial episodes {τ1, . . . , τK} corresponding to task Τi:
[Equation image: media_image1.png]
where the first episode τ1 is a demonstration and where k ∈ {1, . . . , K} are the trials taken by the agent [determining … an output of the policy model neural network using the determined imitation learning expert policy for an action and a state at a time].”
“modifying the policy model neural network at the iteration to decrease a difference between an output of the policy model neural network and the target signal”: Kalakrishnan, paragraph 0036, “This Q-function can be learned through standard Q-learning techniques, for example by minimizing [modifying the policy model neural network at the iteration to decrease] Bellman error:
[Equation image: media_image2.png]
[showing a difference calculation between Q(st, at; ΦQ' ) (an output of the policy model neural network) and rt + γV(st+1;Φ t) (target signal)]”
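For the record, a standard Bellman-error objective of the kind quoted from paragraph 0036 (media_image2.png) is conventionally written as follows; this is a reconstruction consistent with the bracketed mapping above, and the equation published in Kalakrishnan controls:

```latex
% Bellman error (reconstruction consistent with the mapping above)
\mathcal{L} \;=\; \mathbb{E}\Bigl[\bigl( Q(s_t, a_t; \phi_{Q}')
   \;-\; \bigl( r_t + \gamma\, V(s_{t+1}; \phi_t) \bigr) \bigr)^{2}\Bigr]
```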
Kalakrishnan does not explicitly teach:
“A vehicle control system, comprising”
“determine an action for a vehicle in an environment, using state information of the environment as input to the trained policy model neural network”
“and issue an instruction to the vehicle to implement the determined action”
(bold only) “determining a penalty value based on a product of a logarithm of an output of the policy model neural network using the determined imitation learning expert policy for an action and a state at a time, and a hyperparameter for regularization”
“determining a soft Bellman backup value that utilizes a reward and the product of a proportion of a discount factor and the hyperparameter, and a logarithm of the hyperparameter and an estimate of an action value for data at the time at an iteration”
“generating a target signal by combining the penalty value and - soft Bellman backup value”
Chao teaches:
“A vehicle control system, comprising”: Chao, paragraph 0147, “In at least one embodiment, a steering system 1154, which may include, without limitation, a steering wheel, is used to steer vehicle 1100 (e.g., along a desired path or route) when propulsion system 1150 is operating ( e.g., when vehicle 1100 is in motion) [vehicle control system]. In at least one embodiment, steering system 1154 may receive signals from steering actuator(s) 1156. In at least one embodiment, a steering wheel may be optional for full automation (Level 5) functionality. In at least one embodiment, a brake sensor system 1146 may be used to operate vehicle brakes in response to receiving signals from brake actuator(s) 1148 and/or brake sensors.”
“determine an action for a vehicle in an environment, using state information of the environment as input to the trained policy model neural network”: Chao, paragraph 0069, “In at least one embodiment, the robot is enabled to imitate the observed demonstration and apply it to different environments [determine an action for a vehicle in an environment]. For example, given a video demonstration De1 in an environment e1, identification of goal G permits a task planning problem to be solved in a new environment e2 ≠ e1 . The robot can then execute the output plan in e2 based on the goal G that was derived from De1 In at least one embodiment, G is also independent of the agent's state and motion when the task domain definition is shared between e1 and e2 [state information of the environment]. For example, in at least one embodiment, a demonstration is performed in an abstract or artificial environment, and the robot achieves the corresponding goal in a real environment”; Chao, paragraphs 0116-0117, “FIG. 5A illustrates inference and/or training logic 815 used to perform inferencing and/or training operations associated with one or more embodiments. Details regarding inference and/or training logic 815 are provided below in conjunction with FIGS. 5A and/or 8B. In at least one embodiment, inference and/or training logic 815 may include, without limitation, code and/or data storage 801 to store forward and/or output weight and/or input/output data, and/or other parameters to configure neurons or layers of a neural network trained and/or used for inferencing in aspects of one or more embodiments [as input to the trained policy model neural network].”
“and issue an instruction to the vehicle to implement the determined action”: Chao, paragraph 0115, “At 710, the robotic device causes its robotic manipulation device, such as a robotic arm, to manipulate objects in the environment in order to accomplish the goal. In at least one embodiment, the robotic device formulates instructions that cause one or more of its manipulation devices to implement a trajectory [issue an instruction to the vehicle to implement the determined action]. For example, in order to satisfy an in_ pot(ingredient) predicate, the robotic device may formulate instructions such as ‘move arm to x1, y1, z1,’ ‘grasp object,’ move arm to ‘x2, y2, z2', and so on.”
Chao and Kalakrishnan are both related to the same field of endeavor (imitation learning). It would have been obvious to a person having ordinary skill in the art prior to the effective filing date of the claimed invention to have combined the autonomous vehicle use of Chao to the teachings of Kalakrishnan to arrive at the present invention, in order to improve driving safety through automation, as stated in Chao, paragraph 0158, “In at least one embodiment, front-facing cameras may be used to perform many similar ADAS functions as LIDAR, including, without limitation, emergency braking, pedestrian detection, and collision avoidance.”
Pascanu teaches:
(bold only) “determining a penalty value based on a product of a logarithm of an output of the policy model neural network using the determined imitation learning expert policy for an action and a state at a time, and a hyperparameter for regularization”: Pascanu, paragraph 0060, “The Bellman updates are softened in the sense that the usual max operator over actions for the state values Vi is replaced by a soft-max at inverse temperature ß, which hardens into a max operator as ß → ∞. The optimal policy πi is then a Boltzmann policy at inverse temperature ß [showing that ß is a temperature value used in training, hence a kind of hyperparameter for regularization]”; Pascanu, paragraphs 0055-0056, “The task policies and the multitask policy are generated together in a coordinated training process by optimizing an objective function which comprises a term indicative of expected returns and one or more regularization terms which provide policy regularization [a penalty value]. A first regularization term ensures that each task policy πi is regularized towards the multitask policy, and may be defined using discounted KL divergences
[Equation image: media_image3.png]
An additional regularization term is based on discounted entropy to further encourage exploration. Specifically, the objective function to be maximized is:
[Equation image: media_image4.png]
[includes hyperparameter for regularization ß] where cKL and cEnt are scalar factors greater than zero which determine the strengths of the KL and entropy regularizations, α = cKL/(cKL+cEnt) and ß = 1/(cKL+cEnt). The log π0(at | st) term [a logarithm of an output of the policy model neural network using the … policy for an action and a state at a time] can be thought of as a reward shaping term which encourages actions which have high probability under the multitask policy, while the entropy term -log πi(at | st) encourages exploration. In the above we used the same regularization costs cKL and cEnt for all tasks. However, it is straightforward to generalize this to task-specific costs; this can be important if tasks differ substantially in their reward scales and amounts of exploration needed, although it does introduce additional hyper-parameters.”
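The objective function quoted above (media_image4.png) is conventionally written, in the literature underlying Pascanu, approximately as follows; the reconstruction is illustrative only and the published equation controls:

```latex
% Regularized multitask objective (illustrative reconstruction;
% the equation published in Pascanu controls)
J\bigl(\pi_0, \{\pi_i\}\bigr) \;=\; \sum_i \mathbb{E}_{\pi_i}\!\Bigl[
  \sum_{t \ge 0} \gamma^{t} \Bigl( R_i(a_t, s_t)
  + \tfrac{\alpha}{\beta} \log \pi_0(a_t \mid s_t)
  - \tfrac{1}{\beta} \log \pi_i(a_t \mid s_t) \Bigr) \Bigr]
\qquad \text{(1)}

% with the scalar factors quoted above:
\alpha = \frac{c_{KL}}{c_{KL} + c_{Ent}}, \qquad
\beta = \frac{1}{c_{KL} + c_{Ent}}
```

In this form the log π0(at | st) term is the reward-shaping term identified in the mapping above as a logarithm of the policy output, with ß serving as the regularization hyperparameter.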
“determining a soft Bellman backup value that utilizes a reward and the product of a proportion of a discount factor and the hyperparameter, and a logarithm of the hyperparameter and an estimate of an action value for data at the time at an iteration”: Pascanu, paragraph 0053, “For simplicity we assume that each task has an infinite horizon, and each has the same discount factor γ [showing that γ is a discount factor]”; Pascanu, paragraph 0059, “In step 401, πi is modified with π0 fixed. With π0 fixed, (1) decomposes into separate maximization problems for each task, and is an entropy regularized expected return with a redefined (regularized) reward
[Equation image: media_image5.png]
[that utilizes a reward]. It can be optimized using soft Q learning (also known as G learning) which is based on deriving the following ‘softened’ Bellman updates [determining a soft Bellman backup value] for the state and action values (see for example, J. Schulman, P. Abbeel, and X. Chen. Equivalence between policy gradients and soft Q-Learning, arXiv: 1704.06440, 2017):
[Equation image: media_image6.png]
[showing that in (2), the ratio 1/ß is included in Vi, which in (3) is a part of a product which includes γ. Hence, the term in (3) includes γ/ß, a proportion of a discount factor and the hyperparameter. Further, (2) includes ß and π0^α(at | st), inside a summation which is itself inside “log,” hence, a logarithm of the hyperparameter and an estimate of an action value for data at the time at an iteration. Further, (3) includes reward function R].”
“generating a target signal by combining the penalty value and - soft Bellman backup value; and”: Pascanu, paragraphs 0055-0056, “The task policies and the multitask policy are generated together in a coordinated training process by optimizing an objective function which comprises a term indicative of expected returns and one or more regularization terms which provide policy regularization [the penalty value]. A first regularization term ensures that each task policy πi is regularized towards the multitask policy, and may be defined using discounted KL divergences
[Equation image: media_image3.png]
An additional regularization term is based on discounted entropy to further encourage exploration. Specifically, the objective function to be maximized is:
[Equation image: media_image4.png]
[showing that formula (1) includes the regularization term acting as a penalty value]”; Pascanu, paragraph 0059, “In step 401, πi is modified with π0 fixed. With π0 fixed, (1) decomposes into separate maximization problems for each task, and is an entropy regularized expected return with a redefined (regularized) reward
[Equation image: media_image5.png]
. It can be optimized using soft Q learning (also known as G learning) which is based on deriving the following ‘softened’ Bellman updates [soft Bellman backup value] for the state and action values (see for example, J. Schulman, P. Abbeel, and X. Chen. Equivalence between policy gradients and soft Q-Learning, arXiv: 1704.06440, 2017):
[Equation image: media_image6.png]
[hence the formula (2) for the target value Vi includes the soft Bellman backup value and is an optimization of the penalty value from formula (1)].
Pascanu and Kalakrishnan are analogous arts as they are both related to reinforcement learning methods. It would have been obvious to a person having ordinary skill in the art prior to the effective filing date of the claimed invention to have combined the use of soft Bellman, temperature, and logarithmic functions for training in Pascanu with the teachings of Kalakrishnan to arrive at the present invention, in order to improve model training, as stated in Pascanu, paragraph 0026, “The subject matter described in this specification can be implemented in particular embodiments so as to realize one or more of the following advantages. The methods can be used to more efficiently train a neural network.”
Regarding claim 22:
Kalakrishnan as modified by Pascanu and Chao teaches the vehicle control system of claim 21.
Kalakrishnan further teaches “wherein the program instructions further cause the computer to obtain the determined imitation learning expert policy using an expert demonstration”: Kalakrishnan, paragraph 0003, “In some implementations, the meta-learning model can include a trial policy used to gather information about a new task through imitation learning. One or more demonstrations of the new task can be used to generate the trial policy for the new task [obtain the determined imitation learning expert policy using an expert demonstration].”
Regarding claim 23:
Kalakrishnan as modified by Pascanu and Chao teaches the vehicle control system of claim 22.
Kalakrishnan further teaches “wherein the program instructions further cause the computer to use a supervised machine learning process to obtain the determined imitation learning expert policy”: Kalakrishnan, paragraph 0037, “In some implementations, a distribution of tasks p(Τ) can be assumed, from which the meta-training tasks {Τi} and held-out meta-test {Τi} tasks are drawn. During meta-training, supervision in the form of expert demonstration trajectories {Τi+} and a binary reward function ri, that can be queried for each of the meta-training tasks Τi can be used [use a supervised machine learning process to obtain the determined imitation learning expert policy].”
Regarding claim 24:
Kalakrishnan as modified by Pascanu and Chao teaches the vehicle control system of claim 22.
Kalakrishnan further teaches “further comprising generating the expert demonstration from a pre-trained model”: Kalakrishnan, paragraph 0004, “One or more trials of the robot attempting to perform the new task can be generated using the trial policy for the new task. The one or more trials can be used to train the adapted policy using reinforcement learning to perform the new task. Since the trial policy and the adapted trial policy share parameters, training the adapted policy via reinforcement learning will also update one or more portions of the trial policy. This updated trial policy can then be used to generate additional trial(s) of the new task [generating the expert demonstration from a pre-trained model], which in turn can be used to further train the adapted policy network. In other words, trials for the new task may be continuously generated based on the current trial policy parameters when training the meta-learning model.”
Regarding claim 25:
Kalakrishnan as modified by Pascanu and Chao teaches the vehicle control system of claim 21.
Kalakrishnan further teaches “wherein the program instructions further cause the computer to perform fitted Q iteration with the determined imitation learning expert policy, where Q is a function that is defined on a state-action space”: Kalakrishnan, paragraphs 0039-0040: “To leverage both demonstration and trial episode information, a Q-value function Q(s, a; ΘQ) can be maintained that is decomposed into the value and normalized advantage function, as in equation (2). Note that, to incorporate both demonstrations from an expert and trials taken by the agent, an imitation learning objective is not necessarily needed since reinforcement learning objectives can also learn from successful trajectories. Thus, in some implementations, the Bellman error LRL can be used in the inner adaptation step. The adapted Q-value function Q(s, a; ΦQ') [Q is a function that is defined on a state-action space] can be obtained by taking gradient steps with respect to LRL evaluated on a batch of demonstration and trial episodes {τ1, . . . , τK} corresponding to task Τi:
[Equation image: media_image1.png]
where the first episode τ1 is a demonstration and where k ∈ {1, . . . , K} are the trials taken by the agent [perform fitted Q iteration with the determined imitation learning expert policy].”
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Schulman et al., “Equivalence Between Policy Gradients and Soft Q-Learning,” 2018, arXiv:1704.06440v4, discloses, in section 3.5, soft Q-learning, described as an iterative process that minimizes a difference between a model value and a target.
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to VINCENT SPRAUL whose telephone number is (703) 756-1511. The examiner can normally be reached M-F 9:00 am - 5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, MICHAEL HUNTLEY can be reached at (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/VAS/Examiner, Art Unit 2129
/MICHAEL J HUNTLEY/Supervisory Patent Examiner, Art Unit 2129