Office Action Analysis: 17918365 — LEARNING OPTIONS FOR ACTION SELECTION WITH META-GRADIENTS IN MULTI-TASK REINFORCEMENT LEARNING

Examiner Intelligence

Examiner: DAY, ROBERT N
Career Allowance Rate: grants only 21% of cases (5 granted / 24 resolved; -34.2% vs TC avg)
Interview Lift: strong, +20.0% with an interview vs without (resolved cases with interview)
Typical timeline: 4y 1m Avg Prosecution; 16 currently pending
Career history: 61 Total Applications across all art units
Statute-Specific Performance

§101: 4.6% (-35.4% vs TC avg)
§103: 85.7% (+45.7% vs TC avg)
§102: 9.7% (-30.3% vs TC avg)
Tech Center average is an estimate • Based on career data from 24 resolved cases
Office Action

§101 §103 §112
DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

This action is in response to the application and the preliminary amendment filed 12 October 2022. In the preliminary amendment, Claims 17 and 20-22 are cancelled. Claims 1-16, 18, 19, 23, and 24 are pending and have been examined.

Information Disclosure Statement
The information disclosure statements (IDS) submitted on 23 January 2024 and 07 May 2024 are being considered by the examiner.

Specification
The application and preliminary amendments filed 12 October 2022 do not contain an abstract of the disclosure provided on a separate sheet, as required by 37 CFR 1.72(b). While an abstract was included in the 12 October 2022 filing, it is not presented on a separate sheet. Appropriate correction is required.

Claim Rejections - 35 USC § 112(b)
The following is a quotation of 35 U.S.C. 112(b):

(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 18 and 19 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Claims 18 and 19 depend on canceled independent Claim 17. For the purposes of examination, the claims have been interpreted to read: "The system of claim 1, wherein ..." (emphasis added). Appropriate correction is required.

Claim Rejections - 35 USC § 112(d)
The following is a quotation of 35 U.S.C. 112(d):
(d) REFERENCE IN DEPENDENT FORMS.—Subject to subsection (e), a claim in dependent form shall contain a reference to a claim previously set forth and then specify a further limitation of the subject matter claimed. A claim in dependent form shall be construed to incorporate by reference all the limitations of the claim to which it refers.

The following is a quotation of pre-AIA 35 U.S.C. 112, fourth paragraph:
Subject to the following paragraph [i.e., the fifth paragraph of pre-AIA 35 U.S.C. 112], a claim in dependent form shall contain a reference to a claim previously set forth and then specify a further limitation of the subject matter claimed. A claim in dependent form shall be construed to incorporate by reference all the limitations of the claim to which it refers.

Claims 23 and 24 are rejected under 35 U.S.C. 112(d) or pre-AIA 35 U.S.C. 112, 4th paragraph, as being of improper dependent form for failing to further limit the subject matter of the claim upon which it depends, or for failing to include all the limitations of the claim upon which it depends.
Claim 23 recites "the method comprising operations performed by the system of claim 1." As recited, the method of Claim 23 has been interpreted to comprise some or all operations recited as performed by the system of Claim 1, thus it may omit an element from Claim 1. Per MPEP 608.01(n)(III), a claim in proper dependent form must incorporate all the limitations of the claim on which it depends. Claim 23 fails to include all limitations of Claim 1 and is therefore improper.
Claim 24 recites "perform operations of the system of claim 1." As recited, the media storing instructions of Claim 24 has been interpreted to cause a computer to perform some or all operations recited by the system of Claim 1, thus it may omit an element from Claim 1. Claim 24 is therefore improper under the same rationale as Claim 23.
Additionally, Claim 1 does not explicitly recite operations. For the purposes of examination, Claim 23 has been interpreted to incorporate all limitations of Claim 1 that positively recite a step to result from configuration, such as "... wherein the system is configured to, at each of a plurality of time steps, process an input comprising an observation ...." (emphasis added). However, Claim 23 has been interpreted to incorporate only the positively recited steps as operations. For example, Claim 23 has not been interpreted to recite the structural limitations of the system recited by Claim 1, such as the manager and option networks of "the system comprising: a manager neural network, and a set of option policy neural networks each for selecting a sequence of actions." Claim 24 has been interpreted to recite, as operations, only the positively recited steps resulting from the configuration recited by Claim 1, under the same rationale as Claim 23.
Applicant may cancel the claims, amend the claims to place the claims in proper dependent form, rewrite the claims in independent form, or present a sufficient showing that the dependent claims comply with the statutory requirements.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-16, 18, 19, 23, and 24 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.

Regarding Claim 1
Step 1
Claim 1 recites a computer-implemented system, and thus the claimed machine falls within a statutory category of invention.
Step 2A Prong 1
The claim recites controlling an agent to perform a plurality of tasks while interacting with an environment, which is a mental process. The claim recites to generate an output for selecting an action to be performed by the agent, and receive a task reward in response to the action, which is a mental process. The claim recites to generate an output for selecting a manager action from a set of manager actions, wherein the set of manager actions comprises possible actions that can be performed by the agent and a set of option selection actions, each option selection action selecting one of the option policy neural networks, which is a mental process. The claim recites to generate an output for selecting an action to be performed by the agent, which is a mental process. The claim recites wherein, when the selected manager action is an option selection action, the option policy neural network selected by the manager action generates the output for selecting an action for successive time steps until an option termination criterion is met, and when the selected manager action is one of the possible actions that can be performed by the agent the output for selecting the action is the selected manager action, which is a mental process. The claim recites for a time step: process the observation, according to parameter values of the option reward neural network, which is a mental process. The claim recites generate an option reward for the respective option policy neural network, which is a mental process.
Thus, the claim recites an abstract idea.
Step 2A Prong 2, Step 2B
The additional element wherein the system is configured to, at each of a plurality of time steps, process an input comprising an observation characterizing a current state of the environment invokes a computer or other machinery merely as a tool to perform an existing process (see MPEP 2106.05(f), "apply it"). The additional element a manager neural network, and a set of option policy neural networks each for selecting a sequence of actions to be performed by the agent according to a respective option policy invokes a computer or other machinery merely as a tool to perform an existing process (see MPEP 2106.05(f), "apply it"). The additional element wherein the manager neural network is configured to, at a time step: process the observation and data identifying one of the tasks currently being performed by the agent, according to parameter values of the manager neural network invokes a computer or other machinery merely as a tool to perform an existing process (see MPEP 2106.05(f), "apply it"). The additional element wherein each option policy neural network is configured to, at each of a succession of time steps: process the observation for the time step, according to an option policy defined by parameter values of the option policy neural network invokes a computer or other machinery merely as a tool to perform an existing process (see MPEP 2106.05(f), "apply it"). The additional element a set of option reward neural networks, one for each respective option policy neural network, each configured invokes a computer or other machinery merely as a tool to perform an existing process (see MPEP 2106.05(f), "apply it"). The additional element wherein the system is configured to train the set of option reward neural networks and the manager neural network using the task rewards invokes a computer or other machinery merely as a tool to perform an existing process (see MPEP 2106.05(f), "apply it"). 
The additional element to train each of the option policy neural networks using the option reward for the respective option policy neural network invokes a computer or other machinery merely as a tool to perform an existing process (see MPEP 2106.05(f), "apply it").
The claim lacks additional elements that integrate it into a practical application or provide significantly more, so it is directed to an abstract idea and is ineligible.
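For orientation only, the action-selection flow recited in Claim 1 can be sketched as below. This is a minimal illustration under stated assumptions, not the applicant's implementation: the names `run_episode`, `manager`, `options`, `terminations`, `env_step`, and `num_primitive` are hypothetical, the networks are replaced by plain callables, and all training (including the option reward neural networks) is omitted.

```python
def run_episode(manager, options, terminations, env_step, obs, task_id,
                num_primitive, max_steps=20):
    """Hypothetical sketch of the manager/option selection loop of Claim 1.

    Manager actions 0..num_primitive-1 are treated as primitive agent actions;
    larger indices are treated as option selection actions (an assumption).
    """
    active = None  # index of the option currently in control, if any
    trace = []
    for _ in range(max_steps):
        if active is None:
            m = manager(obs, task_id)       # manager picks a manager action
            if m < num_primitive:
                action = m                   # primitive action is passed through
            else:
                active = m - num_primitive   # option selection action
        if active is not None:
            action = options[active](obs)    # selected option picks the action
        obs, task_reward = env_step(action)  # environment returns a task reward
        trace.append((action, task_reward, active))
        # The option keeps control until its termination criterion is met.
        if active is not None and terminations[active](obs):
            active = None
    return trace
```

Each entry of the returned trace records the action taken, the task reward received, and which option (if any) was in control at that time step.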

Regarding Claim 2
Step 1
Regarding Claim 2, the rejection of Claim 1 is incorporated.
Step 2A Prong 1
The claim recites parameter values of the option reward neural network are adjusted based on the agent's interaction with the environment under control of the respective option policy neural network, to optimize a return from the environment, which is a mental process.
Thus, the claim recites an abstract idea.
Step 2A Prong 2, Step 2B
The additional element wherein the system is configured to train each option reward neural network using the task reward in a meta-gradient training technique invokes a computer or other machinery merely as a tool to perform an existing process (see MPEP 2106.05(f), "apply it").
The claim lacks additional elements that integrate it into a practical application or provide significantly more, so it is directed to an abstract idea and is ineligible.
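As background for the term "meta-gradient" in Claim 2, the toy sketch below differentiates a task objective through one inner parameter update. It is a scalar illustration under stated assumptions, not the claimed training technique: the function names, the linear learned reward `eta * theta * x`, the quadratic stand-in for the task return, and the learning rate are all hypothetical.

```python
def inner_update(theta, eta, x, lr=0.1):
    # Inner step (assumed form): the policy parameter theta ascends a learned
    # reward eta * theta * x, whose gradient with respect to theta is eta * x.
    return theta + lr * eta * x

def task_objective(theta, target=1.0):
    # Hypothetical stand-in for the task return: largest when theta == target.
    return -(theta - target) ** 2

def meta_gradient(theta, eta, x, lr=0.1, target=1.0):
    # Chain rule through the inner update:
    # dJ/d_eta = dJ/d_theta_new * d_theta_new/d_eta.
    theta_new = inner_update(theta, eta, x, lr)
    dj_dtheta_new = -2.0 * (theta_new - target)
    dtheta_new_deta = lr * x
    return dj_dtheta_new * dtheta_new_deta
```

The point of the sketch is only that the reward parameter `eta` is adjusted according to how the updated policy parameter performs on the task objective, rather than according to the learned reward itself.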

Regarding Claim 3
Step 1
Regarding Claim 3, the rejection of Claim 1 is incorporated.
Step 2A Prong 1
The claim recites to train each of the option policy neural networks using the option reward for the respective option policy neural network by, after the option selection action and for a succession of time steps until the termination criterion is met: updating the parameter values of the manager neural network using the task rewards, which is a mental process. The claim recites updating the parameter values of the respective option policy neural network selected by the option selection action using the option reward for the respective option policy neural network, which is a mental process. The claim recites after the termination criterion is met: updating the parameter values of the option reward neural network for the respective option policy neural network using the task rewards, which is a mental process.
Thus, the claim recites an abstract idea.
Step 2A Prong 2, Step 2B
The additional element to train the set of option reward neural networks and the manager neural network using the task rewards invokes a computer or other machinery merely as a tool to perform an existing process (see MPEP 2106.05(f), "apply it").
The claim lacks additional elements that integrate it into a practical application or provide significantly more, so it is directed to an abstract idea and is ineligible.

Regarding Claim 4
Step 1
Regarding Claim 4, the rejection of Claim 3 is incorporated.
Step 2A Prong 1
The claim recites wherein updating the parameter values of the option reward neural network for the respective option policy neural network using the task rewards comprises: generating a trajectory comprising a sequence of one or more actions selected by the respective option policy neural network selected by the option selection action, and corresponding observations and task rewards, which is a mental process. The claim recites updating the parameter values of the option reward neural network for the respective option policy neural network using the task rewards from the trajectory, which is a mental process.
Thus, the claim recites an abstract idea.
Step 2A Prong 2, Step 2B
The claim lacks additional elements that integrate it into a practical application or provide significantly more, so it is directed to an abstract idea and is ineligible.

Regarding Claim 5
Step 1
Regarding Claim 5, the rejection of Claim 4 is incorporated.
Step 2A Prong 1
The claim recites wherein updating the parameter values of the option reward neural network for the respective option policy neural network using the task rewards from the trajectory comprises back propagating gradients of an option reward objective function based on the task rewards from the trajectory through the respective option policy neural network and through the option reward neural network for the respective option policy neural network, which is a mental process.
Thus, the claim recites an abstract idea.
Step 2A Prong 2, Step 2B
The claim lacks additional elements that integrate it into a practical application or provide significantly more, so it is directed to an abstract idea and is ineligible.

Regarding Claim 6
Step 1
Regarding Claim 6, the rejection of Claim 3 is incorporated.
Step 2A Prong 1
The claim recites wherein updating one or more of the parameter values of the manager neural network, the parameter values of the respective option policy neural network, and the parameter values of the option reward neural network, comprises updating based on an n-step return, which is a mental process.
Thus, the claim recites an abstract idea.
Step 2A Prong 2, Step 2B
The claim lacks additional elements that integrate it into a practical application or provide significantly more, so it is directed to an abstract idea and is ineligible.
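For readers unfamiliar with the term in Claim 6, an n-step return in the standard reinforcement learning sense is the discounted sum of n observed rewards plus a discounted bootstrap value estimate. The sketch below is illustrative only; the function name, the discount factor `gamma`, and the `bootstrap_value` argument are assumptions not taken from the Office action.

```python
def n_step_return(rewards, bootstrap_value, gamma=0.99):
    """Return r_1 + gamma*r_2 + ... + gamma^(n-1)*r_n + gamma^n * bootstrap_value."""
    g = bootstrap_value
    # Work backwards so each reward is discounted once per step of delay.
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

With two rewards of 1.0, a bootstrap value of 2.0, and `gamma=0.5`, the function evaluates 1 + 0.5 * (1 + 0.5 * 2), which is 2.0.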

Regarding Claim 7
Step 1
Regarding Claim 7, the rejection of Claim 3 is incorporated.
Step 2A Prong 1
The claim recites wherein updating the parameter values of the manager neural network using the task rewards comprises backpropagating gradients of a manager objective function, wherein updating the parameter values of the respective option policy neural network comprises backpropagating gradients of an option policy objective function, and wherein the manager objective function and option policy objective function each comprise a respective reinforcement learning objective function, which is a mental process.
Thus, the claim recites an abstract idea.
Step 2A Prong 2, Step 2B
The claim lacks additional elements that integrate it into a practical application or provide significantly more, so it is directed to an abstract idea and is ineligible.

Regarding Claim 8
Step 1
Regarding Claim 8, the rejection of Claim 7 is incorporated.
Step 2A Prong 1
The claim recites wherein updating the parameter values of the manager neural network using the task rewards comprises backpropagating gradients of a manager objective function ... (as recited in Claim 7), wherein the gradients of the manager objective function and of the option policy objective function comprise respective policy gradients, which is a mental process.
Thus, the claim recites an abstract idea.
Step 2A Prong 2, Step 2B
The claim lacks additional elements that integrate it into a practical application or provide significantly more, so it is directed to an abstract idea and is ineligible.

Regarding Claim 9
Step 1
Regarding Claim 9, the rejection of Claim 1 is incorporated.
Step 2A Prong 1
The claim recites a set of option termination neural networks, one for each respective option policy neural network, each configured to, at each of the time steps: process the observation, according to parameter values of the option reward neural network, to generate an option termination value for the respective option policy neural network, wherein, for each option reward neural network, the option termination value determines whether the option termination criterion is met, which is a mental process.
Thus, the claim recites an abstract idea.
Step 2A Prong 2, Step 2B
The claim lacks additional elements that integrate it into a practical application or provide significantly more, so it is directed to an abstract idea and is ineligible.

Regarding Claim 10
Step 1
Regarding Claim 10, the rejection of Claim 9 is incorporated.
Step 2A Prong 1
The claim recites wherein the system is configured to train the option termination neural networks using the task rewards in a meta-gradient training technique in which parameter values of the option termination neural network are adjusted based on the agent's interaction with the environment under control of the respective option policy neural network, to optimize a return from the environment, which is a mental process.
Thus, the claim recites an abstract idea.
Step 2A Prong 2, Step 2B
The claim lacks additional elements that integrate it into a practical application or provide significantly more, so it is directed to an abstract idea and is ineligible.

Regarding Claim 11
Step 1
Regarding Claim 11, the rejection of Claim 9 is incorporated.
Step 2A Prong 1
The claim recites wherein the system is configured to train the set of option termination neural networks by, after the termination criterion is met for a respective option policy neural network: updating the parameter values of the option termination neural network for the respective option policy neural network using the task rewards, which is a mental process.
Thus, the claim recites an abstract idea.
Step 2A Prong 2, Step 2B
The claim lacks additional elements that integrate it into a practical application or provide significantly more, so it is directed to an abstract idea and is ineligible.

Regarding Claim 12
Step 1
Regarding Claim 12, the rejection of Claim 11 is incorporated.
Step 2A Prong 1
The claim recites wherein updating the parameter values of the option termination neural network for the respective option policy neural network using the task rewards comprises: generating a trajectory comprising a sequence of one or more actions selected by the respective option policy neural network selected by the option selection action, and corresponding observations and task rewards, which is a mental process. The claim recites updating the parameter values of the option termination neural network for the respective option policy neural network using the task rewards from the trajectory, which is a mental process.
Thus, the claim recites an abstract idea.
Step 2A Prong 2, Step 2B
The claim lacks additional elements that integrate it into a practical application or provide significantly more, so it is directed to an abstract idea and is ineligible.

Regarding Claim 13
Step 1
Regarding Claim 13, the rejection of Claim 12 is incorporated.
Step 2A Prong 1
The claim recites wherein updating the parameter values of the option termination neural network for the respective option policy neural network using the task rewards from the trajectory comprises back propagating gradients of an option termination objective function based on the task rewards from the trajectory through the respective option policy neural network and through the option termination neural network for the respective option policy neural network, which is a mental process.
Thus, the claim recites an abstract idea.
Step 2A Prong 2, Step 2B
The claim lacks additional elements that integrate it into a practical application or provide significantly more, so it is directed to an abstract idea and is ineligible.

Regarding Claim 14
Step 1
Regarding Claim 14, the rejection of Claim 1 is incorporated.
Step 2A Prong 1
The claim recites wherein the system is configured to train the manager neural network dependent on an estimated return comprising the expected task rewards from the environment when selecting manager actions according to current parameter values of the manager neural network and on a switching cost, which is a mental process.
Thus, the claim recites an abstract idea.
Step 2A Prong 2, Step 2B
The claim lacks additional elements that integrate it into a practical application or provide significantly more, so it is directed to an abstract idea and is ineligible.

Regarding Claim 15
Step 1
Regarding Claim 15, the rejection of Claim 14 is incorporated.
Step 2A Prong 1
The claim recites wherein the system is configured to train the manager neural network dependent ... on a switching cost (as recited in Claim 14), wherein the switching cost is configured to reduce the task reward or return used to update the parameter values of the manager neural network, which is a mental process.
Thus, the claim recites an abstract idea.
Step 2A Prong 2, Step 2B
The claim lacks additional elements that integrate it into a practical application or provide significantly more, so it is directed to an abstract idea and is ineligible.
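The switching cost limitation of Claims 14 and 15 can be illustrated with a one-line reward adjustment. This sketch assumes, hypothetically, that the cost is charged on each time step at which the manager makes a selection; the Office action states only that the cost reduces the task reward or return used to update the manager, so the timing, the function name, and the default cost value are assumptions.

```python
def manager_reward(task_reward, manager_decided, switching_cost=0.1):
    # Reduce the reward seen by the manager whenever it makes a new selection
    # (assumed timing); otherwise pass the task reward through unchanged.
    if manager_decided:
        return task_reward - switching_cost
    return task_reward
```

Under this assumption, a manager that hands control to an option for many steps pays the cost less often than one that selects a new action at every step.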

Regarding Claim 16
Step 1
Regarding Claim 16, the rejection of Claim 1 is incorporated.
Step 2A Prong 1
Claim 16 recites the mental processes recited by parent Claim 1.
Step 2A Prong 2, Step 2B
The additional element wherein the set of option policy neural networks comprises a set of option policy neural network heads on a shared option policy neural network body invokes a computer or other machinery merely as a tool to perform an existing process (see MPEP 2106.05(f), "apply it"). The additional element wherein the set of option reward neural networks comprises a set of option reward neural network heads on a shared option reward neural network body invokes a computer or other machinery merely as a tool to perform an existing process (see MPEP 2106.05(f), "apply it").
The claim lacks additional elements that integrate it into a practical application or provide significantly more, so it is directed to an abstract idea and is ineligible.
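The "heads on a shared body" arrangement recited in Claim 16 can be pictured as one feature extractor feeding several small output layers. The sketch below is a hypothetical pure-Python illustration; the function names, the single tanh layer, and the linear heads are assumptions, not details from the claim or the Office action.

```python
import math

def shared_body(obs, w_body):
    # Shared body: one tanh layer computed once per observation.
    return [math.tanh(sum(w * x for w, x in zip(row, obs))) for row in w_body]

def option_heads(features, w_heads):
    # One linear head per option; every head reads the same shared features.
    return [sum(w * f for w, f in zip(head, features)) for head in w_heads]
```

Here `w_heads` holds one weight vector per option, so adding an option adds a head without duplicating the body's parameters.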

Regarding Claim 18
Step 1
Regarding Claim 18, the rejection of canceled Claim 17 is not incorporated. For the purposes of examination, Claim 18 has been interpreted to depend on Claim 1, which recites a computer-implemented system, and to incorporate the rejection of Claim 1.
Step 2A Prong 1
The claim recites after the training, to select one or more further actions to be performed in the environment in response to one or more observations to receive one or more task rewards, which is a mental process.
Thus, the claim recites an abstract idea.
Step 2A Prong 2, Step 2B
The additional element wherein training the respective option reward neural network comprises using the selected option policy neural network invokes a computer or other machinery merely as a tool to perform an existing process (see MPEP 2106.05(f), "apply it"). The additional element training the respective option reward neural network using the task rewards received in response to the further actions invokes a computer or other machinery merely as a tool to perform an existing process (see MPEP 2106.05(f), "apply it").
The claim lacks additional elements that integrate it into a practical application or provide significantly more, so it is directed to an abstract idea and is ineligible.

Regarding Claim 19
Step 1
Regarding Claim 19, the rejection of canceled Claim 17 is not incorporated. For the purposes of examination, Claim 19 has been interpreted to depend on Claim 1, which recites a computer-implemented system, and to incorporate the rejection of Claim 1.
Step 2A Prong 1
The claim recites each providing an option termination value according to parameter values of the option termination neural network that determines whether the option termination criterion is met for the respective option policy neural network, which is a mental process. The claim recites fixing the parameter values of the option termination neural network during processing of the observations for the successive time steps by the selected option policy neural network, which is a mental process.
Thus, the claim recites an abstract idea.
Step 2A Prong 2, Step 2B
The additional element maintaining a set of option termination neural networks, one for each respective option policy neural network invokes a computer or other machinery merely as a tool to perform an existing process (see MPEP 2106.05(f), "apply it"). The additional element after processing of the observations for the successive time steps by the selected option policy neural network, training the respective option termination neural network using the task rewards invokes a computer or other machinery merely as a tool to perform an existing process (see MPEP 2106.05(f), "apply it").
The claim lacks additional elements that integrate it into a practical application or provide significantly more, so it is directed to an abstract idea and is ineligible.

Regarding Claim 23
The rejection of Claim 23 under 35 U.S.C. 101 is made in light of the deficiencies pointed out in the rejection of Claim 23 under 35 U.S.C. 112(d). The method of Claim 23 is interpreted to recite only those operations positively recited by Claim 1, upon which it depends.
Step 1
Claim 23 recites a method, and thus the claimed process falls within a statutory category of invention. 
Step 2A Prong 1
Claim 23 recites the abstract ideas recited by Claim 1, upon which it depends.
Step 2A Prong 2, Step 2B
The additional element one or more computers invokes a computer or other machinery merely as a tool to perform an existing process (see MPEP 2106.05(f), "apply it").
The claim lacks additional elements that integrate it into a practical application or provide significantly more, so it is directed to an abstract idea and is ineligible.

Regarding Claim 24
The rejection of Claim 24 under 35 U.S.C. 101 is made in light of the deficiencies pointed out in the rejection of Claim 24 under 35 U.S.C. 112(d). The media of Claim 24 are interpreted to recite only those operations positively recited by Claim 1, upon which it depends.
Step 1
Claim 24 recites one or more non-transitory computer storage media storing instructions, and thus the claimed manufacture falls within a statutory category of invention.
Step 2A Prong 1
Claim 24 recites the abstract ideas recited by Claim 1, upon which it depends.
Step 2A Prong 2, Step 2B
The additional element executed by one or more computers cause the one or more computers to perform operations invokes a computer or other machinery merely as a tool to perform an existing process (see MPEP 2106.05(f), "apply it").
The claim lacks additional elements that integrate it into a practical application or provide significantly more, so it is directed to an abstract idea and is ineligible.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1, 3, 4, 6, 16, 18, 23, and 24 are rejected under 35 U.S.C. 103 as being unpatentable over Florensa, et al., "Stochastic neural networks for hierarchical reinforcement learning" (hereinafter "Florensa") in view of Henderson, et al., "OptionGAN: Learning joint reward-policy options using generative adversarial inverse reinforcement learning" (hereinafter "Henderson").
Regarding Claim 1, Florensa teaches:
A computer-implemented (Florensa, p. 6, 6 Experiments: "we report our stronger results with the exact same setting in Appendix B. Our hyperparameters for the neural network architectures and algorithms are detailed in the Appendix A and the full code is available" and footnote 4: "Code available at: https://github.com/florensacc/snn4hrl") system for controlling an agent to perform a plurality of tasks (Florensa, p. 1, Abstract: "we propose a general framework that first learns useful skills in a pre-training environment, and then leverages the acquired skills for learning faster in downstream tasks," where Florensa's framework that learns corresponds to the instant agent) while interacting with an environment (Florensa, p. 4, 5.2 Stochastic Neural Networks For Skill Learning: "To learn several skills at the same time, we propose to use Stochastic Neural Networks (SNNs), a general class of neural networks with stochastic units in the computation graph. ... ¶ For our purpose, we use a simple class of SNNs, where latent variables with fixed distributions are integrated with the inputs to the neural network (here, the observations from the environment) to form a joint embedding"), wherein the system is configured to, at each of a plurality of time steps, process an input comprising an observation characterizing a current state of the environment (Florensa, p. 8, 5.4 Learning High-Level Policies: "Given a span of $K$ skills learned during the pre-training task ... . we leverage the provided skills by freezing them and training a high-level policy (Manager Neural Network) that operates by selecting a skill and committing to it for a fixed amount of steps $T$. ... For any given task $M \in \mathcal{M}$ we train a new Manager NN on top of the common skills. Given the factored representation of the state space $S^{M}$ as $S_{\mathrm{agent}}$ and $S^{M}_{\mathrm{rest}}$, the high-level policy receives the full state as input," where Florensa's state comprises observations, as in p. 4, 5.2 Stochastic Neural Networks For Skill Learning: "latent variables with fixed distributions are integrated with the inputs to the neural network (here, the observations from the environment)") to generate an output for selecting an action to be performed by the agent (Florensa, p. 5, 5.4 Learning High-Level Policies: "Given the factored representation of the state space $S^{M}$ as $S_{\mathrm{agent}}$ and
                            
                                    S
                                
                                    rest
                                
                                    M
                                
                    , the high-level policy receives the full state as input, and outputs the ... distribution from which we sample a discrete action"), and receive a task reward in response to the action (Florensa, p. 5, 5.4 Learning High-Level Policies: "Given a span of                         
                            K
                        
                     skills learned during the pre-training task, we now describe how to use them as basic building blocks for solving tasks where only sparse reward signals are provided"), the system comprising:
a manager neural network (Florensa, p. 5, 5.4 Learning High-Level Policies: "Instead of learning from scratch the low-level controls, we leverage the provided skills by freezing them and training a high-level policy (Manager Neural Network) that operates by selecting a skill and committing to it for a fixed amount of steps $T$"), and a set of option policy neural networks (Florensa, p. 4, Fig. 1(b), depicting an integration of multiple policy networks, each having input of unique latent variable $z^n$, and 5.2 Stochastic Neural Networks For Skill Learning: "we study using a simple bilinear integration, by forming the outer product between the observation and the latent variable (Fig. 1(b)). Note that ... the bilinear integration [effectively corresponds] to changing all the first hidden layer weights. ... Our bilinear integration already yields a large span of skills, hence no other type of SNNs is studied in this work") each for selecting a sequence of actions to be performed by the agent according to a respective option policy (Florensa, p. 3, 3 Preliminaries: "We define a discrete-time finite-horizon discounted Markov decision process (MDP) .... In policy search methods, we typically optimize a stochastic policy ... parametrized by $\theta$. The objective is to maximize its expected discounted return ... where $\tau = (s_0, a_0, \ldots)$ denotes the whole trajectory," where Florensa's trajectory of the optimized policy corresponds to the instant sequence of actions);
wherein the manager neural network is configured to, at a time step: process the observation and data identifying one of the tasks currently being performed by the agent, according to parameter values of the manager neural network, to generate an output for selecting a manager action from a set of manager actions (Florensa, p. 6, Fig. 2, depicting a trained manager neural network at a single time step with input $S^M$ and output $z$, and 5.4 Learning High-Level Policies: "For any given task $M \in \mathcal{M}$ we train a new Manager NN on top of the common skills. Given the factored representation of the state space ..., the high-level policy receives the full state as input, and outputs the parametrization of a categorical distribution from which we sample a discrete action $z$ out of $K$ possible choices, corresponding to the $K$ available skills. ... $z$ dictates the policy to use during the following $T$ time-steps. If the skills are encapsulated in a SNN, $z$ is used in place of the latent variable," where Florensa's sampled action $z$ corresponds to the generated output, and where Florensa's trained manager neural network is trained according to parameters by inherency under BRI), wherein the set of manager actions comprises possible actions that can be performed by the agent and a set of option selection actions, each option selection action selecting one of the option policy neural networks (Florensa, p. 5, 5.4 Learning High-Level Policies: "Given a span of $K$ skills learned during the pre-training task, we now describe how to use them as basic building blocks for solving tasks .... [W]e leverage the provided skills by ... training a high-level policy (Manager Neural Network) that operates by selecting a skill and committing to it for a fixed amount of steps $T$," where Florensa's $z$ corresponds to a possible action and distribution corresponds to the set of option selection actions, as in p. 6, 5.4 Learning High-Level Policies: "the high-level policy receives the full state as input, and outputs the parametrization of a categorical distribution from which we sample a discrete action $z$ out of $K$ possible choices, corresponding to the $K$ available skills");
wherein each option policy neural network is configured to (Florensa, p. 4, 5.2 Stochastic Neural Networks For Skill Learning: "we study using a simple bilinear integration, by forming the outer product between the observation and the latent variable (Fig. 1(b)). Note that ... the bilinear integration [effectively corresponds] to changing all the first hidden layer weights," where Florensa's first hidden layer weights correspond to the instant policy network configuration), at each of a succession of time steps: process the observation for the time step (Florensa, p. 6, 5.4 Learning High-Level Policies: "If those skills are independently trained uni-modal policies, $z$ dictates the policy to use during the following $T$ time-steps"), according to an option policy defined by parameter values of the option policy neural network, to generate an output for selecting an action to be performed by the agent (Florensa, p. 4, 5.2 Stochastic Neural Networks For Skill Learning: "we use a simple class of SNNs, where latent variables with fixed distributions are integrated with the inputs to the neural network (here, the observations from the environment) to form a joint embedding, which is then fed to a standard feed-forward neural network (FNN) with deterministic units, that computes distribution parameters for a uni-modal distribution (e.g. the mean and variance parameters of a multivariate Gaussian)" and p. 10, 8 Discussion And Future Work: "we only used feedforward architectures and hence the decision of what skill to use next only depends on the observation at the moment of switching, not using any sensory information gathered while the previous skill was active");
wherein, when the selected manager action is an option selection action, the option policy neural network selected by the manager action generates the output for selecting an action for successive time steps (Florensa, p. 6, 5.4 Learning High-Level Policies: "For any given task $M \in \mathcal{M}$ we train a new Manager NN on top of the common skills. Given the factored representation of the state space $S^M$ as $S_{agent}$ and $S^M_{rest}$, the high-level policy receives the full state as input, and outputs the parametrization of a categorical distribution from which we sample a discrete action $z$ out of $K$ possible choices, corresponding to the $K$ available skills") until an option termination criterion is met (Florensa, p. 6, 5.4 Learning High-Level Policies: "If those skills are independently trained uni-modal policies, $z$ dictates the policy to use during the following $T$ time-steps," where Florensa's fixed number of time steps corresponds to the instant termination criterion), and when the selected manager action is one of the possible actions that can be performed by the agent the output for selecting the action is the selected manager action (Florensa, p. 6, 5.4 Learning High-Level Policies: "For any given task $M \in \mathcal{M}$ we train a new Manager NN on top of the common skills. ... [T]he high-level policy receives the full state as input, and outputs the parametrization of a categorical distribution from which we sample a discrete action $z$ out of $K$ possible choices," where Florensa's sampled action $z$ of $K$ possible actions corresponds to the instant manager action); and
... option reward (Florensa, p. 4, 5.1 Constructing the pre-training environment: "we use a generic proxy reward as the only reward signal to guide skill learning. The design of the proxy reward should encourage the existence of locally optimal solutions, which will correspond to different skills the agent should learn. In other words, it encodes the prior knowledge about what high level behaviors might be useful in the downstream tasks, rewarding all of them roughly equally") ... configured to, for a time step: process the observation ... to generate an option reward ... (Florensa, p. 3, 3 Preliminaries: "We define a discrete-time finite-horizon discounted Markov decision process (MDP) ... in which ... $r : S \times A \to [-R_{\max}, R_{\max}]$ a bounded reward function .... The objective is to maximize its expected discounted return, $\eta(\pi_\theta) = \mathbb{E}_\tau\left[\sum_{t=0}^{T} \gamma^t r(s_t, a_t)\right]$, where $\tau = (s_0, a_0, \ldots)$ denotes the whole trajectory") ... according to parameter values (Florensa, p. 5, 5.3 Information-Theoretic Regularization: "To penalize [entropy function] $H(Z \mid C)$ ... we modify the reward received at every step as specified in Eq. (1), where $\hat{p}$ ... is an estimate of the posterior probability of the latent code $z^n$ sampled on rollout $n$, given the coordinates $c_t^n$ at time $t$ of that rollout," where Florensa's Eq. (1) defines reward parameters);
wherein the system is configured to train the ... option reward ... and the manager neural network using the task rewards, and to train each of the option policy neural networks using the option reward for the respective option policy neural network (Florensa, p. 5, 5.4 Learning High-Level Policies: "Given a span of $K$ skills learned during the pre-training task, we now describe how to use them as basic building blocks for solving tasks where only sparse reward signals are provided. Instead of learning from scratch the low-level controls, we leverage the provided skills by freezing them and training a high-level policy (Manager Neural Network) that operates by selecting a skill and committing to it for a fixed amount of steps $T$").
Florensa teaches a system for controlling an agent, comprising an option reward configured to, for a time step: process an observation to generate an option reward according to parameter values.
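As a reading aid, the manager-and-skill arrangement cited from Florensa, 5.4 Learning High-Level Policies, might be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Florensa's code and not the claimed system: the function name `run_hierarchical_episode`, the callables `manager_probs`, `skills`, and `env_step`, and the toy environment are all hypothetical. It shows only the cited behavior of sampling a discrete skill index z out of K choices and committing to that skill for a fixed number of steps T.

```python
import random


def run_hierarchical_episode(manager_probs, skills, env_step, initial_obs,
                             commit_steps, horizon, rng):
    """Roll out one episode with a manager that picks a skill every T steps.

    manager_probs(obs) -> list of K probabilities (hypothetical manager output)
    skills[z](obs)     -> low-level action from frozen skill z (hypothetical)
    env_step(obs, a)   -> (next_obs, task_reward) (hypothetical environment)
    """
    obs = initial_obs
    trace = []  # entries are (time step, skill index z, action, task reward)
    t = 0
    while t < horizon:
        # Manager step: sample a discrete skill index z from a categorical
        # distribution over the K available skills.
        probs = manager_probs(obs)
        z = rng.choices(range(len(skills)), weights=probs, k=1)[0]
        # Commit to skill z for up to T = commit_steps time steps.
        for _ in range(commit_steps):
            if t >= horizon:
                break
            action = skills[z](obs)
            obs, reward = env_step(obs, action)
            trace.append((t, z, action, reward))
            t += 1
    return trace


if __name__ == "__main__":
    # Toy setup (assumption): two skills that step left or right on a line,
    # with a sparse task reward once the position reaches 3.
    toy_skills = [lambda obs: -1, lambda obs: 1]
    toy_env = lambda obs, a: (obs + a, 1.0 if obs + a >= 3 else 0.0)
    uniform_manager = lambda obs: [0.5, 0.5]
    for row in run_hierarchical_episode(uniform_manager, toy_skills, toy_env,
                                        0, 4, 10, random.Random(0)):
        print(row)
```

Note that in this sketch every manager output is a skill index; the sketch has no branch in which the manager emits a primitive agent action directly.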
Florensa may not explicitly teach a set of option reward neural networks, one for each respective option policy neural network, each configured to ... process the observation, according to parameter values of the option reward neural network, to generate an option reward for the respective option policy neural network for a time step.
However, Henderson teaches:
a set of option reward neural networks, one for each respective option policy neural network, each configured to ... (Henderson, p. 3201, Reward-Policy Options Framework: "we extend the options framework for decomposing rewards as well as policies. ... In this case, an option is formulated by a tuple: $(I_\omega, \pi_\omega, \beta_\omega, r_\omega)$. Here, $r_\omega$ is a reward option from which a corresponding intra-option policy $\pi_\omega$ is derived. That is, each policy option is optimized with respect to its own local reward option. The policy-over-options not only chooses the intra-option policy, but the reward option as well: $\pi_\Omega \to (r_\omega, \pi_\omega)$"): process the observation, according to parameter values of the option reward neural network, to generate an option reward for the respective option policy neural network (Henderson, p. 3202, Mixture-of-Experts as Options: "As we can see in Eq. 2, we formulate our discriminator loss in the same way, using each reward option and the policy-over-options as the experts and gating function respectively. This ensures that the policy-over-options specializes over the state space and converges to a deterministic selection of experts," where Henderson's state corresponds to the instant observation) ... for a time step (Henderson, p. 3200, Preliminaries and Notation, The Options framework: "we instead simplify to one-step options, where $\beta_\omega(s) = 1$ .... we find that our options still converge to temporally extended and interpretable actions," where Henderson's one-step option corresponds to the instant per time step).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Florensa regarding a system for controlling an agent, comprising an option reward configured to, for a time step, process an observation to generate an option reward according to parameter values with those of Henderson regarding a set of option reward neural networks, one for each respective option policy neural network, each configured to process the observation, according to parameter values of the option reward neural network, to generate an option reward for the respective option policy neural network for a time step.
The motivation to do so would be to facilitate learning multiple underlying reward functions corresponding to option policies while training policies (Henderson, p. 3201, Reward-Policy Options Framework: "Based on the need to infer a decomposition of underlying reward functions from a wide range of expert demonstrations in one-shot transfer learning, we extend the options framework for decomposing rewards as well as policies. In this way, intra-option policies, decomposed rewards, and the policy-over-options can all be learned in concert in a cohesive framework").
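For orientation, the pairing described in the quoted Henderson passage, in which each option carries its own reward $r_\omega$ alongside its policy $\pi_\omega$, might be sketched as follows. This is only an illustrative sketch: the class `RewardPolicyOption`, the helper `update_selected_option`, the linear stand-ins for neural networks, and the reward-weighted update rule are all assumptions introduced here. The update is a placeholder and is neither Henderson's adversarial training procedure nor the training recited in the claims.

```python
from dataclasses import dataclass
from typing import List


def dot(weights: List[float], obs: List[float]) -> float:
    """Inner product, used as a linear stand-in for a neural network."""
    return sum(w * x for w, x in zip(weights, obs))


@dataclass
class RewardPolicyOption:
    """One option holding its own policy and its own reward function."""
    policy_weights: List[float]  # stand-in for the option policy network
    reward_weights: List[float]  # stand-in for the option reward network

    def act(self, obs: List[float]) -> float:
        return dot(self.policy_weights, obs)

    def option_reward(self, obs: List[float]) -> float:
        return dot(self.reward_weights, obs)


def update_selected_option(options: List[RewardPolicyOption], selected: int,
                           obs: List[float], lr: float) -> float:
    """Update only the selected option's policy, using only its own reward.

    The reward-weighted step below is a placeholder update chosen for
    brevity (an assumption), not a method taken from either reference.
    """
    option = options[selected]
    reward = option.option_reward(obs)
    option.policy_weights = [
        w + lr * reward * x for w, x in zip(option.policy_weights, obs)
    ]
    return reward


if __name__ == "__main__":
    opts = [
        RewardPolicyOption([0.0, 0.0], [1.0, 0.0]),
        RewardPolicyOption([0.0, 0.0], [0.0, 1.0]),
    ]
    update_selected_option(opts, 0, [2.0, 3.0], 0.1)
    print(opts)
```

The point of the sketch is the data layout: one reward function per option policy, with the unselected options left untouched by the update.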

Regarding Claim 23, Florensa teaches:
a method performed by one or more computers (Florensa, p. 5, 5.3 Information-Theoretic Regularization: "Given that we use a batch policy optimization method, we use all trajectories of the current batch"), the method comprising operations performed by the system of claim 1. Claim 23 is rejected under the same rationale as Claim 1. The rejection of Claim 23 under 35 U.S.C. 103 is made in light of the deficiencies pointed out in the rejection of Claim 23 under 35 U.S.C. 112(d). The method of Claim 23 is interpreted to recite only those operations positively recited by Claim 1.

Regarding Claim 24, Florensa teaches:
one or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations of the system of claim 1 (Florensa, p. 7, 6 Experiments: "Our hyperparameters for the neural network architectures and algorithms are detailed in the Appendix A and the full code is available [at: https://github.com/florensacc/snn4hrl]" and repository file README.md: "To reproduce the results, you should first have rllab and Mujoco v1.31 configured. Then, run the following commands in the root folder of rllab.... ¶ Then you can ... [t]rain a hierarchical policy on top of that SNN via python," where a non-transitory computer storage media storing instructions is inherent in training by way of the source). Claim 24 is rejected under the same rationale as Claim 1. The rejection of Claim 24 under 35 U.S.C. 103 is made in light of the deficiencies pointed out in the rejection of Claim 24 under 35 U.S.C. 112(d). The method of Claim 24 is interpreted to recite only those operations positively recited by Claim 1.

Regarding Claim 3, the rejection of Claim 1 is incorporated.
The Florensa/Henderson combination teaches:
wherein the system is configured to train the set of option reward neural networks (Florensa, p. 6, Algorithm 1: Skill training for SNNs with MI bonus, line 5, "Collect rollout with $z^n$ fixed" and line 8, "Modify $R_t^n \leftarrow R_t^n + \ldots$," where Florensa's $z^n$ corresponds to a selected task and $R_t^n$ corresponds to a reward updated per the preceding task) and the manager neural network using the task rewards (Florensa, p. 2, 1 Introduction: "Our experiments find that our hierarchical policy-learning framework can learn a wide range of skills, which are clearly interpretable as the latent code is varied. Furthermore, we show that training high-level policies on top of the learned skills can results in strong performance on a set of challenging tasks with long horizons and sparse rewards," where Florensa refers to the "high-level policy" elsewhere as a manager neural network), and 
to train each of the option policy neural networks using the option reward for the respective option policy neural network by, after the option selection action and for a succession of time steps until the termination criterion is met: ... updating the parameter values of the respective option policy neural network selected by the option selection action using the option reward for the respective option policy neural network (Florensa, p. 6, Algorithm 1: Skill training for SNNs with MI bonus, line 5, "Collect rollout with $z^n$ fixed," which indicates selection of a single policy/task, and "we add an additional reward bonus .... As entropy is a measure of uncertainty, another interpretation of this bonus is that, given where the robot is, it should be easy to infer which skill the robot is currently performing. To penalize ... we modify the reward received at every step as specified in Eq. (1), where $\hat{p}$ ... is an estimate of the posterior probability of the latent code $z^n$ sampled on rollout $n$, given the coordinates $c_t^n$ at time $t$ of that rollout," where Florensa's reward modified with bonus and rollout correspond to the instant option policy reward and until termination, respectively, and where Florensa indicates that return is optimized according to a reward calculated on a per-step basis, as in p. 3, 3 Preliminaries: "The objective is to maximize its expected discounted return,
                            η
                            
                                            π
                                        
                                            θ
                                        
                            =
                            
                                    E
                                
                                    τ
                                
                                            ∑
                                            
                                                t
                                                =
                                                0
                                            
                                                T
                                            
                                                    γ
                                                
                                                    t
                                                
                                            r
                                            
                                                            s
                                                        
                                                            t
                                                        
                                                    ,
                                                    
                                                            a
                                                        
                                                            t
                                                        
                    ");
after the termination criterion is met: updating the parameter values of the option reward neural network for the respective option policy neural network using the task rewards (Florensa, p. 6, Algorithm 1: Skill training for SNNs with MI bonus, line 8, "Modify $R_t^n \leftarrow R_t^n + \ldots$," where Florensa's $z_n$ is a parameter of reward function $R_t^n$).
Henderson further teaches:
after the option selection action and for a succession of time steps until the termination criterion is met: updating the parameter values of the manager neural network using the task rewards (Henderson, p. 3202, Algorithm 2, OptionGAN, line 4, "Update ... policy-over-options parameters $\zeta$," where Henderson's policy over options corresponds to the manager, and p. 3201, Learning Joint Reward-Policy Options: "we reformulate our discriminator loss as a weighted mixture of completely specialized experts in Eq. 2. This allows us to update the parameters of the policy-over-options," where Eq. 2 has a term based on Eq. 7, which is calculated based on reward outputs per option policy).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the Florensa/Henderson combination regarding processing the observation and data identifying one of the tasks currently being performed by the agent according to parameter values of the manager neural network with the further teachings of Henderson regarding after the option selection action and for a succession of time steps until the termination criterion is met: updating the parameter values of the manager neural network using the task rewards. 
The motivation to do so would be to facilitate training the manager network to select an option policy for a task in a more deterministic manner (Henderson, p. 3202, Mixture-of-Experts as Options: "To ensure that our MoE formulation converges to options in the optimal case, we must properly formulate our loss function such that the gating function specializes over experts. ... ¶ This can intuitively be interpreted as encouraging the gating function to increase the likelihood of choosing an expert when its loss is less than the average loss of all the experts. The gating function will thus move toward deterministic selection of experts").

Regarding Claim 4, the rejection of Claim 3 is incorporated.
The Florensa/Henderson combination teaches:
wherein updating the parameter values of the option reward neural network for the respective option policy neural network using the task rewards comprises: generating a trajectory comprising a sequence of one or more actions ... and corresponding observations and task rewards (Florensa, p. 3, 3 Preliminaries: "We define a discrete-time finite-horizon discounted Markov decision process (MDP) ... in which ... $r: S \times A \to [-R_{\max}, R_{\max}]$ a bounded reward function .... The objective is to maximize its expected discounted return, $\eta(\pi_\theta) = \mathbb{E}_\tau\left[\sum_{t=0}^{T} \gamma^t r(s_t, a_t)\right]$, where $\tau = (s_0, a_0, \ldots)$ denotes the whole trajectory") ... selected by the respective option policy neural network selected by the option selection action (Florensa, p. 6, 5.4 Learning High-Level Policies: "For any given task $M \in \mathcal{M}$ we train a new Manager NN on top of the common skills. Given the factored representation of the state space $S^M$ as $S_{\text{agent}}$ and $S_{\text{rest}}^{M}$, the high-level policy receives the full state as input, and outputs the parametrization of a categorical distribution from which we sample a discrete action $z$ out of $K$ possible choices, corresponding to the K available skills").
Henderson further teaches:
updating the parameter values of the option reward neural network for the respective option policy neural network using the task rewards from the trajectory (Henderson, p. 3202, Algorithm 2: OptionGAN, lines 3 and 4, "Sample trajectories $\tau_N \sim \pi_{\Theta_i}$" and "Update discriminator parameters $\hat{\theta}$, $\omega$ and policy-over-options parameters $\zeta$" where Henderson's reward parameters are, p. 3201, Learning Joint Reward-Policy Options: "The reward for a given state is composed as: $R_{\Omega,\hat{\Theta}}(s)$ ... where ... $\hat{\theta} \in \hat{\Theta}$ are the parameters of the ... reward options").
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the Florensa/Henderson combination regarding generating a trajectory comprising a sequence of one or more actions and corresponding observations and task rewards with the further teachings of Henderson regarding updating the parameter values of the option reward neural network for the respective option policy neural network using the task rewards from the trajectory. 
The motivation to do so would be to facilitate a learning scenario supporting end-to-end training of specialized experts (Henderson, p. 3201, Learning Joint Reward-Policy Options: "The use of one-step options allows us to learn a policy-over-options in an end-to-end fashion as a Mixture-of-Experts formulation. In the one-step case, selecting an option ... using the policy-over-options ... can be viewed as a mixture of completely specialized experts").

Regarding Claim 6, the rejection of Claim 3 is incorporated.
The Florensa/Henderson combination teaches:
wherein updating one or more of the parameter values of the manager neural network, the parameter values of the respective option policy neural network, and the parameter values of the option reward neural network, comprises updating based on an n-step return (Florensa, p. 3, 3 Preliminaries: "We define a discrete-time finite-horizon discounted Markov decision process (MDP) ... in which ... $r: S \times A \to [-R_{\max}, R_{\max}]$ a bounded reward function .... The objective is to maximize its expected discounted return, $\eta(\pi_\theta) = \mathbb{E}_\tau\left[\sum_{t=0}^{T} \gamma^t r(s_t, a_t)\right]$, where $\tau = (s_0, a_0, \ldots)$ denotes the whole trajectory," where Florensa's expected value of sum from 0 to T corresponds to the instant n-step return).
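For reference, the finite-horizon discounted return quoted above from Florensa's Preliminaries can be computed from a list of per-step rewards. The following is a minimal illustrative sketch in Python, not code from Florensa or from the application; the function name `discounted_return` and the sample reward values are assumptions made only for illustration, and the expectation over trajectories is omitted (a single trajectory is evaluated).

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * rewards[t] for t = 0..T over one trajectory.

    rewards: per-step rewards r(s_t, a_t), listed in time order.
    gamma:   discount factor.
    """
    total = 0.0
    # Accumulate from the last step backwards: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Three steps with reward 1.0 each and gamma = 0.5:
# 1.0 + 0.5 * 1.0 + 0.25 * 1.0 = 1.75
print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1.75
```

The backward accumulation avoids recomputing powers of the discount factor and gives the same value as the forward sum in the quoted formula.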

Regarding Claim 16, the rejection of Claim 1 is incorporated.
The Florensa/Henderson combination teaches:
wherein the set of option policy neural networks comprises a set of option policy neural network heads on a shared option policy neural network body (Florensa, p. 4, Fig. 1(b), depicting an integration of multiple policy networks as heads of a single network, each having input of unique latent variable $z_n$, and 5.2 Stochastic Neural Networks For Skill Learning: "we study using a simple bilinear integration, by forming the outer product between the observation and the latent variable (Fig. 1(b)). Note that ... the bilinear integration [effectively corresponds] to changing all the first hidden layer weights. ... Our bilinear integration already yields a large span of skills, hence no other type of SNNs is studied in this work").
Henderson has been shown to teach:
wherein the set of option reward neural networks comprises a set of option reward neural network heads on a shared option reward neural network body (as recited in the rejection of Claim 1, Henderson, p. 3202, Mixture-of-Experts as Options: "As we can see in Eq. 2, we formulate our discriminator loss in the same way, using each reward option and the policy-over-options as the experts and gating function respectively. This ensures that the policy-over-options specializes over the state space and converges to a deterministic selection of experts," where Henderson's policy-over-options corresponds to the instant shared body).

Regarding Claim 18, the method of canceled Claim 17 is not incorporated. For the purposes of examination, Claim 18 is interpreted to depend on and further limit the system of Claim 1.
The Florensa/Henderson combination teaches:
wherein training the respective option reward neural network comprises using the selected option policy neural network, after the training, to select one or more further actions to be performed in the environment (Florensa, p. 5, 5.4 Learning High-Level Policies: "Given a span of $K$ skills learned during the pre-training task .... we leverage the provided skills by freezing them and training a high-level policy (Manager Neural Network) that operates by selecting a skill and committing to it for a fixed amount of steps.... ¶ we show in our experiments that frozen low-level policies are already sufficient to achieve good performance in the studied downstream tasks") in response to one or more observations to receive one or more task rewards (Florensa, p. 6, 6 Experiments: "Here we report the results using the Swimmer robot, also described in the benchmark paper. In fact, the swimmer locomotion task described therein corresponds exactly to our pretrain task, as we also solely reward speed in a plain environment"), and
training the respective option reward neural network using the task rewards received in response to the further actions (Florensa, p. 6, Algorithm 1: Skill training for SNNs with MI bonus, line 5, "Collect rollout with $z_n$ fixed" and line 8, "Modify $R_t^n \leftarrow R_t^n + \ldots$," where Florensa's $z_n$ corresponds to a selected task and $R_t^n$ corresponds to a reward updated per the preceding task).
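The selection scheme quoted above from Florensa, 5.4 (a high-level policy samples one of K frozen skills from a categorical distribution and commits to it for a fixed number of steps) might be sketched as follows. This is a hedged illustration only: the function `run_with_frozen_skills`, its parameters, and the toy skills are all hypothetical, and the manager is stood in for by a fixed probability vector rather than a trained neural network that receives the state.

```python
import random


def run_with_frozen_skills(skill_probs, skills, horizon, commit_steps, seed=0):
    """Sample a skill index z, hold it for commit_steps steps, then resample.

    skill_probs:  assumed stand-in for the manager's categorical output.
    skills:       list of K frozen callables mapping a time step to an action.
    horizon:      total number of environment steps to generate.
    commit_steps: number of steps the chosen skill is kept before resampling.
    """
    rng = random.Random(seed)
    trace = []  # (time step, skill index, action) triples
    t = 0
    while t < horizon:
        # Draw one discrete choice z out of K = len(skills) options.
        z = rng.choices(range(len(skills)), weights=skill_probs, k=1)[0]
        # Commit to skill z; the skill itself is not updated (it is frozen).
        for _ in range(min(commit_steps, horizon - t)):
            trace.append((t, z, skills[z](t)))
            t += 1
    return trace


# Two toy skills (assumed for illustration): always act +1 or always act -1.
skills = [lambda t: +1, lambda t: -1]
trace = run_with_frozen_skills([0.5, 0.5], skills, horizon=10, commit_steps=5)
```

In this sketch only the choice of z varies between commitment windows; training the manager from task rewards is outside its scope.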

Claims 2, 5, 7, and 8 are rejected under 35 U.S.C. 103 as being unpatentable over Florensa, et al., "Stochastic neural networks for hierarchical reinforcement learning" (hereinafter "Florensa") in view of Henderson, et al., "OptionGAN: Learning joint reward-policy options using generative adversarial inverse reinforcement learning" (hereinafter "Henderson") in view of Xu, et al., "Meta-gradient reinforcement learning" (hereinafter "Xu").
Regarding Claim 2, the rejection of Claim 1 is incorporated.
Henderson has been shown to teach:
wherein the system is configured to train each option reward neural network using the task reward in a ... training technique ... to optimize a return from the environment ... (as recited in the rejection of Claim 1, Henderson, p. 3201, Reward-Policy Options Framework: "an option is formulated by a tuple: $\langle I_\omega, \pi_\omega, \beta_\omega, r_\omega \rangle$. Here, $r_\omega$ is a reward option from which a corresponding intra-option policy $\pi_\omega$ is derived. That is, each policy option is optimized with respect to its own local reward option. The policy-over-options not only chooses the intra-option policy, but the reward option as well: $\pi_\Omega \to (r_\omega, \pi_\omega)$") in which parameter values of the option reward neural network are adjusted based on the agent's interaction with the environment under control of the respective option policy neural network (as recited in the rejection of Claim 1, Henderson, p. 3202, Mixture-of-Experts as Options: "As we can see in Eq. 2, we formulate our discriminator loss in the same way, using each reward option and the policy-over-options as the experts and gating function respectively. This ensures that the policy-over-options specializes over the state space and converges to a deterministic selection of experts," where Henderson's state corresponds to the instant observation).
The Florensa/Henderson combination teaches using the task reward in a training technique.
The Florensa/Henderson combination does not explicitly teach using the task reward in a meta-gradient training technique.
However, Xu teaches:
using the task reward in a meta-gradient training technique (Xu, p. 3, 1.1 Applying Meta-Gradients to Returns: "we view the return $g$ as a function parameterised by meta-parameters $\eta$, which may be differentiated to understand its dependence on $\eta$. This in turn allows us to compute the gradient $\partial f / \partial \eta$ of the update function with respect to the meta-parameters $\eta$, and hence the meta-gradient $\partial J'(\tau', \theta', \eta') / \partial \eta$," where Xu's return as a cumulative reward corresponds to the instant reward).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the Florensa/Henderson combination regarding the system being configured to train each option reward neural network using the task reward in a training technique to optimize a return from the environment with those of Xu regarding using the task reward in a meta-gradient training technique.
The motivation to do so would be to facilitate a reinforcement learning where learning parameters are adjusted during training based on model performance (Xu, p. 3, 1.1 Applying Meta-Gradients to Returns: "A typical RL algorithm would hand-select the meta-parameters, such as the discount factor $\gamma$ and bootstrapping parameter $\lambda$, and these would be held fixed throughout training. ... In essence, our agent asks itself the question, 'which return results in the best performance?', and adjusts its meta-parameters accordingly").
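The chain of dependence described in the quoted Xu passage (the return depends on a meta-parameter, the parameter update depends on the return, and a later objective depends on the updated parameters) can be illustrated with a toy scalar example. The sketch below is an assumption-laden illustration and not Xu's implementation: the "value estimate" is a single number `theta`, the meta-parameter `eta` is taken to be the discount factor, both objectives are assumed to be squared errors, and all function names and numeric values are hypothetical.

```python
def discounted_return(rewards, discount):
    """Return g = sum over t of discount**t * rewards[t]."""
    return sum((discount ** t) * r for t, r in enumerate(rewards))


def updated_theta(theta, eta, rewards, step_size):
    """One gradient step on 0.5 * (g_eta - theta)**2 with respect to theta."""
    return theta + step_size * (discounted_return(rewards, eta) - theta)


def meta_gradient(theta, eta, rewards, next_rewards, eta_ref, step_size):
    """Gradient of a later objective J' with respect to eta, by the chain rule.

    J' = 0.5 * (g_ref - theta_new)**2, where g_ref is computed on a later
    sample (next_rewards) with a fixed reference meta-parameter eta_ref.
    """
    theta_new = updated_theta(theta, eta, rewards, step_size)
    # d(theta_new)/d(eta) = step_size * d(g_eta)/d(eta)
    dg_deta = sum(t * (eta ** (t - 1)) * r
                  for t, r in enumerate(rewards) if t > 0)
    dtheta_deta = step_size * dg_deta
    # d(J')/d(theta_new) = -(g_ref - theta_new)
    g_ref = discounted_return(next_rewards, eta_ref)
    dj_dtheta = -(g_ref - theta_new)
    return dj_dtheta * dtheta_deta


grad = meta_gradient(theta=0.0, eta=0.5, rewards=[1.0, 1.0, 1.0],
                     next_rewards=[1.0, 2.0], eta_ref=0.9, step_size=0.1)
# A descent step on J' would then adjust eta, e.g. eta - 0.01 * grad (assumed rate).
```

The derivative is written out by hand here so the two factors of the chain rule stay visible; an automatic-differentiation library would compute the same quantity.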

Regarding Claim 5, the rejection of Claim 4 is incorporated.
The Florensa/Henderson combination teaches updating the parameter values of the option reward neural network for the respective option policy neural network using the task rewards from the trajectory.
The Florensa/Henderson combination does not explicitly teach back propagating gradients of an option reward objective function based on the task rewards from the trajectory through the respective option policy neural network and through the option reward neural network for the respective option policy neural network.
However, Xu teaches:
wherein updating the parameter values of the option reward neural network for the respective option policy neural network using the task rewards from the trajectory comprises: back propagating gradients of an option reward objective function (Xu, p. 5, 1.4 Conditioned Value and Policy Functions: "To deal with non-stationarity in the value function and policy, we utilise an idea similar to universal value function approximation .... ¶ The key idea is to provide the metaparameters $\eta$ as an additional input to condition the value function and policy.... $W_\eta$ is the embedding matrix (or row vector, for scalar $\eta$) that is updated by backpropagation during training," where Xu's value function as cumulative reward corresponds to the instant reward objective) based on the task rewards from the trajectory through the respective option policy neural network and through the option reward neural network for the respective option policy neural network (Xu, p. 2, 1 Meta-Gradient Reinforcement Learning Algorithms: "At the core of the algorithm is an update function ... that adjusts parameters from a sequence of experience $\tau_t = \{S_t, A_t, R_{t+1}, \ldots\}$ consisting of states $S$, actions $A$ and rewards $R$. The nature of the function is determined by meta-parameters $\eta$").
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the Florensa/Henderson combination regarding updating the parameter values of the option reward neural network for the respective option policy neural network using the task rewards from the trajectory with those of Xu regarding back propagating gradients of an option reward objective function based on the task rewards from the trajectory through the respective option policy neural network and through the option reward neural network for the respective option policy neural network.
The motivation to do so would be to facilitate adaptation in training scenarios where the approximation of returns becomes inaccurate (Xu, p. 5, 1.4 Conditioned Value and Policy Functions: "there is a danger that the value function $v_\theta$ becomes inaccurate, since it may be approximating old returns. ¶ ... In this way, the agent explicitly learns value functions and policies that are appropriate for various $\eta$. The approximation problem becomes a little harder, but the payoff is that the algorithm can freely shift the meta-parameters without needing to wait for the approximator to 'catch up'").

Regarding Claim 7, the rejection of Claim 3 is incorporated.
The Florensa/Henderson combination teaches:
wherein the manager objective function (Florensa, p. 3, 3 Preliminaries: "In policy search methods, we typically optimize a stochastic policy ... parametrized by $\theta$. The objective is to maximize its expected discounted return, $\eta(\pi_\theta) = E_\tau\left[\sum_{t=0}^{T} \gamma^t r(s_t, a_t)\right]$, where $\tau = (s_0, a_0, \ldots)$ denotes the whole trajectory," where Florensa's expected discounted return corresponds to the instant manager objective) and option policy objective function each comprise a respective reinforcement learning objective function (Florensa, p. 5, 5.3 Information-Theoretic Regularization: "It is desirable to have direct control over the diversity of skills that will be learned. To achieve this, we introduce an information-theoretic regularizer, inspired by recent success of similar objectives in encouraging interpretable representation learning in InfoGAN," where Florensa's regularizer corresponds to the instant option policy comprised objective function).
The Florensa/Henderson combination teaches updating the parameter values of the manager neural network using the task rewards and updating the parameter values of the option reward neural network.
The Florensa/Henderson combination does not explicitly teach wherein updating the parameter values of the manager neural network using the task rewards comprises backpropagating gradients of a manager objective function and wherein updating the parameter values of the respective option policy neural network comprises backpropagating gradients of an option policy objective function.
However, Xu teaches:
wherein updating the parameter values of the manager neural network using the task rewards comprises backpropagating gradients of a manager objective function, wherein updating the parameter values of the respective option policy neural network comprises backpropagating gradients of an option policy objective function (Xu, p. 5, 1.4 Conditioned Value and Policy Functions: "To deal with non-stationarity in the value function and policy, we utilise an idea similar to universal value function approximation .... The key idea is to provide the meta-parameters $\eta$ as an additional input to condition the value function and policy, as follows: $v_\theta^\eta(S) = v_\theta([S; e_\eta])$, $\pi_\theta^\eta(S) = \pi_\theta([S; e_\eta])$, $e_\eta = W_\eta \eta$ ... ¶... The key idea is to provide the meta-parameters $\eta$ as an additional input to condition the value function and policy.... $W_\eta$ is the embedding matrix (or row vector, for scalar $\eta$) that is updated by backpropagation during training," where Xu's value function and policy function correspond to the instant manager network and option policy network, respectively, and $\eta$ corresponds to the instant parameters, and where Xu reasonably suggests a gradient of the meta-parameter for update, as in p. 3, 1.2 Meta-Gradient Prediction: "The key idea of the meta-gradient prediction algorithm is to adjust meta-parameters in the direction that achieves the best predictive accuracy").
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the Florensa/Henderson combination regarding updating the parameter values of the manager neural network using the task rewards and updating the parameter values of the option reward neural network with those of Xu regarding wherein updating the parameter values of the manager neural network using the task rewards comprises backpropagating gradients of a manager objective function and wherein updating the parameter values of the respective option policy neural network comprises backpropagating gradients of an option policy objective function.
The motivation to do so would be to facilitate training under circumstances where the return function is not stationary (Xu, p. 4, 1.4 Conditioned Value and Policy Functions: "One complication of the approach outlined above is that the return function $g_\eta(\tau)$ is non-stationary, adapting along with the meta-parameters throughout the training process. As a result, there is a danger that the value function $v_\theta$ becomes inaccurate, since it may be approximating old returns. ... ¶ To deal with non-stationarity in the value function and policy, we utilise an idea similar to universal value function approximation").
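
For readers unfamiliar with the conditioning quoted from Xu above, the following is a minimal sketch of the idea, not Xu's implementation. It assumes a linear value head and NumPy arrays; the function names, the weight vector `w_value`, and the toy numbers are all hypothetical and chosen only to show how the embedded meta-parameters are concatenated with the state before the value function is evaluated.

```python
import numpy as np

def embed_meta_parameters(W_eta, eta):
    """Embed the meta-parameters: e_eta = W_eta @ eta."""
    return W_eta @ eta

def conditioned_value(state, eta, W_eta, w_value):
    """Evaluate a value function on the concatenated input [S; e_eta].

    The linear head w_value is an assumption made for illustration;
    the quoted passage does not specify the network architecture.
    """
    e_eta = embed_meta_parameters(W_eta, eta)
    features = np.concatenate([state, e_eta])
    return float(w_value @ features)

# Toy example with hypothetical numbers.
state = np.array([1.0, 2.0])
eta = np.array([0.5, 0.5])            # two scalar meta-parameters
W_eta = np.eye(2)                     # embedding matrix (learned in practice)
w_value = np.array([1.0, 1.0, 2.0, 2.0])

value = conditioned_value(state, eta, W_eta, w_value)
shifted_value = conditioned_value(state, np.array([1.0, 1.0]), W_eta, w_value)
```

Because the meta-parameters enter as an input, the same state yields a different value estimate when they change, which is the property the quoted passage relies on.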

Regarding Claim 8, the rejection of Claim 7 is incorporated.
Xu further teaches:
wherein the gradients of the manager objective function (Xu, p. 3, 1.2 Meta-Gradient Prediction: "The objective of the TD($\lambda$) algorithm ... is to minimise the squared error between the value function approximator ... and the ... return ... [Eq. 8] ... where $\tau$ is a sampled trajectory starting with state S, and $\partial J$ ... is a semi-gradient," where Xu's value function corresponds to the instant manager objective, and comprises the policy gradient $\partial J$) and of the option policy objective function comprise respective policy gradients (Xu, p. 4, 1.3 Meta-Gradient Control: "The semi-gradient of the A2C objective $\partial J$ ... is defined as ... [Eq. 12]. The first term represents a control objective, encouraging the policy $\pi_\theta$ to select actions that maximise the return," where Xu's policy $\pi_\theta$ corresponds to the instant option objective).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the Florensa/Henderson/Xu combination regarding updating the manager objective function and the option policy objective function with the further teachings of Xu regarding the gradients of the manager objective function and of the option policy objective function comprise respective policy gradients.
The motivation to do so would be to facilitate learning scenarios where the agent may choose policies according to learned meta-parameters, allowing it to take advantage of an improved return function (Xu, p. 3, 1.1 Applying Meta-Gradients to Returns: "we view the return $g$ as a function parameterised by meta-parameters $\eta$, which may be differentiated to understand its dependence on $\eta$. This in turn allows us to compute the gradient ... of the update function with respect to the meta-parameters $\eta$, and hence the meta-gradient ... In essence, our agent asks itself the question, 'which return results in the best performance?', and adjusts its meta-parameters accordingly").

Claims 9, 11-14, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Florensa, et al., "Stochastic neural networks for hierarchical reinforcement learning" (hereinafter "Florensa") in view of Henderson, et al., "OptionGAN: Learning joint reward-policy options using generative adversarial inverse reinforcement learning" (hereinafter "Henderson") in view of Bacon, et al., "The option-critic architecture" (hereinafter "Bacon").
Regarding Claim 9, the rejection of Claim 1 is incorporated.
The Florensa/Henderson combination teaches that the option policy neural network selected by the manager action generates the output until an option termination criterion is met.
The Florensa/Henderson combination does not explicitly teach a set of option termination neural networks, one for each respective option policy neural network, each configured to, at each of the time steps: process the observation, according to parameter values of the option reward neural network, to generate an option termination value for the respective option policy neural network wherein, for each option reward neural network, the option termination value determines whether the option termination criterion is met.
However, Bacon teaches:
further comprising a set of option termination neural networks, one for each respective option policy neural network (Bacon, p. 2, Learning Options: "We consider the call-and-return option execution model, in which an agent picks option $\omega$ according to its policy over options $\pi_\Omega$, then follows the intra-option policy $\pi_\omega$ until termination (as dictated by $\beta_\omega$), at which point this procedure is repeated. Let $\pi_{\omega,\theta}$ denote the intra-option policy of option $\omega$ parametrized by $\theta$ and $\beta_{\omega,\theta}$, the termination function of $\omega$ parameterized by $\theta$" and p. 5, Arcade Learning Environment: "We applied the option-critic architecture in the Arcade Learning Environment ... using a deep neural network to approximate the critic and represent the intra-option policies and termination functions"), each configured to, at each of the time steps:
process the observation, according to parameter values of the option reward neural network, to generate an option termination value for the respective option policy neural network (Bacon, p. 4, Algorithm 1: Option-critic with tabular intra-option Q-learning, line 9, terms $\beta_{\omega,\theta}(s')$ for observation $s'$, and line 15, "if $\beta_{\omega,\theta}$ terminates in $s'$", where parameter $\omega$ corresponds to the instant option policy),
wherein, for each option reward neural network, the option termination value determines whether the option termination criterion is met (Bacon, p. 4, Algorithm 1: Option-critic with tabular intra-option Q-learning, lines 15 and 16, "if $\beta_{\omega,\theta}$ terminates in $s'$ choose new $\omega$", where $\omega$ corresponds to the instant option policy).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the Florensa/Henderson combination regarding that the option policy neural network selected by the manager action generates the output until an option termination criterion is met with those of Bacon regarding a set of option termination neural networks, one for each respective option policy neural network, each configured to, at each of the time steps: process the observation, according to parameter values of the option reward neural network, to generate an option termination value for the respective option policy neural network wherein, for each option reward neural network, the option termination value determines whether the option termination criterion is met.
The motivation to do so would be to facilitate training scenarios where option choices are suboptimal by providing early termination (Bacon, p. 3, Learning Options: "when the option choice is suboptimal with respect to the expected value over all options ... it drives the gradient corrections up, which increases the odds of terminating. After termination, the agent has the opportunity to pick a better option using $\pi_\Omega$ [i.e., an agent's policy over options]. ... The termination gradient theorem can be interpreted as providing a gradient-based interrupting Bellman operator").
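
The call-and-return execution model quoted from Bacon in the Claim 9 analysis can be illustrated with a short sketch. This is an assumed toy setup, not Bacon's code: the environment, the two options, the termination probabilities, and every function name below are hypothetical. It only shows the control flow in which the policy over options picks an option, the intra-option policy acts, and a per-option termination function decides at each step whether a new option is chosen.

```python
import random

def run_call_and_return(env_step, initial_state, policy_over_options,
                        option_policies, termination_probs, num_steps, rng):
    """Run options under the call-and-return model for num_steps steps."""
    state = initial_state
    option = policy_over_options(state)          # pick an option
    history = []
    for _ in range(num_steps):
        action = option_policies[option](state)  # follow intra-option policy
        state = env_step(state, action)
        history.append((option, action, state))
        # Sample termination for the current option in the new state.
        if rng.random() < termination_probs[option](state):
            option = policy_over_options(state)  # choose a new option
    return history

# Hypothetical toy problem: the state is an integer and actions are increments.
env_step = lambda state, action: state + action
policy_over_options = lambda state: 0 if state < 3 else 1
option_policies = {0: lambda s: 1, 1: lambda s: 2}
termination_probs = {
    0: lambda s: 1.0 if s >= 3 else 0.0,  # option 0 ends once state reaches 3
    1: lambda s: 0.0,                     # option 1 never terminates
}

history = run_call_and_return(env_step, 0, policy_over_options,
                              option_policies, termination_probs,
                              num_steps=5, rng=random.Random(0))
options_used = [step[0] for step in history]
states_seen = [step[2] for step in history]
```

In this toy run, option 0 is followed until its termination function fires, after which the policy over options switches to option 1.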

Regarding Claim 11, the rejection of Claim 9 is incorporated.
Bacon further teaches:
wherein the system is configured to train the set of option termination neural networks by, after the termination criterion is met for a respective option policy neural network: updating the parameter values of the option termination neural network for the respective option policy neural network using the task rewards (Bacon, p. 3, Algorithms and Architecture: "we can now design a stochastic gradient descent algorithm for learning options. Using a two-timescale framework ..., we propose to learn ... while updating the ... termination functions at a slower rate" and p. 6, Arcade Learning Environment: "As a consequence of optimizing for the return, the termination gradient tends to shrink options over time. This is expected since in theory primitive actions are sufficient for solving any MDP. We tackled this issue by adding a small $\xi = 0.01$ term to the advantage function, used by the termination gradient $A_\Omega(s, \omega) + \xi = Q_\Omega(s, \omega) - V_\Omega(s) + \xi$. This term has a regularization effect, by imposing an $\xi$-margin between the value estimate of an option and that of the 'optimal' one reflected in $V_\Omega$," where Bacon's value V is an expected return and based on the result of a reward function).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the Florensa/Henderson/Bacon combination regarding a set of option termination networks that calculate an option termination value that determines whether the option termination criterion is met with the further teachings of Bacon regarding updating the parameter values of the option termination neural network for the respective option policy neural network using the task rewards. 
The motivation to do so would be to facilitate learning scenarios where optimizing for the expected return based on reward values may take advantage of early termination of underperforming policies (Bacon, p. 3, Learning Options: "in our case, it follows as a direct consequence of the derivation and gives the theorem an intuitive interpretation: when the option choice is suboptimal with respect to the expected value over all options, the advantage function is negative and it drives the gradient corrections up, which increases the odds of terminating. After termination, the agent has the opportunity to pick a better option using $\pi_\Omega$").
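
The effect of the margin term quoted from Bacon in the Claim 11 analysis can be sketched numerically. The sketch below is an illustration under stated assumptions rather than Bacon's implementation: it assumes a single scalar termination logit passed through a sigmoid, a hypothetical learning rate, and made-up value estimates. It applies a gradient step of the form logit ← logit − lr · (dβ/dlogit) · (Q − V + ξ), so a negative advantage raises the termination probability and a positive one lowers it.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def margin_advantage(q_value, v_value, xi=0.01):
    """Advantage with the xi margin: Q(s, w) - V(s) + xi."""
    return q_value - v_value + xi

def termination_step(logit, q_value, v_value, lr=0.5, xi=0.01):
    """One gradient step on a scalar termination logit (assumed sigmoid form)."""
    beta = sigmoid(logit)
    grad_beta = beta * (1.0 - beta)   # derivative of sigmoid w.r.t. the logit
    return logit - lr * grad_beta * margin_advantage(q_value, v_value, xi)

# Hypothetical value estimates: a worse-than-average and a better-than-average option.
worse_logit = termination_step(0.0, q_value=1.0, v_value=2.0)
better_logit = termination_step(0.0, q_value=2.0, v_value=1.0)
tie_advantage = margin_advantage(1.0, 1.0)
```

With equal value estimates the advantage is still ξ, so the update nudges the termination probability down slightly, which matches the quoted purpose of keeping options from shrinking over time.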

Regarding Claim 12, the rejection of Claim 11 is incorporated.
Bacon further teaches:
wherein updating the parameter values of the option termination neural network for the respective option policy neural network using the task rewards comprises: generating a trajectory comprising a sequence of one or more actions selected by the respective option policy neural network selected by the option selection action, and corresponding observations and task rewards (Bacon, p. 2, Learning Options: "Suppose we aim to optimize directly the discounted return, expected over all the trajectories starting at a designated state $s_0$ and option $\omega_0$, the $\rho(\Omega, \theta, \vartheta, s_0, \omega_0) = E_{\Omega,\theta,\omega}\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0, \omega_0\right]$. Note that this return depends on the policy over options, as well as the parameters of the option policies and termination functions" and p. 4, Algorithm 1, "Option-critic with tabular intra-option Q-learning," where a non-terminating sequence of $s$ and $s'$ for an option policy $\pi_{\omega,\theta}$ in the repeat loop corresponds to the instant trajectory); and
updating the parameter values of the option termination neural network for the respective option policy neural network using the task rewards from the trajectory (Bacon, p. 6, Arcade Learning Environment: "As a consequence of optimizing for the return, the termination gradient tends to shrink options over time. This is expected since in theory primitive actions are sufficient for solving any MDP. We tackled this issue by adding a small $\xi = 0.01$ term to the advantage function, used by the termination gradient $A_\Omega(s, \omega) + \xi = Q_\Omega(s, \omega) - V_\Omega(s) + \xi$. This term has a regularization effect, by imposing an $\xi$-margin between the value estimate of an option and that of the 'optimal' one reflected in $V_\Omega$," where Bacon's value V is an expected return and based on the result of a reward function).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the Florensa/Henderson/Bacon combination regarding XXX with the further teachings of Bacon regarding updating the parameter values of the option termination neural network for the respective option policy neural network using the task rewards by: generating a trajectory comprising a sequence of one or more actions selected by the respective option policy neural network selected by the option selection action, and corresponding observations and task rewards; and updating the parameter values of the option termination neural network for the respective option policy neural network using the task rewards from the trajectory.
The motivation to do so would be to facilitate learning scenarios where optimizing for the expected return based on reward values may take advantage of early termination of underperforming policies (Bacon, p. 3, Learning Options: "in our case, it follows as a direct consequence of the derivation and gives the theorem an intuitive interpretation: when the option choice is suboptimal with respect to the expected value over all options, the advantage function is negative and it drives the gradient corrections up, which increases the odds of terminating. After termination, the agent has the opportunity to pick a better option using $\pi_\Omega$").
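
As a minimal sketch of the termination update described in the Bacon passages cited for Claim 12, the following assumes a tabular setting, a sigmoid termination function over a per-state logit, and a greedy policy over options so that $V_\Omega(s') = \max_\omega Q_\Omega(s', \omega)$. The function name `termination_update`, the table layout, and the learning rate are hypothetical choices made for illustration and are not taken from the cited reference or the application.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def termination_update(vartheta, q_omega, s_next, omega, lr=0.25, xi=0.01):
    """One termination-gradient step for option `omega` at state `s_next`.

    vartheta[s][w] is a termination logit; beta = sigmoid(logit) is an
    assumed parameterization. q_omega[s][w] estimates Q_Omega(s, w).
    Returns the regularized advantage A_Omega(s', omega) + xi.
    """
    # Assumption: greedy policy over options, so V is the max over options.
    v = max(q_omega[s_next])
    advantage = q_omega[s_next][omega] - v + xi
    beta = sigmoid(vartheta[s_next][omega])
    grad_beta = beta * (1.0 - beta)  # derivative of sigmoid w.r.t. its logit
    # Negative advantage raises the logit, i.e. increases the odds of terminating.
    vartheta[s_next][omega] -= lr * grad_beta * advantage
    return advantage


# Usage: option 1 is worse than option 0 in state 0.
logits = [[0.0, 0.0]]
q_values = [[1.0, 0.2]]
termination_update(logits, q_values, s_next=0, omega=1)
```

With these inputs the logit for the weaker option increases, while the small positive $\xi$ gives the best option a slightly positive advantage and so nudges its termination logit down, which is the margin effect the quoted passage describes.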

Regarding Claim 13, the rejection of Claim 12 is incorporated.
Bacon further teaches:
wherein updating the parameter values of the option termination neural network for the respective option policy neural network using the task rewards from the trajectory comprises: back propagating gradients of an option termination objective function based on the task rewards from the trajectory through the respective option policy neural network and through the option termination neural network for the respective option policy neural network (Bacon, p. 4, Algorithm 1: Option-critic with tabular intra-option Q-learning, lines 13 and 14 in section "2. Options improvement," where the option network parameters $\theta$ and termination network parameters $\vartheta$ are updated according to the respective gradients for the given option policy $\pi_{\omega,\theta}$, where a non-terminating sequence of $s$ and $s'$ for the option policy $\pi_{\omega,\theta}$ in the repeat loop corresponds to the instant trajectory through the policy and termination networks).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the Florensa/Henderson/Bacon combination regarding updating the parameter values of the option termination neural network for the respective option policy neural network with the further teachings of Bacon regarding back propagating gradients of an option termination objective function based on the task rewards from the trajectory through the respective option policy neural network and through the option termination neural network for the respective option policy neural network.
The motivation to do so would be to facilitate developing learning scenarios without requiring additional rewards or subgoals while remaining flexible and efficient with respect to the learning environment (Bacon, p. 1, Abstract: "We derive policy gradient theorems for options and propose a new option-critic architecture capable of learning both the internal policies and the termination conditions of options, in tandem with the policy over options, and without the need to provide any additional rewards or subgoals. Experimental results in both discrete and continuous environments showcase the flexibility and efficiency of the framework").

Regarding Claim 14, the rejection of Claim 1 is incorporated.
The Florensa/Henderson combination teaches:
wherein the system is configured to train the manager neural network dependent on an estimated return comprising the expected task rewards from the environment (Florensa, p. 3, 3 Preliminaries: "We define a discrete-time finite-horizon discounted Markov decision process (MDP) ... in which ... $r: S \times A \to [-R_{\max}, R_{\max}]$ a bounded reward function .... The objective is to maximize its expected discounted return, $\eta(\pi_\theta) = E_\tau\left[\sum_{t=0}^{T} \gamma^{t} r(s_t, a_t)\right]$, where $\tau = (s_0, a_0, \ldots)$ denotes the whole trajectory") when selecting manager actions (Florensa, p. 6, Figure 2, "Hierarchical SNN architecture to solve downstream tasks," depicting the manager neural network sampling the discrete action z based on current state) according to current parameter values of the manager neural network (Florensa, p. 5, 5.4 Learning High-Level Policies: "we leverage the provided skills by freezing them and training a high-level policy (Manager Neural Network) that operates by selecting a skill and committing to it for a fixed amount of steps ...." and "The weights of the low level and high level neural networks could also be jointly optimized to adapt the skills to the task at hand. ... Nevertheless, we show in our experiments that frozen low-level policies are already sufficient to achieve good performance in the studied downstream tasks").
The Florensa/Henderson combination teaches training the manager neural network dependent on an estimated return comprising the expected task rewards from the environment.
The Florensa/Henderson combination does not explicitly teach train the manager neural network dependent ... on a switching cost.
However, Bacon teaches:
train the manager neural network dependent ... on a switching cost (Bacon, p. 7, Discussion: "if one wanted to use additional pseudo-rewards, the option-critic framework would easily accommodate it. In this case, the internal policies and termination function gradients would simply need to be taken with respect to the pseudo-rewards instead of the task reward. A simple instance of this idea, which we used in some of the experiments, is to use additional rewards to encourage options that are indeed temporally extended by adding a penalty whenever a switching event occurs").
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the Florensa/Henderson combination regarding training the manager neural network dependent on an estimated return comprising the expected task rewards from the environment with those of Bacon regarding training the manager neural network dependent on a switching cost.
The motivation to do so would be to facilitate supporting processing scenarios where task options are temporally extended (Bacon, p. 7, Discussion: "We developed a general gradient-based approach ... in order to optimize a performance objective for the task at hand. ... A simple instance of this idea, which we used in some of the experiments, is to use additional rewards to encourage options that are indeed temporally extended by adding a penalty whenever a switching event occurs. Our approach can work seamlessly with any other heuristic for biasing the set of options towards some desirable property (e.g. compositionality or sparsity), as long as it can be expressed as an additive reward structure").
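
The switching penalty in the Bacon passage quoted above can be pictured as reward shaping applied before the discounted return is computed. The sketch below is illustrative only: the helper names `apply_switching_cost` and `discounted_return`, the per-step list of active options, and the cost value are assumptions introduced here and do not appear in the cited references or the claims.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma**t * rewards[t], accumulated from the last step backward."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def apply_switching_cost(rewards, options, cost=0.1):
    """Subtract `cost` from the reward at each step where the active option changes.

    `options[t]` is the option active at step t (hypothetical bookkeeping).
    The first step is never counted as a switch.
    """
    shaped = []
    prev = None
    for r, w in zip(rewards, options):
        switched = prev is not None and w != prev
        shaped.append(r - cost if switched else r)
        prev = w
    return shaped


# Usage: one switch at the third step lowers the return seen by the manager.
shaped = apply_switching_cost([1.0, 1.0, 1.0], [0, 0, 1], cost=0.5)
manager_return = discounted_return(shaped, gamma=1.0)
```

Because the penalty is additive, any learner that consumes the return needs no other change, which matches the "additive reward structure" condition in the quoted discussion.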

Regarding Claim 19, the rejection of canceled Claim 17 is not incorporated. For the purposes of examination, Claim 19 is interpreted to depend from the system of Claim 1.
The Florensa/Henderson combination teaches that the option policy neural network selected by the manager action generates the output until an option termination criterion is met.
The Florensa/Henderson combination does not explicitly teach maintaining a set of option termination neural networks, one for each respective option policy neural network, each providing an option termination value according to parameter values of the option termination neural network that determines whether the option termination criterion is met for the respective option policy neural network, and fixing the parameter values of the option termination neural network during processing of the observations for the successive time steps by the selected option policy neural network, and after processing of the observations for the successive time steps by the selected option policy neural network, training the respective option termination neural network using the task rewards.
However, Bacon teaches:
maintaining a set of option termination neural networks, one for each respective option policy neural network, each providing an option termination value according to parameter values of the option termination neural network (Bacon, p. 2, Learning Options: "We consider the call-and-return option execution model, in which an agent picks option $\omega$ according to its policy over options $\pi_\Omega$, then follows the intra-option policy $\pi_\omega$ until termination (as dictated by $\beta_\omega$), at which point this procedure is repeated. Let $\pi_{\omega,\theta}$ denote the intra-option policy of option $\omega$ parametrized by $\theta$ and $\beta_{\omega,\vartheta}$, the termination function of $\omega$ parameterized by $\vartheta$" and p. 5, Arcade Learning Environment: "We applied the option-critic architecture in the Arcade Learning Environment ... using a deep neural network to approximate the critic and represent the intra-option policies and termination functions") that determines whether the option termination criterion is met for the respective option policy neural network (Bacon, p. 4, Algorithm 1: Option-critic with tabular intra-option Q-learning, line 9, terms $\beta_{\omega,\vartheta}(s')$ for observation $s'$, and line 15, "if $\beta_{\omega,\vartheta}$ terminates in $s'$", where parameter $\omega$ corresponds to the instant option policy), and
fixing the parameter values of the option termination neural network during processing of the observations for the successive time steps by the selected option policy neural network (Bacon, p. 4, Algorithm 1: Option-critic with tabular intra-option Q-learning, lines 4-8 of the repeat loop, where the termination network's parameters $\beta_{\omega,\vartheta}$ are frozen for handling the subsequent observation $s'$ during each iteration of the repeat loop until a terminal $s'$), and after processing of the observations for the successive time steps by the selected option policy neural network, training the respective option termination neural network using the task rewards (Bacon, p. 4, Algorithm 1: Option-critic with tabular intra-option Q-learning, line 14, depicting an update step of termination network parameters $\vartheta$ according to gradients of the parameters and the value function, which is based on the returns and thus the rewards).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the Florensa/Henderson combination regarding the option policy neural network selected by the manager action generating the output until an option termination criterion is met with those of Bacon regarding maintaining a set of option termination neural networks, one for each respective option policy neural network, each providing an option termination value according to parameter values of the option termination neural network that determines whether the option termination criterion is met for the respective option policy neural network, and fixing the parameter values of the option termination neural network during processing of the observations for the successive time steps by the selected option policy neural network, and after processing of the observations for the successive time steps by the selected option policy neural network, training the respective option termination neural network using the task rewards.
The motivation to do so would be to facilitate training scenarios where option choices may be suboptimal by providing early termination by way of learning termination conditions of option policies along with the policies (Bacon, p. 3, Learning Options: "when the option choice is suboptimal with respect to the expected value over all options ... it drives the gradient corrections up, which increases the odds of terminating. After termination, the agent has the opportunity to pick a better option using $\pi_\Omega$ [i.e., an agent's policy over options]. ... The termination gradient theorem can be interpreted as providing a gradient-based interrupting Bellman operator").
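
The ordering recited in the Claim 19 limitation (termination parameters held fixed while the selected option runs, then trained from the task rewards afterward) can be sketched as follows. This is a simplified illustration under stated assumptions: the environment is a caller-supplied `step` function, termination is a sigmoid over a per-state logit, and the post-rollout rule in `update_after_rollout` is a placeholder chosen for this sketch rather than Bacon's termination gradient. All names are hypothetical.

```python
import math
import random


def rollout_option(step, act, beta_logit, state, rng, max_steps=50):
    """Run one option until its termination function fires.

    The logits are copied up front and only the copy is read, so no
    termination parameter changes while the option is executing.
    """
    frozen = dict(beta_logit)
    trajectory = []
    for _ in range(max_steps):
        action = act(state)
        next_state, reward = step(state, action)
        trajectory.append((state, action, reward, next_state))
        state = next_state
        p_term = 1.0 / (1.0 + math.exp(-frozen.get(state, 0.0)))
        if rng.random() < p_term:
            break
    return trajectory


def update_after_rollout(beta_logit, trajectory, baseline=0.0, lr=0.1):
    """Placeholder rule: lower the termination logit where reward beat the baseline."""
    for _, _, reward, next_state in trajectory:
        old = beta_logit.get(next_state, 0.0)
        beta_logit[next_state] = old - lr * (reward - baseline)


# Usage: a chain environment that always pays reward 1.0.
logits = {1: 50.0}
traj = rollout_option(lambda s, a: (s + 1, 1.0), lambda s: 0, logits, 0, random.Random(0))
update_after_rollout(logits, traj)
```

Separating the rollout from the update makes the "fixed during processing, trained afterward" sequence explicit; an online variant would interleave the two calls at every step.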

Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Florensa, et al., "Stochastic neural networks for hierarchical reinforcement learning" (hereinafter "Florensa") in view of Henderson, et al., "OptionGAN: Learning joint reward-policy options using generative adversarial inverse reinforcement learning" (hereinafter "Henderson") in view of Bacon, et al., "The option-critic architecture" (hereinafter "Bacon") in view of Xu, et al., "Meta-gradient reinforcement learning" (hereinafter "Xu").
Regarding Claim 10, the rejection of Claim 9 is incorporated.
Bacon further teaches:
wherein the system is configured to train (Bacon, p. 5, Arcade Learning Environment: "We fixed the learning rate for the intra-option policies and termination gradient to 0.00025 and used RMSProp for the critic") the option termination neural networks using the task rewards in a ... gradient training technique in which parameter values of the option termination neural network are adjusted ... to optimize a return from the environment (Bacon, p. 2, Learning Options: "Suppose we aim to optimize directly the discounted return, expected over all the trajectories starting at a designated state $s_0$ and option $\omega_0$ .... We will take gradients of this objective with respect to $\theta$ and $\vartheta$" and "We adopt a continual perspective on the problem of learning options. ... [W]e focus on learning option policies and termination functions, assuming they are represented using differentiable parameterized function approximators. ¶ ... Let ... $\beta_{\omega,\vartheta}$ [denote] the termination function of $\omega$ parameterized by $\vartheta$") based on the agent's interaction with the environment under control of the respective option policy neural network (Bacon, p. 3, Algorithms and Architecture, Figure 1, "Diagram of the option-critic architecture. The option execution model is depicted by a switch ⊥ over the contacts ⊸. A new option is selected according to $\pi_\Omega$ only when the current option terminates," depicting Environment and observation $s_t$).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the Florensa/Henderson/Bacon combination regarding a set of option termination neural networks, one for each respective option policy neural network, with the further teachings of Bacon regarding training the option termination neural networks using the task rewards in a gradient training technique in which parameter values of the option termination neural network are adjusted to optimize a return from the environment based on the agent's interaction with the environment under control of the respective option policy neural network.
The motivation to do so would be to facilitate scenarios where option policies are trained by distilling all experience available from the environment (Bacon, p. 2, Learning Options: "We adopt a continual perspective on the problem of learning options. At any time, we would like to distill all of the available experience into every component of our system: value function and policy over options, intra-option policies and termination functions. To achieve this goal, we focus on learning option policies and termination functions, assuming they are represented using differentiable parameterized function approximators").
The Florensa/Henderson/Bacon combination teaches the system being configured to train the option termination neural networks using the task rewards in a gradient training technique.
The Florensa/Henderson/Bacon combination may not explicitly teach training the option ... neural networks using the task rewards in a meta-gradient training technique.
However, Xu teaches:
train the option ... neural networks using the task rewards in a meta-gradient training technique (Xu, p. 3, 1.1 Applying Meta-Gradients to Returns: "we view the return $g$ as a function parameterised by meta-parameters $\eta$, which may be differentiated to understand its dependence on $\eta$. This in turn allows us to compute the gradient $\partial f / \partial \eta$ of the update function with respect to the meta-parameters $\eta$, and hence the meta-gradient $\partial J'(\tau', \theta', \eta') / \partial \eta$," where Xu's return as a cumulative reward corresponds to the instant reward).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the Florensa/Henderson/Bacon combination regarding the system being configured to train the option termination neural networks using the task rewards in a gradient training technique with those of Xu regarding training the option neural networks using the task rewards in a meta-gradient training technique.
The motivation to do so would be to facilitate a reinforcement learning where learning parameters are adjusted during training based on model performance (Xu, p. 3, 1.1 Applying Meta-Gradients to Returns: "A typical RL algorithm would hand-select the meta-parameters, such as the discount factor γ and bootstrapping parameter λ, and these would be held fixed throughout training. ... In essence, our agent asks itself the question, 'which return results in the best performance?', and adjusts its meta-parameters accordingly").
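
The meta-gradient computation quoted from Xu can be illustrated with a small scalar sketch. This is not Xu's implementation and not the claimed system: the single value estimate `theta`, the squared-error objective, the function names, and all numbers are assumptions chosen so the chain rule can be checked by hand. The discount factor γ plays the role of the meta-parameter η, and the objective is treated as a loss to be minimised.

```python
def discounted_return(rewards, gamma):
    """Return g = sum_t gamma^t * r_t, viewed as a function of the meta-parameter gamma."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def d_return_d_gamma(rewards, gamma):
    """Analytic derivative dg/dgamma = sum_t t * gamma^(t-1) * r_t."""
    return sum(t * (gamma ** (t - 1)) * r for t, r in enumerate(rewards) if t > 0)


def inner_update(theta, rewards, gamma, alpha):
    """One value-learning step on the first trajectory: theta' = theta + alpha * (g - theta)."""
    return theta + alpha * (discounted_return(rewards, gamma) - theta)


def meta_objective(theta_new, rewards_val, gamma_ref):
    """Squared error of the updated estimate on a second trajectory, under a fixed reference discount."""
    g_ref = discounted_return(rewards_val, gamma_ref)
    return 0.5 * (g_ref - theta_new) ** 2


def meta_gradient(theta, rewards, rewards_val, gamma, gamma_ref, alpha):
    """Chain rule: dJ'/dgamma = (dJ'/dtheta') * (dtheta'/dgamma)."""
    theta_new = inner_update(theta, rewards, gamma, alpha)
    g_ref = discounted_return(rewards_val, gamma_ref)
    d_obj_d_theta_new = -(g_ref - theta_new)
    d_theta_new_d_gamma = alpha * d_return_d_gamma(rewards, gamma)
    return d_obj_d_theta_new * d_theta_new_d_gamma


# Hypothetical numbers, for illustration only.
gamma = 0.5
grad = meta_gradient(0.0, [1.0, 2.0, 3.0], [1.0, 1.0, 1.0], gamma, 0.9, 0.1)
gamma_updated = gamma - 0.01 * grad  # gradient-descent step on the meta-parameter
```

In this toy setting the meta-gradient is negative, so the descent step raises γ toward the reference discount, which is the "which return results in the best performance?" adjustment described in the quoted passage.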

Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over Florensa, et al., "Stochastic neural networks for hierarchical reinforcement learning" (hereinafter "Florensa") in view of Henderson, et al., "OptionGAN: Learning joint reward-policy options using generative adversarial inverse reinforcement learning" (hereinafter "Henderson") in view of Bacon, et al., "The option-critic architecture" (hereinafter "Bacon") in view of Han, et al. "Multi-agent hierarchical reinforcement learning with dynamic termination."
Regarding Claim 15, the rejection of Claim 14 is incorporated.
The Florensa/Henderson/Bacon combination teaches training the manager neural network dependent on a switching cost.
The Florensa/Henderson/Bacon combination does not explicitly teach wherein the switching cost is configured to reduce the task reward or return used to update the parameter values of the manager neural network.
However, Han teaches:
wherein the switching cost is configured to reduce the task reward or return used to update the parameter values of the manager neural network (Han, p. 8, 4 Method, Dynamic Option Termination: "agents that utilize the delayed Q-value of Equation 5 will make sub-optimal decisions whenever another agent terminates. To increase the predictability of agents, while allowing them to terminate flexibly when the task demands it, we propose to put a price δ on the decision to terminate the current option. Option termination is therefore no longer hard-coded, but becomes part of the agent's policy, which we call dynamic termination," where Han's price δ corresponds to the instant reward-reducing cost).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the Florensa/Henderson/Bacon combination regarding training the manager neural network dependent on a switching cost with those of Han regarding the switching cost being configured to reduce the task reward or return used to update the parameter values of the manager neural network.
The motivation to do so would be to support processing scenarios where an agent's flexibility and predictability are balanced according to changes in environment (Han, p. 3, 1 Introduction: "We will refer to an agent's flexibility as the ability to switch options in response to changes in others or the environment. Furthermore, we will use predictability to measure how far an agent will commit to its broadcast option. In this paper, we propose an approach called dynamic termination, which allows an agent to choose whether to terminate its current option according to the state and others' options. This approach balances flexibility and predictability, combining the advantages of both").
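
The claim limitation mapped to Han's price δ (a switching cost that reduces the task reward or return) can be pictured with a minimal sketch. The function names, the option labels, and the value of `delta` below are hypothetical; the sketch follows the claim language as paraphrased in this action and is not Han's Q-value formulation or the applicant's implementation.

```python
def rewards_with_switching_cost(task_rewards, options, delta):
    """Subtract delta from the task reward at every step where the selected option changes."""
    shaped = []
    previous = None
    for reward, option in zip(task_rewards, options):
        switched = previous is not None and option != previous
        shaped.append(reward - delta if switched else reward)
        previous = option
    return shaped


def discounted_return(rewards, gamma):
    """Return sum_t gamma^t * r_t for the (possibly cost-reduced) rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Hypothetical episode: the manager switches from option "a" to option "b" at the third step.
task_rewards = [1.0, 1.0, 1.0, 1.0]
options = ["a", "a", "b", "b"]
shaped = rewards_with_switching_cost(task_rewards, options, delta=0.25)
```

A manager trained on the return of `shaped` rather than `task_rewards` sees a lower return for each switch, so it is discouraged from changing options unless the task reward gained outweighs δ.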

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ROBERT N DAY whose telephone number is (703)756-1519. The examiner can normally be reached M-F 9-5.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki can be reached at (571) 272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/R.N.D./Examiner, Art Unit 2122

/KAKALI CHAKI/Supervisory Patent Examiner, Art Unit 2122
Prosecution Timeline

Oct 12, 2022: Application Filed
Nov 14, 2025: Non-Final Rejection mailed (§101, §103, §112)
Feb 03, 2026: Interview Requested
Feb 10, 2026: Response Filed
Feb 10, 2026: Examiner Interview Summary
Feb 10, 2026: Applicant Interview (Telephonic)
May 26, 2026: Final Rejection mailed (§101, §103, §112; current)
Precedent Cases

Applications granted by this same examiner with similar technology

17/869,095 (Patent 12632783): FEDERATED CONTINUAL LEARNING. 3y 10m to grant; granted May 19, 2026.
17/195,116 (Patent 12406181): METHOD, DEVICE, AND COMPUTER PROGRAM PRODUCT FOR UPDATING MODEL. 4y 5m to grant; granted Sep 02, 2025.
17/155,997 (Patent 12229685): MODEL SUITABILITY COEFFICIENTS BASED ON GENERATIVE ADVERSARIAL NETWORKS AND ACTIVATION MAPS. 4y 0m to grant; granted Feb 18, 2025.
Study what changed to get past this examiner. Based on 3 most recent grants.
Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 21%
With Interview (+20.0%): 41%
Median Time to Grant: 4y 1m (~5m remaining)
PTA Risk: Moderate
Based on 24 resolved cases by this examiner. Grant probability derived from career allowance rate.
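
The projection figures can be reproduced from the examiner statistics shown at the top of this page (5 granted out of 24 resolved cases, +20.0% interview lift). The sketch below assumes the interview lift is added in percentage points, which matches the 21% and 41% shown; the function names are illustrative and the page does not document its exact method.

```python
def allowance_rate(granted, resolved):
    """Career allowance rate as a fraction of resolved cases."""
    if resolved <= 0:
        raise ValueError("resolved must be positive")
    return granted / resolved


def with_interview(base_rate, lift_points):
    """Add the interview lift (in fractional percentage points), capped at 100%."""
    return min(1.0, base_rate + lift_points)


base = allowance_rate(5, 24)          # 5 granted / 24 resolved
boosted = with_interview(base, 0.20)  # assumed additive +20.0 points
print(f"{base:.0%} {boosted:.0%}")    # prints "21% 41%"
```

With only 24 resolved cases behind the rate, each additional grant or abandonment moves it by roughly four percentage points, so both figures are rough estimates.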