DETAILED ACTION
This action is responsive to the application and preliminary amendment filed on 03/08/2023. Claims 1-14 and 16-21 are pending in the case. Claims 3-4, 7-8, 11-13, and 16 are currently amended. Claims 1, 16, and 17 are independent claims.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
Acknowledgement is made of applicant’s claim for domestic priority based on international application no. PCT/EP2021/074892 filed 09/10/2021, which claims priority to provisional application no. 63/076876 filed 09/10/2020.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 09/02/2025 has been considered by the examiner.
The information disclosure statement filed 11/30/2023 fails to comply with 37 CFR 1.98(a)(2), which requires a legible copy of each cited foreign patent document; each non-patent literature publication or that portion which caused it to be listed; and all other information or that portion which caused it to be listed. It has been placed in the application file, but the information referred to therein has not been considered.
The non-patent literature document titled “Information Maximization in Noisy Channels: A Variational Approach” has been stricken through and not considered because there was no legible copy of the document submitted. The non-patent literature documents titled “The arcade learning environment: An evaluation platform for general agents”, “Diversity is All You Need: Learning Skills without a Reward Function”, “Fast task inference with variational intrinsic successor features”, and “Adam: A Method for Stochastic Optimization” have been stricken through and not considered because the dates provided in the citations do not match the dates of the documents submitted. All other references are being considered by the examiner.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-14 and 16-21 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Regarding claim 1: Step 1 Statutory Category: Claim 1 is directed to a method, which falls under one of the four statutory categories.
Step 2A Prong 1 Judicial Exception: Claim 1 recites, in part, “generate a policy output that defines a control policy for controlling the agent”. This limitation, under the broadest reasonable interpretation, covers the recitation of mathematical concepts, see MPEP §2106.04(a)(2)(I). Further, the claim recites: “selecting a skill from the set of skills”. This limitation, under the broadest reasonable interpretation, covers the recitation of a mental process that can practically be performed in the human mind, with or without the use of a physical aid such as pen and paper (including an observation, evaluation, judgment, opinion), in this case a judgment. See MPEP § 2106.04(a)(2)(III). Further, the claim recites: “generating a trajectory… the trajectory comprising a sequence of observations received while the agent interacts with the environment”. This limitation, under the broadest reasonable interpretation, covers the recitation of mathematical concepts, see MPEP §2106.04(a)(2)(I). Further, the claim recites: “processing a relative input comprising (i) an initial observation in the sequence and (ii) a last observation in the sequence… process the relative input to generate a relative output that includes a respective relative score corresponding to each skill in the set of skills, each relative score representing an estimated likelihood that the policy neural network was conditioned on the corresponding skill while the trajectory was generated”. This limitation, under the broadest reasonable interpretation, covers the recitation of mathematical concepts, see MPEP §2106.04(a)(2)(I). Further, the claim recites: “processing an absolute input comprising the last observation in the sequence… process the absolute input to generate an absolute output that includes a respective absolute score corresponding to each skill in the set of skills, each absolute score representing an estimated likelihood that the policy neural network was conditioned on the corresponding skill while the trajectory was generated”. This limitation, under the broadest reasonable interpretation, covers the recitation of mathematical concepts, see MPEP §2106.04(a)(2)(I). Further, the claim recites: “generating a reward for the trajectory from the absolute score corresponding to the selected skill and the relative score corresponding to the selected skill”. This limitation, under the broadest reasonable interpretation, covers the recitation of mathematical concepts, see MPEP §2106.04(a)(2)(I).
Step 2A Prong 2 Integration into a practical application: This judicial exception is not integrated into a practical application. In particular the claim recites: “a policy neural network for use in controlling an agent interacting with an environment”. This limitation is an additional element that generally links the use of the judicial exception to a particular technological environment or field of use. See MPEP §2106.05(h). Further, the claim recites: “wherein the policy neural network is configured to receive a policy input comprising an input observation characterizing a state of the environment and data identifying a skill from a set of skills”. This limitation amounts to mere data gathering. It is necessary to acquire the data in order to use the recited judicial exception. Therefore, this limitation is insignificant extra-solution activity to the judicial exception, see MPEP §2106.05(g). Further, the claim recites: “by controlling the agent using the policy neural network while the policy neural network is conditioned on the selected skill”, “while controlled using the policy neural network that is conditioned on the selected skill”, “using a relative discriminator neural network that is configured to…”, and “using an absolute discriminator neural network that is configured to…”. These limitations are additional elements that generally link the use of the judicial exception to a particular technological environment or field of use. See MPEP §2106.05(h). Finally, the claim recites: “training the policy neural network on the reward for the trajectory”. This is an additional element that amounts to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely uses a computer in its ordinary capacity as a tool to perform an existing process. See MPEP §2106.05(f).
Step 2B Significantly more: The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional elements: “a policy neural network for use in controlling an agent interacting with an environment”, “by controlling the agent using the policy neural network while the policy neural network is conditioned on the selected skill”, “while controlled using the policy neural network that is conditioned on the selected skill”, “using a relative discriminator neural network that is configured to…”, and “using an absolute discriminator neural network that is configured to…” generally link the use of the judicial exception to a particular technological environment or field of use. Elements that merely generally link the use of the judicial exception to a particular technological environment or field of use cannot provide an inventive concept. Further, the additional element “wherein the policy neural network is configured to receive a policy input comprising an input observation characterizing a state of the environment and data identifying a skill from a set of skills” is insignificant extra-solution activity to the judicial exception and is directed to receiving or transmitting data over a network, which courts have recognized as well-understood, routine, and conventional when claimed in a generic manner, see MPEP §2106.05(d)(II). Finally, the additional element “training the policy neural network on the reward for the trajectory” amounts to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely uses a computer in its ordinary capacity as a tool to perform an existing process. Elements that merely amount to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely use a computer in its ordinary capacity as a tool to perform an existing process cannot provide an inventive concept. The claim is not patent eligible.
Regarding claim 2, the rejection of claim 1 is incorporated, and further, the claim recites: “training the absolute discriminator neural network to optimize an objective function that encourages the absolute score corresponding to the selected skill to be increased”. This limitation is an additional element that amounts to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely uses a computer in its ordinary capacity as a tool to perform an existing process. Elements that merely amount to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely use a computer in its ordinary capacity as a tool to perform an existing process cannot provide an inventive concept. The claim is not patent eligible.
Regarding claim 3, the rejection of claim 1 is incorporated, and further, the claim recites: “training the relative discriminator neural network to optimize an objective function that encourages the relative score corresponding to the selected skill to be increased”. This limitation is an additional element that amounts to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely uses a computer in its ordinary capacity as a tool to perform an existing process. Elements that merely amount to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely use a computer in its ordinary capacity as a tool to perform an existing process cannot provide an inventive concept. The claim is not patent eligible.
Regarding claim 4, the rejection of claim 1 is incorporated, and further, the claim recites: “wherein the absolute discriminator neural network and the relative discriminator neural network share some parameters”. This limitation is an additional element that generally links the use of the judicial exception to a particular technological environment or field of use. See MPEP §2106.05(h). Elements that merely generally link the use of the judicial exception to a particular technological environment or field of use cannot provide an inventive concept. The claim is not patent eligible.
Regarding claim 5, the rejection of claim 4 is incorporated, and further, the claim recites: “generates encoded representations of received observations”. This limitation recites mathematical concepts in addition to those identified in the rejection of the parent claim. Thus, the claim recites a judicial exception.
Further, the claim recites: “wherein the absolute discriminator neural network and the relative discriminator neural network share an encoder neural network”. This limitation is an additional element that generally links the use of the judicial exception to a particular technological environment or field of use. See MPEP §2106.05(h). Elements that merely generally link the use of the judicial exception to a particular technological environment or field of use cannot provide an inventive concept. The claim is not patent eligible.
Regarding claim 6, the rejection of claim 5 is incorporated, and further, the claim recites: “process the encoded representation of the last observation to generate the absolute output”. This limitation recites mathematical concepts in addition to those identified in the rejection of the parent claim, thus the claim recites a judicial exception.
Further, the claim recites: “wherein the absolute discriminator neural network comprises an absolute decoder neural network configured to…”. This limitation is an additional element that generally links the use of the judicial exception to a particular technological environment or field of use. See MPEP §2106.05(h). Elements that merely generally link the use of the judicial exception to a particular technological environment or field of use cannot provide an inventive concept. The claim is not patent eligible.
Regarding claim 7, the rejection of claim 5 is incorporated, and further, the claim recites: “process a concatenation of the encoded representations of the initial observation and the last observation to generate the relative output”. This limitation recites mathematical concepts in addition to those identified in the rejection of the parent claim, thus the claim recites a judicial exception.
Further, the claim recites: “wherein the relative discriminator neural network comprises a relative decoder neural network configured to…”. This limitation is an additional element that generally links the use of the judicial exception to a particular technological environment or field of use. See MPEP §2106.05(h). Elements that merely generally link the use of the judicial exception to a particular technological environment or field of use cannot provide an inventive concept. The claim is not patent eligible.
Regarding claim 8, the rejection of claim 1 is incorporated, and further, the claim recites: “wherein training the policy neural network on the reward for the trajectory comprises training the neural network to maximize time discounted expected rewards for generated trajectories, and wherein: the reward rewards high relative scores and penalizes high absolute scores”. This limitation recites mathematical concepts in addition to those identified in the rejection of the parent claim, thus the claim recites a judicial exception.
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible.
Regarding claim 9, the rejection of claim 8 is incorporated, and further, the claim recites: “wherein the reward is equal to or directly proportional to a difference between the relative score corresponding to the selected skill and the absolute score corresponding to the selected skill”. This limitation recites the abstract idea of a mathematical relationship; per MPEP § 2106.04(a)(2)(I)(A), “a mathematical relationship is a relationship between variables or numbers. A mathematical relationship may be expressed in words or using mathematical symbols”.
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible.
Regarding claim 10, the rejection of claim 8 is incorporated, and further, the claim recites: “wherein the reward is equal to or directly proportional to a difference between a logarithm of the relative score corresponding to the selected skill and a logarithm of the absolute score corresponding to the selected skill”. This limitation recites the abstract idea of a mathematical relationship; per MPEP § 2106.04(a)(2)(I)(A), “a mathematical relationship is a relationship between variables or numbers. A mathematical relationship may be expressed in words or using mathematical symbols”.
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible.
Regarding claim 11, the rejection of claim 1 is incorporated, and further, the claim recites: “wherein selecting a skill from the set of skills comprises: sampling a skill from a uniform probability distribution over the set of skills”. This limitation recites mathematical concepts in addition to those identified in the rejection of the parent claim, thus the claim recites a judicial exception.
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible.
Regarding claim 12, the rejection of claim 1 is incorporated, and further, the claim recites: “wherein training the policy neural network on the reward for the trajectory comprises training the policy neural network through off-policy reinforcement learning”. This limitation is an additional element that generally links the use of the judicial exception to a particular technological environment or field of use. See MPEP §2106.05(h). Elements that merely generally link the use of the judicial exception to a particular technological environment or field of use cannot provide an inventive concept. The claim is not patent eligible.
Regarding claim 13, the rejection of claim 1 is incorporated, and further, the claim recites: “wherein generating the trajectory comprises generating the trajectory starting from a last state of the environment for a preceding trajectory, and wherein the initial observation characterizes the last state of the environment for the preceding trajectory”. This limitation is a continuation of the “generating a trajectory… the trajectory comprising a sequence of observations received while the agent interacts with the environment” limitation of the parent claim, and thus the claim recites a judicial exception.
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible.
Regarding claim 14, the rejection of claim 13 is incorporated, and further, the claim recites: “after generating the trajectory, determining whether criteria have been satisfied for resetting the environment; and in response to determining that the criteria are satisfied, selecting, as an initial state for a next trajectory to be generated, a state of the environment from a set of possible initial states of the environment”. This limitation recites mental processes in addition to those identified in the rejection of the parent claim, and thus the claim recites a judicial exception.
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible.
Regarding claim 16: Step 1 Statutory Category: Claim 16 is directed to a machine, which falls under one of the four statutory categories.
Step 2A Prong 1 Judicial Exception: Claim 16 recites, in part, “generate a policy output that defines a control policy for controlling the agent”. This limitation, under the broadest reasonable interpretation, covers the recitation of mathematical concepts, see MPEP §2106.04(a)(2)(I). Further, the claim recites: “selecting a skill from the set of skills”. This limitation, under the broadest reasonable interpretation, covers the recitation of a mental process that can practically be performed in the human mind, with or without the use of a physical aid such as pen and paper (including an observation, evaluation, judgment, opinion), in this case a judgment. See MPEP § 2106.04(a)(2)(III). Further, the claim recites: “generating a trajectory… the trajectory comprising a sequence of observations received while the agent interacts with the environment”. This limitation, under the broadest reasonable interpretation, covers the recitation of mathematical concepts, see MPEP §2106.04(a)(2)(I). Further, the claim recites: “processing a relative input comprising (i) an initial observation in the sequence and (ii) a last observation in the sequence… process the relative input to generate a relative output that includes a respective relative score corresponding to each skill in the set of skills, each relative score representing an estimated likelihood that the policy neural network was conditioned on the corresponding skill while the trajectory was generated”. This limitation, under the broadest reasonable interpretation, covers the recitation of mathematical concepts, see MPEP §2106.04(a)(2)(I). Further, the claim recites: “processing an absolute input comprising the last observation in the sequence… process the absolute input to generate an absolute output that includes a respective absolute score corresponding to each skill in the set of skills, each absolute score representing an estimated likelihood that the policy neural network was conditioned on the corresponding skill while the trajectory was generated”. This limitation, under the broadest reasonable interpretation, covers the recitation of mathematical concepts, see MPEP §2106.04(a)(2)(I). Further, the claim recites: “generating a reward for the trajectory from the absolute score corresponding to the selected skill and the relative score corresponding to the selected skill”. This limitation, under the broadest reasonable interpretation, covers the recitation of mathematical concepts, see MPEP §2106.04(a)(2)(I).
Step 2A Prong 2 Integration into a practical application: This judicial exception is not integrated into a practical application. In particular the claim recites: “one or more non-transitory computer-readable storage media encoded with instructions that, when executed by one or more computers cause the one or more computers to perform first operations”. This limitation is an additional element that amounts to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely uses a computer in its ordinary capacity as a tool to perform an existing process. See MPEP §2106.05(f). Further, the claim recites: “a policy neural network for use in controlling an agent interacting with an environment”. This limitation is an additional element that generally links the use of the judicial exception to a particular technological environment or field of use. See MPEP §2106.05(h). Further, the claim recites: “wherein the policy neural network is configured to receive a policy input comprising an input observation characterizing a state of the environment and data identifying a skill from a set of skills”. This limitation amounts to mere data gathering. It is necessary to acquire the data in order to use the recited judicial exception. Therefore, this limitation is insignificant extra-solution activity to the judicial exception, see MPEP §2106.05(g). Further, the claim recites: “by controlling the agent using the policy neural network while the policy neural network is conditioned on the selected skill”, “while controlled using the policy neural network that is conditioned on the selected skill”, “using a relative discriminator neural network that is configured to…”, and “using an absolute discriminator neural network that is configured to…”. These limitations are additional elements that generally link the use of the judicial exception to a particular technological environment or field of use. See MPEP §2106.05(h). Finally, the claim recites: “training the policy neural network on the reward for the trajectory”. This is an additional element that amounts to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely uses a computer in its ordinary capacity as a tool to perform an existing process. See MPEP §2106.05(f).
Step 2B Significantly more: The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional elements: “a policy neural network for use in controlling an agent interacting with an environment”, “by controlling the agent using the policy neural network while the policy neural network is conditioned on the selected skill”, “while controlled using the policy neural network that is conditioned on the selected skill”, “using a relative discriminator neural network that is configured to…”, and “using an absolute discriminator neural network that is configured to…” generally link the use of the judicial exception to a particular technological environment or field of use. Elements that merely generally link the use of the judicial exception to a particular technological environment or field of use cannot provide an inventive concept. Further, the additional element “wherein the policy neural network is configured to receive a policy input comprising an input observation characterizing a state of the environment and data identifying a skill from a set of skills” is insignificant extra-solution activity to the judicial exception and is directed to receiving or transmitting data over a network, which courts have recognized as well-understood, routine, and conventional when claimed in a generic manner, see MPEP §2106.05(d)(II). Finally, the additional elements “one or more non-transitory computer-readable storage media encoded with instructions that, when executed by one or more computers cause the one or more computers to perform first operations” and “training the policy neural network on the reward for the trajectory” amount to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely use a computer in its ordinary capacity as a tool to perform an existing process. Elements that merely amount to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely use a computer in its ordinary capacity as a tool to perform an existing process cannot provide an inventive concept. The claim is not patent eligible.
Regarding claim 17: Step 1 Statutory Category: Claim 17 is directed to a machine, which falls under one of the four statutory categories.
Step 2A Prong 1 Judicial Exception: Claim 17 recites, in part, “generate a policy output that defines a control policy for controlling the agent”. This limitation, under the broadest reasonable interpretation, covers the recitation of mathematical concepts, see MPEP §2106.04(a)(2)(I). Further, the claim recites: “selecting a skill from the set of skills”. This limitation, under the broadest reasonable interpretation, covers the recitation of a mental process that can practically be performed in the human mind, with or without the use of a physical aid such as pen and paper (including an observation, evaluation, judgment, opinion), in this case a judgment. See MPEP § 2106.04(a)(2)(III). Further, the claim recites: “generating a trajectory… the trajectory comprising a sequence of observations received while the agent interacts with the environment”. This limitation, under the broadest reasonable interpretation, covers the recitation of mathematical concepts, see MPEP §2106.04(a)(2)(I). Further, the claim recites: “processing a relative input comprising (i) an initial observation in the sequence and (ii) a last observation in the sequence… process the relative input to generate a relative output that includes a respective relative score corresponding to each skill in the set of skills, each relative score representing an estimated likelihood that the policy neural network was conditioned on the corresponding skill while the trajectory was generated”. This limitation, under the broadest reasonable interpretation, covers the recitation of mathematical concepts, see MPEP §2106.04(a)(2)(I). Further, the claim recites: “processing an absolute input comprising the last observation in the sequence… process the absolute input to generate an absolute output that includes a respective absolute score corresponding to each skill in the set of skills, each absolute score representing an estimated likelihood that the policy neural network was conditioned on the corresponding skill while the trajectory was generated”. This limitation, under the broadest reasonable interpretation, covers the recitation of mathematical concepts, see MPEP §2106.04(a)(2)(I). Further, the claim recites: “generating a reward for the trajectory from the absolute score corresponding to the selected skill and the relative score corresponding to the selected skill”. This limitation, under the broadest reasonable interpretation, covers the recitation of mathematical concepts, see MPEP §2106.04(a)(2)(I).
Step 2A Prong 2 Integration into a practical application: This judicial exception is not integrated into a practical application. In particular the claim recites: “a system comprising one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform first operations”. This limitation is an additional element that amounts to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely uses a computer in its ordinary capacity as a tool to perform an existing process. See MPEP §2106.05(f). Further, the claim recites: “a policy neural network for use in controlling an agent interacting with an environment”. This limitation is an additional element that generally links the use of the judicial exception to a particular technological environment or field of use. See MPEP §2106.05(h). Further, the claim recites: “wherein the policy neural network is configured to receive a policy input comprising an input observation characterizing a state of the environment and data identifying a skill from a set of skills”. This limitation amounts to mere data gathering. It is necessary to acquire the data in order to use the recited judicial exception. Therefore, this limitation is insignificant extra-solution activity to the judicial exception, see MPEP §2106.05(g). Further, the claim recites: “by controlling the agent using the policy neural network while the policy neural network is conditioned on the selected skill”, “while controlled using the policy neural network that is conditioned on the selected skill”, “using a relative discriminator neural network that is configured to…”, and “using an absolute discriminator neural network that is configured to…”. These limitations are additional elements that generally link the use of the judicial exception to a particular technological environment or field of use. See MPEP §2106.05(h). Finally, the claim recites: “training the policy neural network on the reward for the trajectory”. This is an additional element that amounts to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely uses a computer in its ordinary capacity as a tool to perform an existing process. See MPEP §2106.05(f).
Step 2B Significantly more: The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional elements: “a policy neural network for use in controlling an agent interacting with an environment”, “by controlling the agent using the policy neural network while the policy neural network is conditioned on the selected skill”, “while controlled using the policy neural network that is conditioned on the selected skill”, “using a relative discriminator neural network that is configured to…”, and “using an absolute discriminator neural network that is configured to…” generally link the use of the judicial exception to a particular technological environment or field of use. Elements that merely generally link the use of the judicial exception to a particular technological environment or field of use cannot provide an inventive concept. Further, the additional element “wherein the policy neural network is configured to receive a policy input comprising an input observation characterizing a state of the environment and data identifying a skill from a set of skills” is insignificant extra-solution activity to the judicial exception and is directed to receiving or transmitting data over a network, which courts have recognized as well-understood, routine, and conventional when claimed in a generic manner, see MPEP §2106.05(d)(II). Finally, the additional elements “a system comprising one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform first operations” and “training the policy neural network on the reward for the trajectory” amount to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely use a computer in its ordinary capacity as a tool to perform an existing process. Elements that merely amount to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely use a computer in its ordinary capacity as a tool to perform an existing process cannot provide an inventive concept. The claim is not patent eligible.
Regarding claim 18, the rejection of claim 17 is incorporated, and further, claim 18 is substantially similar to claim 2 and is rejected in the same manner, with the same reasoning applying.
Regarding claim 19, the rejection of claim 17 is incorporated, and further, claim 19 is substantially similar to claim 3 and is rejected in the same manner, with the same reasoning applying.
Regarding claim 20, the rejection of claim 17 is incorporated, and further, claim 20 is substantially similar to claim 4 and is rejected in the same manner, with the same reasoning applying.
Regarding claim 21, the rejection of claim 20 is incorporated, and further, claim 21 is substantially similar to claim 5 and is rejected in the same manner, with the same reasoning applying.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claims 1-4, 8-13, and 16-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by He et al., Skill Discovery of Coordination in Multi-agent Reinforcement Learning, 06/07/2020, https://arxiv.org/pdf/2006.04021, hereinafter referred to as “He”.
Regarding claim 1, He teaches A method for training a policy neural network for use in controlling an agent interacting with an environment, wherein the policy neural network is configured to receive a policy input comprising an input observation characterizing a state of the environment and data identifying a skill from a set of skills and to generate a policy output that defines a control policy for controlling the agent (He, Page 5, Lines 1-2, “Multi-agent deep deterministic policy gradient (MADDPG) is an actor-critic MARL algorithm, composed of N actors with policy π_θi(u|x) and N critics Q_ψi(x, u)”; see also He, Page 5, Algorithm 1: MASD, Lines 3-8, “for each episode do / Sample a skill z ~ p(z) / Get observations x_0 for all agents / for each time step t do / u_t^(i) ~ π_θi(u|x) / Apply u_t to the environment and get observations x_{t+1}”; see also He, Page 4, Figure 1, which shows “z”, which is considered the “data identifying a skill from a set of skills” used as input to the policy), the method comprising repeatedly performing operations (He, Page 5, Algorithm 1: MASD, The algorithm is performed within a for loop and is thus considered to be “repeatedly” performed) comprising:
selecting a skill from the set of skills (He, Page 5, Algorithm 1: MASD, Line 4, “Sample a skill z ~ p(z)”);
generating a trajectory by controlling the agent using the policy neural network while the policy neural network is conditioned on the selected skill, the trajectory comprising a sequence of observations received while the agent interacts with the environment while controlled using the policy neural network that is conditioned on the selected skill (He, Page 5, Algorithm 1: MASD, Lines 3-8, “for each episode do / Sample a skill z ~ p(z) / Get observations x_0 for all agents / for each time step t do / u_t^(i) ~ π_θi(u|x) / Apply u_t to the environment and get observations x_{t+1}”; The “agents” are considered to be the “agent”, and the “observations x_{t+1}” are considered to be the “trajectory”; He, Page 3, Section 3, Paragraph 2, Lines 1-2, “Inspired by other unsupervised skill discovery methods in single agent RL, the straightforward way is letting the policy conditioned on a sampled latent variable z shared by all agents in each episode”; He, Page 3, Section 3, Paragraph 3, Lines 3-4, “the sampled skill controls the set of states visited by multiple agents”; see also He, Page 4, Figure 1, which shows the policy controlling the agent in the environment);
processing a relative input comprising (i) an initial observation in the sequence and (ii) a last observation in the sequence using a relative discriminator neural network that is configured to process the relative input to generate a relative output that includes a respective relative score corresponding to each skill in the set of skills, each relative score representing an estimated likelihood that the policy neural network was conditioned on the corresponding skill while the trajectory was generated (He, Page 4, Figure 1, The “global discriminator” is considered to be the “relative discriminator”, which takes all the observations “x_{t+1}” as input and thus takes “an initial observation in the sequence” and “a last observation in the sequence” as input; He, Page 5, Lines 5-8, “At the same time, we train the global discriminator and N local discriminators with supervision loss. The overall structure is depicted in Fig. 1. Notice that the latent space can either be continuous or categorical. When z is sampled from a k-category uniform distribution, the discriminator is equipped with categorical cross entropy loss”; In order to use “categorical cross entropy loss”, a person of ordinary skill in the art would recognize the output of the discriminator must be “a respective relative score corresponding to each skill in the set of skills”);
processing an absolute input comprising the last observation in the sequence using an absolute discriminator neural network that is configured to process the absolute input to generate an absolute output that includes a respective absolute score corresponding to each skill in the set of skills, each absolute score representing an estimated likelihood that the policy neural network was conditioned on the corresponding skill while the trajectory was generated (He, Page 4, Figure 1, The “local discriminator” is considered to be the “absolute discriminator”, and the final one takes “x_{t+1}^{(N)}” as input, which is the “last observation in the sequence”; He, Page 5, Lines 5-8, “At the same time, we train the global discriminator and N local discriminators with supervision loss. The overall structure is depicted in Fig. 1. Notice that the latent space can either be continuous or categorical. When z is sampled from a k-category uniform distribution, the discriminator is equipped with categorical cross entropy loss”; In order to use “categorical cross entropy loss”, a person of ordinary skill in the art would recognize the output of the discriminator must be “a respective absolute score corresponding to each skill in the set of skills”);
generating a reward for the trajectory from the absolute score corresponding to the selected skill and the relative score corresponding to the selected skill (He, Page 4, Equation 7, “r_z = log q_ϕ(z|f(x)) - (1/N) Σ_{i=1}^{N} log q_ϕi(z|f(x_i))”; “r_z” is considered to be the “reward”; “q_ϕ(z|f(x))” is considered to be the “relative score corresponding to the selected skill” and “q_ϕi(z|f(x_i))” is considered to be the “absolute score corresponding to the selected skill”); and
training the policy neural network on the reward for the trajectory (He, Page 4, Line 11 and Equation 7, “Therefore we feed the multi-agent policies with pseudo reward r_z = log q_ϕ(z|f(x)) - (1/N) Σ_{i=1}^{N} log q_ϕi(z|f(x_i))”; see also He, Page 5, Algorithm 1: MASD, Lines 9-16).
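For illustration only, the operations mapped above from He amount to a skill-conditioned rollout followed by an Equation 7-style pseudo reward. The following is a minimal sketch under stated assumptions: the policy, environment, and both discriminators are hypothetical toy stubs that only fix shapes, not He's MADDPG actors or learned discriminators.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_SKILLS, OBS_DIM = 8, 4

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical stand-ins; He et al. use MADDPG actors and learned discriminators.
def policy(obs, z):
    return float(np.tanh(obs.sum() + z))            # skill-conditioned action

def env_reset():
    return rng.normal(size=OBS_DIM)

def env_step(action):
    return rng.normal(size=OBS_DIM) + action

def q_relative(first_obs, last_obs):
    # per-skill probabilities from (initial observation, last observation)
    return softmax(np.arange(NUM_SKILLS) * float((last_obs - first_obs).mean()))

def q_absolute(last_obs):
    # per-skill probabilities from the last observation alone
    return softmax(np.arange(NUM_SKILLS) * float(last_obs.mean()))

# One repetition of the mapped operations.
z = int(rng.integers(NUM_SKILLS))                   # select a skill
obs = env_reset()
trajectory = [obs]
for _ in range(50):                                 # generate a trajectory under z
    obs = env_step(policy(obs, z))
    trajectory.append(obs)

rel = q_relative(trajectory[0], trajectory[-1])[z]  # relative score for z
ab = q_absolute(trajectory[-1])[z]                  # absolute score for z
pseudo_reward = np.log(rel) - np.log(ab)            # Equation 7-style reward
```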
Regarding claim 2, the rejection of claim 1 is incorporated, and further, He teaches training the absolute discriminator neural network to optimize an objective function that encourages the absolute score corresponding to the selected skill to be increased (He, Page 5, Lines 5-8, “At the same time, we train the global discriminator and N local discriminators with supervision loss. The overall structure is depicted in Fig. 1. Notice that the latent space can either be continuous or categorical. When z is sampled from a k-category uniform distribution, the discriminator is equipped with categorical cross entropy loss”; He, Page 5, Algorithm 1: MASD, Lines 17-19; A person of ordinary skill in the art would recognize that the “categorical cross entropy loss” function is considered to be “an objective function”, and updating the model to reduce this loss drives the predicted probability of the correct label upward, encouraging the “absolute score corresponding to the selected skill” to be increased).
Regarding claim 3, the rejection of claim 1 is incorporated, and further, He teaches training the relative discriminator neural network to optimize an objective function that encourages the relative score corresponding to the selected skill to be increased (He, Page 5, Lines 5-8, “At the same time, we train the global discriminator and N local discriminators with supervision loss. The overall structure is depicted in Fig. 1. Notice that the latent space can either be continuous or categorical. When z is sampled from a k-category uniform distribution, the discriminator is equipped with categorical cross entropy loss”; He, Page 5, Algorithm 1: MASD, Lines 17-19; A person of ordinary skill in the art would recognize that the “categorical cross entropy loss” function is considered to be “an objective function”, and updating the model to reduce this loss drives the predicted probability of the correct label upward, encouraging the “relative score corresponding to the selected skill” to be increased).
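For illustration only, a minimal sketch of the categorical cross entropy training cited in the rejections of claims 2 and 3, assuming a toy linear discriminator rather than He's architecture: minimizing the cross entropy loss against the selected skill as the label necessarily increases the probability (score) assigned to that skill.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
NUM_SKILLS, FEAT_DIM, LR = 8, 4, 0.5

W = rng.normal(size=(NUM_SKILLS, FEAT_DIM))  # toy linear discriminator
features = rng.normal(size=FEAT_DIM)         # e.g. an encoded observation
z = 3                                        # skill the policy was conditioned on

for _ in range(100):
    probs = softmax(W @ features)
    # gradient of -log probs[z] with respect to the logits is (probs - onehot_z)
    grad_logits = probs.copy()
    grad_logits[z] -= 1.0
    W -= LR * np.outer(grad_logits, features)

assert softmax(W @ features)[z] > 0.9        # score for the selected skill rose
```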
Regarding claim 4, the rejection of claim 1 is incorporated, and further, He teaches wherein the absolute discriminator neural network and the relative discriminator neural network share some parameters (He, Page 4, Figure 1, The global discriminator, which is considered to be the relative discriminator, receives the entire trajectory as input, and the final local discriminator, which is considered to be the absolute discriminator, receives the last observation as input, thus the discriminators share input parameters).
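For illustration only, a sketch of the parameter sharing recited in claims 4-7 of the application (a shared encoder feeding separate absolute and relative decoder heads). The layers and shapes are assumptions for the sketch; nothing here is taken from He.

```python
import numpy as np

rng = np.random.default_rng(2)
OBS_DIM, ENC_DIM, NUM_SKILLS = 4, 8, 8

W_enc = rng.normal(size=(ENC_DIM, OBS_DIM))         # shared encoder parameters
W_abs = rng.normal(size=(NUM_SKILLS, ENC_DIM))      # absolute decoder head
W_rel = rng.normal(size=(NUM_SKILLS, 2 * ENC_DIM))  # relative decoder head

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def encode(obs):
    return np.tanh(W_enc @ obs)                     # shared across both heads

first, last = rng.normal(size=OBS_DIM), rng.normal(size=OBS_DIM)
abs_scores = softmax(W_abs @ encode(last))          # last observation only
rel_scores = softmax(W_rel @ np.concatenate([encode(first), encode(last)]))
```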
Regarding claim 8, the rejection of claim 1 is incorporated, and further, He teaches training the policy neural network on the reward for the trajectory comprises training the neural network to maximize time discounted expected rewards for generated trajectories, and wherein: the reward rewards high relative scores and penalizes high absolute scores (He, Page 4, Line 11 and Equation 7, “Therefore we feed the multi-agent policies with pseudo reward r_z = log q_ϕ(z|f(x)) - (1/N) Σ_{i=1}^{N} log q_ϕi(z|f(x_i))”; High relative scores are rewarded because they increase the overall pseudo reward, and high absolute scores are penalized because they decrease the overall pseudo reward; see also He, Page 5, Algorithm 1: MASD, Lines 9-16; He, Page 4, Figure 1 Description, Lines 1-2, “Agents receive pseudo reward computed by discriminators to improve their coordinating skills”; He, Page 2, Section 2.1, Lines 9-10, “In supervised and decentralized scenarios, the agents improve their policies π_i(u_i|x_i) to maximize their collective expected accumulative discounted reward E_{s_0, π, P}[Σ_{t=0}^{T} γ^t R_t]”).
Regarding claim 9, the rejection of claim 8 is incorporated, and further, He teaches wherein the reward is equal to or directly proportional to a difference between the relative score corresponding to the selected skill and the absolute score corresponding to the selected skill (He, Page 4, Line 11 and Equation 7, “Therefore we feed the multi-agent policies with pseudo reward r_z = log q_ϕ(z|f(x)) - (1/N) Σ_{i=1}^{N} log q_ϕi(z|f(x_i))”; “q_ϕ(z|f(x))” is considered to be the “relative score corresponding to the selected skill” and “q_ϕi(z|f(x_i))” is considered to be the “absolute score corresponding to the selected skill”).
Regarding claim 10, the rejection of claim 8 is incorporated, and further, He teaches wherein the reward is equal to or directly proportional to a difference between a logarithm of the relative score corresponding to the selected skill and a logarithm of the absolute score corresponding to the selected skill (He, Page 4, Line 11 and Equation 7, “Therefore we feed the multi-agent policies with pseudo reward r_z = log q_ϕ(z|f(x)) - (1/N) Σ_{i=1}^{N} log q_ϕi(z|f(x_i))”; “q_ϕ(z|f(x))” is considered to be the “relative score corresponding to the selected skill” and “q_ϕi(z|f(x_i))” is considered to be the “absolute score corresponding to the selected skill”).
Regarding claim 11, the rejection of claim 1 is incorporated, and further, He teaches wherein selecting a skill from the set of skills comprises: sampling a skill from a uniform probability distribution over the set of skills (He, Page 5, Algorithm 1: MASD, Line 4, “Sample a skill z ~ p(z)”; He, Page 5, Lines 6-8, “Notice that the latent space can either be continuous or categorical. When z is sampled from a k-category uniform distribution, the discriminator is equipped with categorical cross entropy loss”).
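For illustration only, sampling z ~ p(z) with p(z) uniform over a k-category skill set, as quoted above, can be sketched as:

```python
import numpy as np

k = 8                                    # assumed number of skills
z = np.random.default_rng().integers(k)  # each skill drawn with probability 1/k
```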
Regarding claim 12, the rejection of claim 1 is incorporated, and further, He teaches wherein training the policy neural network on the reward for the trajectory comprises training the policy neural network through off-policy reinforcement learning (He, Page 5, Algorithm 1: MASD; He, Page 5, Lines 1-2, “Multi-agent deep deterministic policy gradient (MADDPG) is an actor-critic MARL algorithm, composed of N actors with policy π_θi(u|x) and N critics Q_ψi(x, u)”; MADDPG is a deep deterministic policy gradient (DDPG)-based actor-critic algorithm that learns from transitions stored in a replay buffer, and such replay-based training is considered to be “off-policy reinforcement learning”).
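For illustration only, the off-policy character of MADDPG-style training comes from updating on transitions sampled from a replay buffer rather than only on the most recent rollout; a minimal buffer sketch (an assumed mechanism for the sketch, not He's code):

```python
import random
from collections import deque

buffer = deque(maxlen=100_000)        # replay buffer of past transitions

def store(obs, action, reward, next_obs):
    buffer.append((obs, action, reward, next_obs))

def sample_batch(batch_size=64):
    # Off-policy update: the minibatch may contain transitions generated
    # by older versions of the policy.
    return random.sample(buffer, min(batch_size, len(buffer)))
```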
Regarding claim 13, the rejection of claim 1 is incorporated, and further, He teaches wherein generating the trajectory comprises generating the trajectory starting from a last state of the environment for a preceding trajectory, and wherein the initial observation characterizes the last state of the environment for the preceding trajectory (He, Page 5, Algorithm 1: MASD; A person of ordinary skill in the art would recognize that because the algorithm is in a for loop, the next iteration takes place from where the last ended, which is considered to be “generating the trajectory starting from a last state of the environment for a preceding trajectory”; see also He, Page 2, Section 2.1).
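For illustration only, the reading applied to claim 13 (each new trajectory begins at the state where the preceding one ended) can be sketched as a loop that never resets the environment state between episodes; the transition rule is a toy stand-in:

```python
state = 0.0                           # assumed initial environment state
trajectories = []
for episode in range(3):
    trajectory = [state]              # initial observation = last state of
    for t in range(5):                # the preceding trajectory
        state = state + 1.0           # toy stand-in transition
        trajectory.append(state)
    trajectories.append(trajectory)   # next episode continues from `state`
```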
Regarding claim 16, He teaches One or more non-transitory computer-readable storage media encoded with instructions that, when executed by one or more computers, cause the one or more computers to perform first operations (He, Pages 5-8, Section 4, A person of ordinary skill in the art would recognize that the experiments and the collection of results represented by Figures 2-5 would be performed on a computer, which provides evidence for “one or more non-transitory computer-readable storage media”, “instructions”, and the “one or more computers”) for training a policy neural network for use in controlling an agent interacting with an environment, wherein the policy neural network is configured to receive a policy input comprising an input observation characterizing a state of the environment and data identifying a skill from a set of skills and to generate a policy output that defines a control policy for controlling the agent (He, Page 5, Lines 1-2, “Multi-agent deep deterministic policy gradient (MADDPG) is an actor-critic MARL algorithm, composed of N actors with policy π_θi(u|x) and N critics Q_ψi(x, u)”; see also He, Page 5, Algorithm 1: MASD, Lines 3-8, “for each episode do / Sample a skill z ~ p(z) / Get observations x_0 for all agents / for each time step t do / u_t^(i) ~ π_θi(u|x) / Apply u_t to the environment and get observations x_{t+1}”; see also He, Page 4, Figure 1, which shows “z”, which is considered the “data identifying a skill from a set of skills” used as input to the policy), the first operations comprising repeatedly performing second operations (He, Page 5, Algorithm 1: MASD, The algorithm is performed within a for loop and is thus considered to be “repeatedly” performed) comprising:
selecting a skill from the set of skills (He, Page 5, Algorithm 1: MASD, Line 4, “Sample a skill z ~ p(z)”);
generating a trajectory by controlling the agent using the policy neural network while the policy neural network is conditioned on the selected skill, the trajectory comprising a sequence of observations received while the agent interacts with the environment while controlled using the policy neural network that is conditioned on the selected skill (He, Page 5, Algorithm 1: MASD, Lines 3-8, “for each episode do / Sample a skill z ~ p(z) / Get observations x_0 for all agents / for each time step t do / u_t^(i) ~ π_θi(u|x) / Apply u_t to the environment and get observations x_{t+1}”; The “agents” are considered to be the “agent”, and the “observations x_{t+1}” are considered to be the “trajectory”; He, Page 3, Section 3, Paragraph 2, Lines 1-2, “Inspired by other unsupervised skill discovery methods in single agent RL, the straightforward way is letting the policy conditioned on a sampled latent variable z shared by all agents in each episode”; He, Page 3, Section 3, Paragraph 3, Lines 3-4, “the sampled skill controls the set of states visited by multiple agents”; see also He, Page 4, Figure 1, which shows the policy controlling the agent in the environment);
processing a relative input comprising (i) an initial observation in the sequence and (ii) a last observation in the sequence using a relative discriminator neural network that is configured to process the relative input to generate a relative output that includes a respective relative score corresponding to each skill in the set of skills, each relative score representing an estimated likelihood that the policy neural network was conditioned on the corresponding skill while the trajectory was generated (He, Page 4, Figure 1, The “global discriminator” is considered to be the “relative discriminator”, which takes all the observations “x_{t+1}” as input and thus takes “an initial observation in the sequence” and “a last observation in the sequence” as input; He, Page 5, Lines 5-8, “At the same time, we train the global discriminator and N local discriminators with supervision loss. The overall structure is depicted in Fig. 1. Notice that the latent space can either be continuous or categorical. When z is sampled from a k-category uniform distribution, the discriminator is equipped with categorical cross entropy loss”; In order to use “categorical cross entropy loss”, a person of ordinary skill in the art would recognize the output of the discriminator must be “a respective relative score corresponding to each skill in the set of skills”);
processing an absolute input comprising the last observation in the sequence using an absolute discriminator neural network that is configured to process the absolute input to generate an absolute output that includes a respective absolute score corresponding to each skill in the set of skills, each absolute score representing an estimated likelihood that the policy neural network was conditioned on the corresponding skill while the trajectory was generated (He, Page 4, Figure 1, The “local discriminator” is considered to be the “absolute discriminator”, and the final one takes “x_{t+1}^{(N)}” as input, which is the “last observation in the sequence”; He, Page 5, Lines 5-8, “At the same time, we train the global discriminator and N local discriminators with supervision loss. The overall structure is depicted in Fig. 1. Notice that the latent space can either be continuous or categorical. When z is sampled from a k-category uniform distribution, the discriminator is equipped with categorical cross entropy loss”; In order to use “categorical cross entropy loss”, a person of ordinary skill in the art would recognize the output of the discriminator must be “a respective absolute score corresponding to each skill in the set of skills”);
generating a reward for the trajectory from the absolute score corresponding to the selected skill and the relative score corresponding to the selected skill (He, Page 4, Equation 7, “r_z = log q_ϕ(z|f(x)) - (1/N) Σ_{i=1}^{N} log q_ϕi(z|f(x_i))”; “r_z” is considered to be the “reward”; “q_ϕ(z|f(x))” is considered to be the “relative score corresponding to the selected skill” and “q_ϕi(z|f(x_i))” is considered to be the “absolute score corresponding to the selected skill”); and
training the policy neural network on the reward for the trajectory (He, Page 4, Line 11 and Equation 7, “Therefore we feed the multi-agent policies with pseudo reward r_z = log q_ϕ(z|f(x)) - (1/N) Σ_{i=1}^{N} log q_ϕi(z|f(x_i))”; see also He, Page 5, Algorithm 1: MASD, Lines 9-16).
Regarding claim 17, He teaches A system comprising one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform first operations (He, Pages 5-8, Section 4, A person of ordinary skill in the art would recognize that the experiments and the collection of results represented by Figures 2-5 would be performed on a computer, which provides evidence for “A system”, “one or more computers”, “one or more storage devices”, and “instructions”) for training a policy neural network for use in controlling an agent interacting with an environment, wherein the policy neural network is configured to receive a policy input comprising an input observation characterizing a state of the environment and data identifying a skill from a set of skills and to generate a policy output that defines a control policy for controlling the agent (He, Page 5, Lines 1-2, “Multi-agent deep deterministic policy gradient (MADDPG) is an actor-critic MARL algorithm, composed of N actors with policy π_θi(u|x) and N critics Q_ψi(x, u)”; see also He, Page 5, Algorithm 1: MASD, Lines 3-8, “for each episode do / Sample a skill z ~ p(z) / Get observations x_0 for all agents / for each time step t do / u_t^(i) ~ π_θi(u|x) / Apply u_t to the environment and get observations x_{t+1}”; see also He, Page 4, Figure 1, which shows “z”, which is considered the “data identifying a skill from a set of skills” used as input to the policy), the first operations comprising repeatedly performing second operations (He, Page 5, Algorithm 1: MASD, The algorithm is performed within a for loop and is thus considered to be “repeatedly” performed) comprising:
selecting a skill from the set of skills (He, Page 5, Algorithm 1: MASD, Line 4, “Sample a skill z ~ p(z)”);
generating a trajectory by controlling the agent using the policy neural network while the policy neural network is conditioned on the selected skill, the trajectory comprising a sequence of observations received while the agent interacts with the environment while controlled using the policy neural network that is conditioned on the selected skill (He, Page 5, Algorithm 1: MASD, Lines 3-8, “for each episode do / Sample a skill z ~ p(z) / Get observations x_0 for all agents / for each time step t do / u_t^(i) ~ π_θi(u|x) / Apply u_t to the environment and get observations x_{t+1}”; The “agents” are considered to be the “agent”, and the “observations x_{t+1}” are considered to be the “trajectory”; He, Page 3, Section 3, Paragraph 2, Lines 1-2, “Inspired by other unsupervised skill discovery methods in single agent RL, the straightforward way is letting the policy conditioned on a sampled latent variable z shared by all agents in each episode”; He, Page 3, Section 3, Paragraph 3, Lines 3-4, “the sampled skill controls the set of states visited by multiple agents”; see also He, Page 4, Figure 1, which shows the policy controlling the agent in the environment);
processing a relative input comprising (i) an initial observation in the sequence and (ii) a last observation in the sequence using a relative discriminator neural network that is configured to process the relative input to generate a relative output that includes a respective relative score corresponding to each skill in the set of skills, each relative score representing an estimated likelihood that the policy neural network was conditioned on the corresponding skill while the trajectory was generated (He, Page 4, Figure 1, The “global discriminator” is considered to be the “relative discriminator”, which takes all the observations “x_{t+1}” as input and thus takes “an initial observation in the sequence” and “a last observation in the sequence” as input; He, Page 5, Lines 5-8, “At the same time, we train the global discriminator and N local discriminators with supervision loss. The overall structure is depicted in Fig. 1. Notice that the latent space can either be continuous or categorical. When z is sampled from a k-category uniform distribution, the discriminator is equipped with categorical cross entropy loss”; In order to use “categorical cross entropy loss”, a person of ordinary skill in the art would recognize that the output of the discriminator must be “a respective relative score corresponding to each skill in the set of skills”);
processing an absolute input comprising the last observation in the sequence using an absolute discriminator neural network that is configured to process the absolute input to generate an absolute output that includes a respective absolute score corresponding to each skill in the set of skills, each absolute score representing an estimated likelihood that the policy neural network was conditioned on the corresponding skill while the trajectory was generated (He, Page 4, Figure 1, The “local discriminator” is considered to be the “absolute discriminator”, and the final one takes “x_{t+1}^{(N)}” as input, which is the “last observation in the sequence”; He, Page 5, Lines 5-8, “At the same time, we train the global discriminator and N local discriminators with supervision loss. The overall structure is depicted in Fig. 1. Notice that the latent space can either be continuous or categorical. When z is sampled from a k-category uniform distribution, the discriminator is equipped with categorical cross entropy loss”; In order to use “categorical cross entropy loss”, a person of ordinary skill in the art would recognize that the output of the discriminator must be “a respective absolute score corresponding to each skill in the set of skills”);
generating a reward for the trajectory from the absolute score corresponding to the selected skill and the relative score corresponding to the selected skill (He, Page 4, Equation 7, “r_z = log q_ϕ(z | f(x)) - (1/N) Σ_{i=1}^{N} log q_{ϕ_i}(z | f(x_i))”; “r_z” is considered to be the “reward”; “q_ϕ(z | f(x))” is considered to be the “relative score corresponding to the selected skill” and “q_{ϕ_i}(z | f(x_i))” is considered to be the “absolute score corresponding to the selected skill”); and
training the policy neural network on the reward for the trajectory (He, Page 4, Line 11 and Equation 7, “Therefore we feed the multi-agent policies with pseudo reward r_z = log q_ϕ(z | f(x)) - (1/N) Σ_{i=1}^{N} log q_{ϕ_i}(z | f(x_i))”; see also He, Page 5, Algorithm 1: MASD, Lines 9-16; an explanatory sketch of the overall loop appears below).
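For explanatory purposes only, the following minimal sketch illustrates the shape of the loop mapped above: select a skill, roll out the skill-conditioned policy to generate a trajectory, score the trajectory with the two discriminators, form the pseudo-reward, and update the policy. The interfaces env, policy, relative_disc, absolute_discs, and policy_update are hypothetical placeholders, not He's code and not the claimed method.

```python
import numpy as np

def training_iteration(env, policy, relative_disc, absolute_discs,
                       num_skills, rng, policy_update):
    # Select a skill from the set of skills (He: "Sample a skill z ~ p(z)").
    skill = rng.integers(num_skills)

    # Generate a trajectory while the policy is conditioned on the skill.
    obs = env.reset()
    trajectory = [obs]
    done = False
    while not done:
        action = policy(obs, skill)      # policy input: observation + skill id
        obs, done = env.step(action)
        trajectory.append(obs)

    # Relative scores from the (initial, last) observations; absolute
    # scores from the last observation, one vector per local discriminator.
    rel = relative_disc(trajectory[0], trajectory[-1])        # (num_skills,)
    abs_scores = np.stack([d(trajectory[-1]) for d in absolute_discs])

    # Pseudo-reward in the style of He's Equation 7.
    reward = float(np.log(rel[skill])
                   - np.mean(np.log(abs_scores[:, skill])))

    # Train the policy neural network on the reward for the trajectory.
    policy_update(policy, trajectory, skill, reward)
    return reward
```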
Regarding claim 18, the rejection of claim 17 is incorporated, and further, He teaches training the absolute discriminator neural network to optimize an objective function that encourages the absolute score corresponding to the selected skill to be increased (He, Page 5, Lines 5-8, “At the same time, we train the global discriminator and N local discriminators with supervision loss. The overall structure is depicted in Fig. 1. Notice that the latent space can either be continuous or categorical. When z is sampled from a k-category uniform distribution, the discriminator is equipped with categorical cross entropy loss”; He, Page 5, Algorithm 1: MASD, Lines 17-19; A person of ordinary skill in the art would recognize that the “categorical cross entropy loss” function is considered to be “an objective function”, and updating the model with this loss encourages the error to be lower, which encourages the “absolute score corresponding to the selected skill” to be increased).
Regarding claim 19, the rejection of claim 17 is incorporated, and further, He teaches training the relative discriminator neural network to optimize an objective function that encourages the relative score corresponding to the selected skill to be increased (He, Page 5, Lines 5-8, “At the same time, we train the global discriminator and N local discriminators with supervision loss. The overall structure is depicted in Fig. 1. Notice that the latent space can either be continuous or categorical. When z is sampled from a k-category uniform distribution, the discriminator is equipped with categorical cross entropy loss”; He, Page 5, Algorithm 1: MASD, Lines 17-19; A person of ordinary skill in the art would recognize that the “categorical cross entropy loss” function is considered to be “an objective function”, and updating the model with this loss encourages the error to be lower, which encourages the “relative score corresponding to the selected skill” to be increased).
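For explanatory purposes only, the following is a minimal sketch, assuming a PyTorch-style setup, of one supervised discriminator update with categorical cross entropy, where the sampled skill index serves as the class label; the names discriminator_step, disc_input, and optimizer are hypothetical. Minimizing this loss raises the score the discriminator assigns to the selected skill.

```python
import torch
import torch.nn.functional as F

def discriminator_step(discriminator, optimizer, disc_input, skill):
    """One supervised update: the sampled skill z is the target class.

    disc_input: tensor input to the discriminator (e.g., the last
        observation for an absolute discriminator, or the initial and
        last observations for a relative discriminator).
    skill: integer index of the skill the policy was conditioned on.
    """
    logits = discriminator(disc_input)       # shape (1, num_skills)
    target = torch.tensor([skill])
    loss = F.cross_entropy(logits, target)   # categorical cross entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                         # lowering the loss raises the
                                             # score of the selected skill
    return loss.item()
```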
Regarding claim 20, the rejection of claim 17 is incorporated, and further, He teaches wherein the absolute discriminator neural network and the relative discriminator neural network share some parameters (He, Page 4, Figure 1, The global discriminator, which is considered to be the relative discriminator, receives the entire trajectory as input, and the final local discriminator, which is considered to be the absolute discriminator, receives the last observation as input, thus the discriminators share input parameters).
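For background illustration only, the following is a minimal sketch of the generic pattern of two discriminator heads sharing trunk parameters; this is a common architectural idiom, not He's depicted architecture, and the class name, layer sizes, and simplified single-observation input are hypothetical.

```python
import torch.nn as nn

class SharedTrunkDiscriminators(nn.Module):
    """Generic pattern: two discriminator heads over a shared trunk."""
    def __init__(self, obs_dim, hidden_dim, num_skills):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden_dim), nn.ReLU())
        self.relative_head = nn.Linear(hidden_dim, num_skills)
        self.absolute_head = nn.Linear(hidden_dim, num_skills)

    def forward(self, obs):
        h = self.trunk(obs)                  # shared parameters
        return self.relative_head(h), self.absolute_head(h)
```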
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over He in view of Eysenbach et al., Leave no Trace: Learning to Reset for Safe and Autonomous Reinforcement Learning, 11/18/2017, https://arxiv.org/pdf/1711.06782, hereinafter referred to as “Eysenbach”.
Regarding claim 14, the rejection of claim 13 is incorporated.
He does not explicitly teach after generating the trajectory, determining whether criteria have been satisfied for resetting the environment and in response to determining that the criteria are satisfied, selecting, as an initial state for a next trajectory to be generated, a state of the environment from a set of possible initial states of the environment.
Eysenbach teaches after generating the trajectory, determining whether criteria have been satisfied for resetting the environment (Eysenbach, Page 4, Section 4.3, Lines 1-2, “When running the forward policy, we perform an early abort if the Q value for the reset policy is less than Q_min”; see also Eysenbach, Page 4, Algorithm 1, Steps 3-5, which show this check is done after generating a trajectory) and
in response to determining that the criteria are satisfied, selecting, as an initial state for a next trajectory to be generated, a state of the environment from a set of possible initial states of the environment (Eysenbach, Page 2, Paragraph 2, Lines 5-6, “In this work, we learn two policies that alternate between attempting the task and resetting the environment”; Eysenbach, Page 3, Section 4.1, Line 1, “The reset policy learns how to transition from the forward policy’s final state back to an initial state”; see also Eysenbach, Page 3, Section 3, Paragraph 2).
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention, to have modified the policy neural network training method of He to include checking if the environment should be reset and returning to an initial state if so as taught by Eysenbach. The motivation to do so would have been that in a robotic setting, implementing a resetting mechanism reduces the number of manual resets required to learn a task and can reduce the number of non-reversible states visited (Eysenbach, Page 1, Abstract, Lines 9-14, “By learning a value function for the reset policy, we can automatically determine when the forward policy is about to enter a non-reversible state, providing for uncertainty-aware safety aborts. Our experiments illustrate that proper use of the reset policy can greatly reduce the number of manual resets required to learn a task, can reduce the number of unsafe actions that lead to non-reversible states, and can automatically induce a curriculum”).
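For explanatory purposes only, the following is a minimal sketch of an early-abort reset check in the style described by Eysenbach: if the reset policy's Q value at the current state falls below a threshold Q_min, the environment is returned to a state drawn from a set of possible initial states before the next trajectory. The names maybe_reset, q_reset, initial_states, and env.set_state are hypothetical placeholders.

```python
import random

def maybe_reset(env, q_reset, state, initial_states, q_min):
    """Reset when the reset policy's value estimate signals that the
    current state may be hard to reverse (Eysenbach-style early abort)."""
    if q_reset(state) < q_min:               # reset criteria satisfied
        start = random.choice(initial_states)  # a possible initial state
        env.set_state(start)                  # next trajectory starts here
        return start
    return state
```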
Allowable Subject Matter
Claims 5-7 and 21 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Claims 5-7 and 21 have been rejected under 35 U.S.C. 101 only. A complete prior art search was performed for these claims; however, no prior art was uncovered that discloses or fairly suggests the following claimed features:
Reasons for allowance: after a detailed search, the cited art, neither alone nor in combination, teaches the claimed subject matter of claims 5 and 21, “wherein the absolute discriminator neural network and the relative discriminator neural network share an encoder neural network that generates encoded representations of received observations”.
Pertinent art (Nair et al., Visual Reinforcement Learning with Imagined Goals, 12/04/2018, https://arxiv.org/pdf/1807.04742) discloses the use of a VAE encoder and a VAE decoder in a reinforcement learning algorithm with goals; however, the reference does not teach the required limitations of claims 5 and 21: it does not disclose multiple discriminators that share an encoder, and its encoder encodes only goals and states, not entire observations.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Yang et al., Hierarchical Cooperative Multi-Agent Reinforcement Learning with Skill Discovery, 05/08/2020, https://arxiv.org/pdf/1912.03558: This reference discloses a two-level hierarchical multiagent reinforcement learning (MARL) algorithm with unsupervised skill discovery. Agents learn useful and distinct skills at the low level via independent Q-learning, while they learn to select complementary latent skill variables at the high level via centralized multiagent training with an extrinsic team reward.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MOLLY CLARKE SIPPEL whose telephone number is (571)272-3270. The examiner can normally be reached Monday - Friday, 7:30 a.m. - 4:30 p.m. ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki can be reached at (571)272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/M.C.S./Examiner, Art Unit 2122
/KAKALI CHAKI/Supervisory Patent Examiner, Art Unit 2122