Prosecution Insights
Last updated: April 18, 2026
Application No. 18/072,175

ACTION SELECTION FOR REINFORCEMENT LEARNING USING A MANAGER NEURAL NETWORK THAT GENERATES GOAL VECTORS DEFINING AGENT OBJECTIVES

Non-Final OA: §101, §103, §112, §DP
Filed: Nov 30, 2022
Examiner: VAUGHN, RYAN C
Art Unit: 2125
Tech Center: 2100 — Computer Architecture & Software
Assignee: DeepMind Technologies Limited
OA Round: 1 (Non-Final)
Grant Probability: 62% (Moderate)
Expected OA Rounds: 1-2
Est. Time to Grant: 3y 9m
Grant Probability with Interview: 81%

Examiner Intelligence

Career Allow Rate: 62% (grants 62% of resolved cases; 145 granted / 235 resolved; +6.7% vs TC avg)
Interview Lift: +19.4% (strong), comparing allowance across resolved cases with vs. without an interview
Typical Timeline: 3y 9m average prosecution; 45 applications currently pending
Career History: 280 total applications across all art units

Statute-Specific Performance

§101: 23.9% (-16.1% vs TC avg)
§103: 40.1% (+0.1% vs TC avg)
§102: 7.6% (-32.4% vs TC avg)
§112: 21.9% (-18.1% vs TC avg)

Tech Center averages are estimates, based on career data from 235 resolved cases.
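
These dashboard figures are simple ratios over the examiner's resolved cases. A minimal sketch of the arithmetic, assuming a hypothetical with/without-interview split and an assumed TC-average baseline chosen only to be consistent with the displayed 62%, +6.7%, and +19.4% (the per-group counts below are not from the page):

```python
granted, resolved = 145, 235                 # from the page
allow_rate = granted / resolved              # 0.617 -> displayed as 62%

tc_avg = 0.55                                # assumed TC 2100 baseline
print(f"allow rate {allow_rate:.1%} ({allow_rate - tc_avg:+.1%} vs TC avg)")

# Hypothetical split of the 235 resolved cases by interview status.
granted_int, resolved_int = 63, 85           # assumed: with interview
granted_no, resolved_no = 82, 150            # assumed: without interview
lift = granted_int / resolved_int - granted_no / resolved_no
print(f"interview lift {lift:+.1%}")         # ~ +19.5%
```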

Office Action

Rejections: §101, §103, §112, §DP
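
The rejected claims describe a two-level action-selection pipeline: a manager neural network subsystem maps an observation to a latent representation and a goal vector, and a worker neural network subsystem scores actions against pooled goals. A minimal sketch of that data flow, using untrained linear maps as stand-ins for the recited subsystems (all dimensions and names below are illustrative assumptions, not from the application):

```python
import numpy as np

rng = np.random.default_rng(0)

D_OBS, D_LATENT, D_EMBED, N_ACTIONS, C = 32, 16, 4, 5, 3   # illustrative sizes

# Untrained linear maps standing in for the recited neural network subsystems.
W_latent = rng.normal(size=(D_LATENT, D_OBS))    # observation -> latent z_t
W_goal   = rng.normal(size=(D_LATENT, D_LATENT)) # latent -> goal vector g_t
U_embed  = rng.normal(size=(N_ACTIONS, D_EMBED)) # per-action embedding vectors
W_proj   = rng.normal(size=(D_EMBED, D_LATENT))  # goal -> embedding space

goals = []  # history of manager goals, pooled by the worker

def select_action(observation):
    z = W_latent @ observation             # manager: latent representation (claim 1)
    g = W_goal @ z                         # manager: goal vector in latent space (claim 1)
    goals.append(g)
    g_pooled = np.sum(goals[-C:], axis=0)  # pool with preceding goals (claim 2)
    w = W_proj @ g_pooled                  # project goal to embedding space (claims 4, 7)
    scores = U_embed @ w                   # modulate action embeddings -> action scores
    return int(np.argmax(scores))          # pick the highest-scoring action (claim 5)

action = select_action(rng.normal(size=D_OBS))
```

In the claims each of these maps is a trained neural network (the manager's goal generator is recurrent, per claim 3), so the sketch fixes only the shapes and the order of operations.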
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. Claims 1-22 are presented for examination.

Information Disclosure Statement

The information disclosure statements (IDS) submitted on November 30, 2022; March 22, 2023; July 29, 2024; March 3, 2025; and February 6, 2026 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.

Claim Objections

Claim 2 is objected to because of the following informalities: "goals vectors" should be "goal vectors". Appropriate correction is required.

Applicant is advised that should claims 19 and 20 be found allowable, claims 22 and 21 will be objected to under 37 CFR 1.75 as being substantial duplicates thereof. When two claims in an application are duplicates or else are so close in content that they both cover the same thing, despite a slight difference in wording, it is proper after allowing one claim to object to the other as being a substantial duplicate of the allowed claim. See MPEP § 608.01(m).

Claim Interpretation

The following is a quotation of 35 U.S.C. 112(f):

(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph:

An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is invoked.

As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph:

(A) the claim limitation uses the term "means" or "step" or a term used as a substitute for "means" that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function;

(B) the term "means" or "step" or the generic placeholder is modified by functional language, typically, but not always linked by the transition word "for" (e.g., "means for") or another linking word or phrase, such as "configured to" or "so that"; and

(C) the term "means" or "step" or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function.

Use of the word "means" (or "step") in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph.
The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function.

Absence of the word "means" (or "step") in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material, or acts to entirely perform the recited function.

Claim limitations in this application that use the word "means" (or "step") are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word "means" (or "step") are not being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.

This application includes one or more claim limitations that do not use the word "means," but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Such claim limitation(s) is/are: "manager neural network subsystem" in claims 1-2 and 11; "worker neural network subsystem" in claims 1 and 10; and "training subsystem" in claims 1 and 11.

Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.

If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph.

Claim Rejections - 35 USC § 112

The following is a quotation of the first paragraph of 35 U.S.C. 112(a):

(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

The following is a quotation of the first paragraph of pre-AIA 35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.

Claims 1-18 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the written description requirement. The claim(s) contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention.

Claim limitations "manager neural network subsystem" in claims 1-2 and 11; "worker neural network subsystem" in claims 1 and 10; and "training subsystem" in claims 1 and 11 invoke 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. However, the written description fails to disclose the corresponding structure, material, or acts for performing the entire claimed function and to clearly link the structure, material, or acts to the function. Therefore, it is unclear whether Applicant had possession of the claimed invention as of the effective filing date. See rejection under 35 USC § 112(b) infra for further analysis.

The following is a quotation of 35 U.S.C. 112(b):

(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 1-22 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.

Claim limitations "manager neural network subsystem" in claims 1-2 and 11; "worker neural network subsystem" in claims 1 and 10; and "training subsystem" in claims 1 and 11 invoke 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. However, the written description fails to disclose the corresponding structure, material, or acts for performing the entire claimed function and to clearly link the structure, material, or acts to the function.

Regarding the "manager neural network subsystem," paragraph 7 repeats the claimed functions of the manager neural network subsystem, but it does not provide an algorithm for performing the entire claimed functions, nor do the further recitations of the subsystem in paragraphs 14-15. The same is true of the "worker neural network subsystem": paragraphs 7 and 14 repeat its functions but do not explain how the subsystem performs the functions. Regarding the "training subsystem," the specification never uses the term "training subsystem".
A "training engine" is disclosed in paragraphs 56-57, which Examiner will assume is equivalent to the claimed "training subsystem," but these paragraphs also merely repeat the claimed functions without providing an algorithm for performing them. Therefore, the claims are indefinite and are rejected under 35 U.S.C. 112(b) or pre-AIA 35 U.S.C. 112, second paragraph. For purposes of examination, any computer software that performs the claimed functions will be deemed to read on the claims.

Applicant may:

(a) Amend the claims so that the claim limitations will no longer be interpreted as limitations under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph;

(b) Amend the written description of the specification such that it expressly recites what structure, material, or acts perform the entire claimed functions, without introducing any new matter (35 U.S.C. 132(a)); or

(c) Amend the written description of the specification such that it clearly links the structure, material, or acts disclosed therein to the functions recited in the claims, without introducing any new matter (35 U.S.C. 132(a)).

If applicant is of the opinion that the written description of the specification already implicitly or inherently discloses the corresponding structure, material, or acts and clearly links them to the functions so that one of ordinary skill in the art would recognize what structure, material, or acts perform the claimed functions, applicant should clarify the record by either:

(a) Amending the written description of the specification such that it expressly recites the corresponding structure, material, or acts for performing the claimed function and clearly links or associates the structure, material, or acts to the claimed functions, without introducing any new matter (35 U.S.C. 132(a)); or

(b) Stating on the record what the corresponding structure, material, or acts, which are implicitly or inherently set forth in the written description of the specification, perform the claimed functions.

For more information, see 37 CFR 1.75(d) and MPEP §§ 608.01(o) and 2181.

In addition, claims 19-22 recite "a worker neural network system" where "a worker neural network system" has already been recited. Thus, it is unclear whether the second recitation of "a worker neural network system" refers to the same system as, or a different system from, the first recitation. For purposes of examination, Examiner will presume that the two refer to the same system and that the second recitation should read "the worker neural network system".

All claims dependent on a claim rejected hereunder are also rejected for being dependent on a rejected base claim.

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows:

Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-22 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The analysis of the claims will follow the 2019 Revised Patent Subject Matter Eligibility Guidance, 84 Fed. Reg. 50 ("2019 PEG").

Claim 1

Step 1: The claim is directed to a system comprising a computer and storage devices communicatively coupled to the computer; therefore, it is directed to the statutory category of machines.
Step 2A Prong 1: The claim recites, inter alia:

"[A]t each of a plurality of time steps: generat[ing] a latent representation, in a latent space, of a current state of the environment at the time step": This limitation could encompass mentally generating a latent representation representing a state of the environment at a time step. Alternatively, this limitation represents a mathematical concept.

"[G]enerat[ing], based at least in part on the latent representation of the current state of the environment at the time step, a goal vector that defines, in the latent state space, an objective to be accomplished as a result of actions performed by the agent in the environment": This limitation could encompass the mental generation of the goal vector defining an objective to be accomplished as a result of actions performed by the agent in the environment.

"[A]t each of the plurality of time steps: generate a respective action score for each action in the predetermined set of actions based at least in part on the goal vector for the time step": This limitation could encompass the mental generation of an action score based on the goal vector.

"[S]elect[ing] an action from the predetermined set of actions to be performed by the agent at the time step using the action scores": This limitation could encompass the mental selection of the action to be performed by the agent using the action scores.

"[D]etermining a respective reward for each time step of the plurality of time steps, comprising, for one or more time steps: determining the reward for the time step based at least in part on a difference in direction between: (i) a vector representing a change in the latent representation of the state of the environment from a preceding time step to the time step, and (ii) the goal vector for the preceding time step": This limitation could encompass the mental determination of the reward based on a difference between a vector representing a change in the latent representation and the goal vector. Additionally, taking the difference is a mathematical concept.

"[T]raining [a] worker neural network subsystem on the rewards using reinforcement learning techniques": Paragraphs 74-78 of the instant specification make clear that the training of the network is a mathematical algorithm.

Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim further recites that the system comprises "one or more computers and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to implement [the method]". The claim further recites that the operations are performed by a "manager neural network subsystem", "worker neural network subsystem", and "training subsystem". However, these are mere instructions to apply the judicial exception using a generic computer programmed with generically recited classes of computer algorithm. MPEP § 2106.05(f).

Step 2B: The claim does not contain significantly more than the judicial exception. The analysis at this step mirrors that of step 2A, prong 2. As an ordered whole, the claim is directed to a mathematical algorithm for selecting actions to be performed by an agent interacting with an environment, much of which can be performed mentally. Nothing in the claim provides significantly more than this. As such, the claim is not patent eligible.

Claim 2

Step 1: A machine, as above.
Step 2A Prong 1: The claim recites, inter alia, "at each of the plurality of time steps, … updat[ing] the goal vector for the time step by pooling the goal vector for the time step with goal[] vectors for one or more preceding time steps." This limitation could encompass mentally pooling the current goal vector with preceding goal vectors.

Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim further recites that the pooling is to be performed by "the manager neural network subsystem". However, this is a mere instruction to apply the judicial exception using a generic computer programmed with a generic class of computer algorithm. MPEP § 2106.05(f).

Step 2B: The claim does not contain significantly more than the judicial exception. The claim further recites that the pooling is to be performed by "the manager neural network subsystem". However, this is a mere instruction to apply the judicial exception using a generic computer programmed with a generic class of computer algorithm. MPEP § 2106.05(f).

Claim 3

Step 1: A machine, as above.

Step 2A Prong 1: The claim recites, inter alia, "processing the latent representation … in accordance with a … state … to generate the goal vector and to update the … state". This limitation could encompass mentally processing the latent representation to generate a goal vector and update a state.

Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim further recites that the processing is performed "using a goal recurrent neural network … in accordance with a hidden state of the goal recurrent neural network". However, this is a mere instruction to apply the judicial exception using a generic computer programmed with a generic class of computer algorithm. MPEP § 2106.05(f). The claim further recites "receiv[ing] the latent representation". This limitation is directed to the insignificant extra-solution activity of mere data gathering and output. MPEP § 2106.05(g).

Step 2B: The claim does not contain significantly more than the judicial exception. The receiving limitation is directed to the well-understood, routine, and conventional activity of receiving and transmitting data over a network. MPEP § 2106.05(d)(II); OIP Techs., Inc., v. Amazon.com, Inc., 788 F.3d 1359, 1363, 115 USPQ2d 1090, 1093 (Fed. Cir. 2015) (sending messages over a network). Otherwise, the analysis at this step mirrors that of step 2A, prong 2.

Claim 4

Step 1: A machine, as above.

Step 2A Prong 1: The claim recites:

"[G]enerating a respective action embedding vector in an embedding space for each action in the predetermined set of actions": This limitation could encompass mentally generating an embedding vector for each action of a set of actions; it also represents a mathematical concept.

"[P]rojecting the goal vector for the time step to the embedding space to generate a goal embedding vector": This limitation represents the mathematical concept of projecting a vector onto another vector space.

"[M]odulating the respective action embedding vector for each action by the goal embedding vector to generate the respective action score for each action in the predetermined set of actions": This limitation could encompass mentally modulating the action embedding vector by the goal embedding vector and generating an action score based thereon.

Step 2A Prong 2: This judicial exception is not integrated into a practical application. See claim 1 analysis.
Step 2B: The claim does not contain significantly more than the judicial exception. See claim 1 analysis.

Claim 5

Step 1: A machine, as above.

Step 2A Prong 1: The claim recites that "selecting the action comprises selecting the action having a highest action score." This limitation could encompass mentally selecting the action with the highest score.

Step 2A Prong 2: This judicial exception is not integrated into a practical application. See claim 1 analysis.

Step 2B: The claim does not contain significantly more than the judicial exception. See claim 1 analysis.

Claim 6

Step 1: A machine, as above.

Step 2A Prong 1: The claim recites, inter alia, "processing a representation of the current state of the environment …, in accordance with a … state …, to generate the action embedding vectors and to update the … state". This limitation could encompass mentally processing the representation of the state to generate action embedding vectors and update the state.

Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim further recites that the representation is processed "using an action score recurrent neural network" and that the states are "hidden" states of the network. However, these are mere instructions to apply the judicial exception using a generic computer programmed with a generically recited class of computer algorithm. MPEP § 2106.05(f).

Step 2B: The claim does not contain significantly more than the judicial exception. The claim further recites that the representation is processed "using an action score recurrent neural network" and that the states are "hidden" states of the network. However, these are mere instructions to apply the judicial exception using a generic computer programmed with a generically recited class of computer algorithm. MPEP § 2106.05(f).

Claim 7

Step 1: A machine, as above.

Step 2A Prong 1: The claim recites that "the goal vector has a higher dimensionality than the goal embedding vector." The generation of the action score using the goal vector and goal embedding vector remains a mental process/mathematical concept under these further assumptions.

Step 2A Prong 2: This judicial exception is not integrated into a practical application. See claim 4 analysis.

Step 2B: The claim does not contain significantly more than the judicial exception. See claim 4 analysis.

Claim 8

Step 1: A machine, as above.

Step 2A Prong 1: The claim recites that "the dimensionality of the goal vector is at least ten times higher than the dimensionality of the goal embedding vector." The generation of the action score using the goal vector and goal embedding vector remains a mental process/mathematical concept under these further assumptions.

Step 2A Prong 2: This judicial exception is not integrated into a practical application. See claim 7 analysis.

Step 2B: The claim does not contain significantly more than the judicial exception. See claim 7 analysis.

Claim 9

Step 1: A machine, as above.

Step 2A Prong 1: The claim recites, inter alia, "determining the reward for the time step based at least in part on the external reward for the time step." This limitation could encompass the mental determination of the reward based on the external reward.

Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim further recites "receiving an external reward for the time step as a result of the agent performing selected actions".
This limitation is directed to the insignificant extra-solution activity of mere data gathering and output. MPEP § 2106.05(g).

Step 2B: The claim does not contain significantly more than the judicial exception. The claim further recites "receiving an external reward for the time step as a result of the agent performing selected actions". This limitation is directed to the well-understood, routine, and conventional activity of receiving and transmitting data over a network. MPEP § 2106.05(d)(II); OIP Techs., Inc., v. Amazon.com, Inc., 788 F.3d 1359, 1363, 115 USPQ2d 1090, 1093 (Fed. Cir. 2015) (sending messages over a network).

Claim 10

Step 1: A machine, as above.

Step 2A Prong 1: The claim recites "training the worker neural network subsystem to generate action scores that maximize a time discounted combination of rewards." As noted above, the training represents a mathematical concept.

Step 2A Prong 2: This judicial exception is not integrated into a practical application. See claim 1 analysis.

Step 2B: The claim does not contain significantly more than the judicial exception. See claim 1 analysis.

Claim 11

Step 1: A machine, as above.

Step 2A Prong 1: The claim recites "training the manager neural network subsystem to generate goal vectors that result in action scores that encourage selection of actions that increase the external rewards received as a result of the agent performing the selected actions." As noted above, the training of the neural networks represents a mathematical concept.

Step 2A Prong 2: This judicial exception is not integrated into a practical application. See claim 9 analysis.

Step 2B: The claim does not contain significantly more than the judicial exception. See claim 9 analysis.

Claim 12

Step 1: A machine, as above.

Step 2A Prong 1: The claim recites, inter alia, "processing an observation characterizing the current state of the environment". This limitation could encompass mentally processing the observation.

Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim further recites that the processing occurs "using a convolutional neural network". However, this is a mere instruction to apply the judicial exception using a generic computer programmed with a generic class of computer algorithm. MPEP § 2106.05(f).

Step 2B: The claim does not contain significantly more than the judicial exception. The claim further recites that the processing occurs "using a convolutional neural network". However, this is a mere instruction to apply the judicial exception using a generic computer programmed with a generic class of computer algorithm. MPEP § 2106.05(f).

Claim 13

Step 1: A machine, as above.

Step 2A Prong 1: The claim recites, inter alia:

"[M]aintain[ing] an internal state that is partitioned into r sub-states, wherein r is an integer greater than one": This limitation could encompass mentally partitioning a state into multiple sub-states.

"[S]elect[ing] a sub-state from the r sub-states": This limitation could encompass mentally selecting the state.

"[P]rocess[ing] current values of the selected sub-state and the network input for the time step … to update the current values of the selected sub-state and to generate a[n] … output for the time step in accordance with current values of a set of LSTM network parameters": This limitation could encompass mentally processing the input and current values to update the state and generate an output based on network parameters.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim further recites that "the goal recurrent neural network is a dilated long short-term memory (LSTM) neural network, wherein the dilated LSTM neural network is configured to maintain [the state]". The claim also recites that the processing occurs "using an LSTM neural network" and that the output is a "network output". However, these are mere instructions to apply the judicial exception using a generic computer programmed with a generically recited class of computer algorithm. MPEP § 2106.05(f).

Step 2B: The claim does not contain significantly more than the judicial exception. The claim further recites that "the goal recurrent neural network is a dilated long short-term memory (LSTM) neural network, wherein the dilated LSTM neural network is configured to maintain [the state]". The claim also recites that the processing occurs "using an LSTM neural network" and that the output is a "network output". However, these are mere instructions to apply the judicial exception using a generic computer programmed with a generically recited class of computer algorithm. MPEP § 2106.05(f).

Claim 14

Step 1: A machine, as above.

Step 2A Prong 1: The claim recites, inter alia, "for each of the time steps: pool[ing] the … output for the time step and the … outputs for up to a predetermined number of preceding time steps to generate a final … output for the time step." This limitation could encompass mentally pooling outputs to generate a final output.

Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim further recites that the output is a "network output" and that the pooling is performed using "the dilated LSTM neural network". However, these are mere instructions to apply the judicial exception using a generic computer programmed with a generically recited class of computer algorithm. MPEP § 2106.05(f).

Step 2B: The claim does not contain significantly more than the judicial exception. The claim further recites that the output is a "network output" and that the pooling is performed using "the dilated LSTM neural network". However, these are mere instructions to apply the judicial exception using a generic computer programmed with a generically recited class of computer algorithm. MPEP § 2106.05(f).

Claim 15

Step 1: A machine, as above.

Step 2A Prong 1: The claim recites that "pooling the network outputs comprises summing the network outputs". This limitation recites a mathematical concept of calculating a sum and may be performed mentally.

Step 2A Prong 2: This judicial exception is not integrated into a practical application. See claim 14 analysis.

Step 2B: The claim does not contain significantly more than the judicial exception. See claim 14 analysis.

Claim 16

Step 1: A machine, as above.

Step 2A Prong 1: The claim recites that "the time steps in the plurality of time steps are indexed starting from 1 for the first time step in the plurality of time steps to T for the last time step in the plurality of time steps, wherein each sub-state is assigned an index ranging from 1 to r". Maintaining a state that is partitioned into these time steps remains mentally performable under these further assumptions.
The claim further recites "selecting a sub-state from the r sub-states comprises: selecting the sub-state having an index that is equal to the index of the time step modulo r." This represents a mathematical concept of taking the index of the time step modulo r and selecting the sub-state with that index; this could also be performed mentally.

Step 2A Prong 2: This judicial exception is not integrated into a practical application. See claim 13 analysis.

Step 2B: The claim does not contain significantly more than the judicial exception. See claim 13 analysis.

Claim 17

Step 1: A machine, as above.

Step 2A Prong 1: The claim recites the same judicial exceptions as in claim 13.

Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim further recites that "the LSTM neural network comprises a plurality of LSTM layers." However, this is a mere instruction to apply the judicial exception using a generic computer programmed with a generic class of computer algorithm. MPEP § 2106.05(f).

Step 2B: The claim does not contain significantly more than the judicial exception. The claim further recites that "the LSTM neural network comprises a plurality of LSTM layers." However, this is a mere instruction to apply the judicial exception using a generic computer programmed with a generic class of computer algorithm. MPEP § 2106.05(f).

Claim 18

Step 1: A machine, as above.

Step 2A Prong 1: The claim recites, inter alia, "setting an internal state … to the current values of the selected sub-state for the processing of the … input at the time step." This limitation could encompass mentally setting an internal state to the current values of the selected sub-state and processing the input at the time step using the state.

Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim further recites that the state is "of the LSTM neural network" and that the input is a "network input". However, these are mere instructions to apply the judicial exception using a generic computer programmed with a generically recited class of computer algorithm. MPEP § 2106.05(f).

Step 2B: The claim does not contain significantly more than the judicial exception. The claim further recites that the state is "of the LSTM neural network" and that the input is a "network input". However, these are mere instructions to apply the judicial exception using a generic computer programmed with a generically recited class of computer algorithm. MPEP § 2106.05(f).

Claims 19, 22

Step 1: The claims recite non-transitory computer storage media; therefore, they are directed to the statutory category of articles of manufacture.

Step 2A Prong 1: The claims recite the same judicial exceptions as in claim 1.

Step 2A Prong 2: This judicial exception is not integrated into a practical application. The analysis at this step mirrors that of claim 1, except insofar as these claims are directed to "[o]ne or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations for selecting actions to be performed by an agent that interacts with an environment by performing actions from a predetermined set of actions". However, this is a mere instruction to apply the judicial exception using a generic computer. MPEP § 2106.05(f).

Step 2B: The claim does not contain significantly more than the judicial exception.
The analysis at this step mirrors that of claim 1, except insofar as these claims are directed to "[o]ne or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations for selecting actions to be performed by an agent that interacts with an environment by performing actions from a predetermined set of actions". However, this is a mere instruction to apply the judicial exception using a generic computer. MPEP § 2106.05(f).

Claims 20-21

Step 1: The claims recite a method; therefore, they are directed to the statutory category of processes.

Step 2A Prong 1: The claims recite the same judicial exceptions as in claim 1.

Step 2A Prong 2: This judicial exception is not integrated into a practical application. The analysis at this step mirrors that of claim 1, except insofar as these claims recite that the method is "performed by one or more data processing apparatus". However, this is a mere instruction to apply the judicial exception using a generic computer. MPEP § 2106.05(f).

Step 2B: The claims do not contain significantly more than the judicial exception. The analysis at this step mirrors that of claim 1, except insofar as these claims recite that the method is "performed by one or more data processing apparatus". However, this is a mere instruction to apply the judicial exception using a generic computer. MPEP § 2106.05(f).

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:

1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1 and 19-22 are rejected under 35 U.S.C. 103 as being unpatentable over Léon et al., "Options Discovery with Budgeted Reinforcement Learning," in arXiv preprint arXiv:1611.06824 (2016) ("Léon") in view of Dayan et al., "Feudal Reinforcement Learning," in 5 Advances in Neural Info. Processing Sys. 271-78 (1993) ("Dayan") and further in view of Ghesu et al. (US 9569736) ("Ghesu").

Regarding claim 1, Léon discloses "[a] system for selecting actions to be performed by an agent that interacts with an environment by performing actions from a predetermined set of actions (an actor model [worker neural network subsystem] updates an actor state ht and computes a next action at; the next action at is drawn from the distribution softmax(fact(ht)) [predetermined set = entire distribution] – Léon, section 4.1, subsection entitled "Actor Model"), the system comprising one or more computers and one or more storage devices communicatively coupled to the one or more computers (in the computer science domain, works involving sequentially solving sub-tasks have led to the hierarchical reinforcement learning paradigm [implying that reinforcement learning is conducted by a computer containing a storage device with instructions] – Léon, sec. 1, second paragraph), wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to implement: a manager neural network subsystem that is configured to, at each of a plurality of time steps: generate a … representation … of a current state of the environment at the time step (in the BONN (Budgeted Options Neural Network) architecture, the agent can choose to acquire a high-level observation yt that will provide more relevant information in addition to a low-level observation xt – Léon, sec. 4.1, first paragraph); generate, based at least in part on the representation of the current state of the environment at the time step, a goal vector that defines, in [a] latent state space, an objective to be accomplished as a result of actions performed by the agent in the environment (structure of BONN includes an option model [manager neural network subsystem] that uses observations xt and yt to compute a new option denoted ot [goal vector] as a vector in a latent space – Léon, sec. 4.1, second paragraph; see also Fig. 2, top row denoted "option model"; options are the mechanism by which sub-tasks [objectives] are modeled, giving rise to the question of how to select actions to apply in the environment based on the chosen option – id. at sec. 1, second paragraph); and a worker neural network subsystem that is configured to, at each of the plurality of time steps: generate a respective action score for each action in the predetermined set of actions based at least in part on the goal vector for the time step (an actor model [worker neural network subsystem] updates an actor state ht and computes a next action at; the next action at is drawn from the distribution softmax(fact(ht)) [action score; predetermined set = entire distribution] – Léon, section 4.1, subsection entitled "Actor Model"; see also Fig. 2, bottom row denoted "Actor Model");
and select an action from the predetermined set of actions to be performed by the agent at the time step using the action scores (an actor model updates an actor state ht and computes a next action at; the next action at is drawn [selected] from the distribution softmax(fact(ht)) [action score; predetermined set = entire distribution] – Léon, section 4.1, subsection entitled "Actor Model"); and a training subsystem that is configured to perform operations comprising: determining a respective reward for each time step of the plurality of time steps (Léon Fig. 1 shows that the agent receives a reward rt at each timestep) …."

Léon appears not to disclose explicitly the further limitations of the claim. However, Dayan discloses "generat[ing] a latent representation, in a latent space, of a current state of the environment (in a maze task in a feudal reinforcement learning system, a grid is split up into successively finer grains and managers are assigned to separable parts of the maze at each level – Dayan, sec. 3, first paragraph; see also Fig. 1 [the 4-square grid of the maze [environment], for instance, is a latent representation of the 16-square grid]) …; [and] generat[ing], based at least in part on [a] latent representation of the current state of the environment at the time step, a[] goal vector (in a maze task in a feudal reinforcement learning system, a goal is set at the lowest level of granularity of the grid, and the feudal system learns to navigate to the goal by learning not to try impossible actions or moving to another level at inappropriate places; if the system decides at a high level that the goal is in one part of the maze, then it has the capacity to specify large scale actions at that level to take it there – Dayan, p. 275, second full paragraph and Fig. 1 [for instance, in the example of Fig. 1, if the system quickly discovers that the goal is in the southwest quadrant of the maze, it can specify the goal vector as 1-(1,1) in the 4-square latent space and explore the 16- and 32-square grids for the goal])."

Dayan and the instant application both relate to hierarchical reinforcement learning and are analogous. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Léon to generate a vector representing a goal in a latent space representing a state of the environment at a given time, as disclosed by Dayan, and an ordinary artisan could reasonably expect to have done so successfully. Doing so would ensure that the system is aware of the goal to which it is directed and increase efficiency by allowing the system to explore how to achieve that goal in a lower-dimensional space than it would if it had to explore the full space. See Dayan, sec. 1 (indicating that high-level managers can send agents directly to a region of the state space with a high probability of reward without forcing it to explore in detail).

Neither Léon nor Dayan appears to disclose explicitly the further limitations of the claim.
However, Ghesu discloses "for one or more time steps: determining the reward for the time step based at least in part on a difference in direction between: (i) a vector representing a change in the … representation of the state of the environment from a preceding time step to the time step, and (ii) the goal vector for the preceding time step (each action of the agent in an image environment is simplified to a single pixel move: xt+1 ← xt ± 1 and yt+1 ← yt ± 1; reward system is based on the change in relative position at state st: (xt, yt) with respect to the target position of the landmark starget: (xtarget, ytarget); for a move in the correct direction, a positive reward proportional to the target-distance reduction is given, whereas a move in the wrong direction is punished by a negative reward of equal magnitude; the reward at time t is given by rt = dist(st, starget) – dist(st+1, starget) – Ghesu, col. 15, ll. 8-40 [note that st represents the change in the representation of the environment from time t – 1 to t because the state at the previous time step must be an 8-neighbor of the state at the current time step; note also that the goal vector starget does not change from timestep to timestep, so the reward is implicitly based on the value of starget for time t – 1; note finally that the distance between st and starget is based on the difference in direction between the two]); and training the worker neural network subsystem on the rewards using reinforcement learning techniques (system for artificial agent training for intelligent landmark identification includes evaluating a set of training images; a reward value indicative of a proximity of a current state space to a pre-defined landmark target of each training image is calculated – Ghesu, col. 2, ll. 37-60)."

Ghesu and the instant application both relate to reinforcement learning and are analogous. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Léon and Dayan to determine the reward based on a difference in direction between vectors, as disclosed by Ghesu, and an ordinary artisan could reasonably expect to have done so successfully. Doing so would allow the system to train more efficiently by moving the agent in the direction of the goal. See Ghesu, col. 2, ll. 37-60.

Claims 19 and 22 are non-transitory computer storage media claims corresponding to system claim 1 and are rejected for the same reasons as given in the rejection of that claim. Similarly, claims 20 and 21 are method claims corresponding to system claim 1 and are rejected for the same reasons as given in the rejection of that claim.

Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over Léon in view of Dayan and Ghesu and further in view of Dai et al., "Fudan-Huawei at MediaEval 2015: Detecting Violent Scenes and Affective Impact in Movies with Deep Learning," in MediaEval (2015) ("Dai").

Regarding claim 2, the rejection of claim 1 is incorporated. Léon further discloses a "goal vector," as shown in the rejection of claim 1. Neither Léon, Dayan, nor Ghesu appears to disclose explicitly the further limitations of the claim.
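
One natural reading of the "difference in direction" reward mapped from Ghesu above is cosine similarity between the latent state change and the preceding goal vector. A minimal sketch under that assumption (the cosine reading is an interpretation offered for illustration, not language taken from the claims or from Ghesu):

```python
import numpy as np

def intrinsic_reward(z_prev, z_curr, g_prev, eps=1e-8):
    """Directional reward: cosine between the latent state change
    (z_curr - z_prev) and the goal vector from the preceding step."""
    delta = z_curr - z_prev
    denom = np.linalg.norm(delta) * np.linalg.norm(g_prev) + eps
    return float(delta @ g_prev / denom)

# A move exactly along the goal direction scores ~1; an opposite move ~-1.
g = np.array([1.0, 0.0])
print(intrinsic_reward(np.zeros(2), np.array([0.5, 0.0]), g))   # ~ 1.0
print(intrinsic_reward(np.zeros(2), np.array([-0.5, 0.0]), g))  # ~ -1.0
```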
However, Dai discloses that "at each of the plurality of time steps, the manager neural network subsystem is further configured to: update the goal vector for the time step by pooling the goal vector for the time step with goal[] vectors for one or more preceding time steps (LSTM model trained with another video dataset is adopted and the average [pooled] output from all the time-steps of the last LSTM layers is used as the feature [final vector] – Dai, sec. 1.1, last paragraph before "Conventional features")."

Dai and the instant application both relate to neural networks and are analogous. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Léon, Dayan, and Ghesu to pool the vectors for multiple previous time steps to generate a final vector, as disclosed by Dai, and an ordinary artisan could reasonably have expected to do so successfully. Doing so would reduce computational load and allow the system to generate an output under time constraints. See Dai, sec. 1.1, last paragraph before "Conventional features".

Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Léon in view of Dayan and Ghesu and further in view of Heess et al., "Learning and Transfer of Modulated Locomotor Controllers," in arXiv preprint arXiv:1610.05182 (2016) ("Heess").

Regarding claim 3, the rejection of claim 1 is incorporated. Dayan further discloses a "latent representation", as shown in the rejection of claim 1. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Léon/Ghesu to provide a latent representation, as disclosed by Dayan, for substantially the same reasons as given in the rejection of claim 1. Neither Léon, Dayan, nor Ghesu appears to disclose explicitly the further limitations of the claim.

However, Heess discloses that "at each of the plurality of time steps, generating the goal vector, comprises: processing the … representation using a goal recurrent neural network, wherein the goal recurrent neural network is configured to receive the … representation and to process the … representation in accordance with a hidden state of the goal recurrent neural network to generate the goal vector and to update the hidden state of the goal recurrent neural network (recurrent high-level controller [goal recurrent neural network] processes the observation [representation, denoted ot-1 in the figure] and generates a control signal [goal vector, denoted ct-1 in the figure] in accordance with the internal state of the system at the previous time step [denoted, say, by zt-2 in the figure] and also updates the internal state [denoted by zt-1] – Heess, Fig. 1; see also sec. 2, p. 3)."

Heess and the instant application both relate to reinforcement learning and are analogous. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Léon, Dayan, and Ghesu to process the representation using a goal recurrent neural network, as disclosed by Heess, and an ordinary artisan could reasonably expect to have done so successfully. Doing so would allow the system to take abstract high-level actions that move the agent closer to the goal. See Heess, Fig. 1 and four bullet points below.
Claims 4 and 7 are rejected under 35 U.S.C. 103 as being unpatentable over Léon in view of Dayan and Ghesu and further in view of Schaul et al., "Universal Value Function Approximators," in Intl. Conf. Machine Learning 1312-1320 (2015) ("Schaul").

Regarding claim 4, neither Léon, Dayan, nor Ghesu appears to disclose explicitly the further limitations of the claim. However, Schaul discloses that "at each of the plurality of time steps, generating the respective action score for each action in the predetermined set of actions comprises: generating a respective action embedding vector in an embedding space for each action in the predetermined set of actions (classical grid-world with 4 rooms and an action space with 4 cardinal directions is used [action embedding vectors = vectors encoding the four cardinal directions] – Schaul, penultimate paragraph before sec. 4.1); projecting the goal vector for the time step to the embedding space to generate a goal embedding vector (data are viewed as a sparse table of values that contains one row for each observed state s and one column for each observed goal g, and find a low-rank factorization of the table into state embeddings and goal embeddings – Schaul, penultimate paragraph before section 2); and modulating the respective … embedding vector … by the goal embedding vector to generate the respective action score for each action in the predetermined set of actions (one possible function approximator simply concatenates state and goal together as a joint input; the mapping from concatenated input to regression target can then be dealt with by a non-linear function approximator such as a multi-layer perceptron – Schaul, sec. 3, second paragraph; the result is an approximation of the action-value function Q(s, a, g) [action score for each action] – id. at Algorithm 1, last line ["modulate" is being interpreted here to mean "combine"])."

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Léon, Dayan, and Ghesu to create and combine two sets of embedding vectors to generate an action score, as disclosed by Schaul, and an ordinary artisan could reasonably have expected to do so successfully. One motivation for doing so would be to perform learning faster than a naïve approach that does not include embedding. See Schaul, penultimate paragraph before section 2.

Regarding claim 7, neither Léon, Dayan, nor Ghesu appears to disclose explicitly the further limitations of the claim. However, Schaul discloses that "the goal vector has a higher dimensionality than the goal embedding vector (all the values are laid out in a data matrix with one row for each observed state s and one column for each observed goal g, and that matrix is factorized, finding a low-rank approximation that defines n-dimensional embedding spaces for both states and goals ["low-rank approximation" implies that n is less than the goal space dimensionality] – Schaul, section 3.1, bullet point 1)."

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Léon, Dayan, and Ghesu to reduce the dimensionality of the goal space, as disclosed by Schaul, and an ordinary artisan could reasonably have expected to do so successfully. One motivation for doing so would be to perform learning faster than a naïve approach without embedding. See Schaul, penultimate paragraph before section 2.
Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Léon in view of Dayan and Ghesu and further in view of Graepel et al. (US 20180032864) ("Graepel").

Regarding claim 5, neither Léon, Dayan, nor Ghesu appears to disclose explicitly the further limitations of the claim. However, Graepel discloses that "selecting the action comprises selecting the action having a highest action score (system selects the action represented by the outgoing edge having the highest action score as the action to be performed by the agent in response to the current observation – Graepel, paragraph 81)."

Graepel and the instant application both relate to neural networks and are analogous. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Léon, Dayan, and Ghesu to select the action having the highest action score, as disclosed by Graepel, and an ordinary artisan could reasonably have expected to do so successfully. One motivation for doing so would be to maximize the likelihood that the agent will complete the objectives if the action is performed. See Graepel, paragraph 49.

Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over Léon in view of Dayan, Ghesu, and Schaul and further in view of Shi et al., "Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting," in Advances in Neural Info. Processing Systems 802-810 (2015) ("Shi").

Regarding claim 6, the rejection of claim 4 is incorporated. Schaul further discloses an "action score … neural network (Schaul sec. 2 discloses a network that calculates an action-value function [action score])" and "action embedding vectors (see mapping of this element in the rejection of claim 4 supra) …." It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Léon, Dayan, and Ghesu to employ action score networks and action embedding vectors, as disclosed by Schaul, for substantially the same reasons as given in the rejection of claim 4. Neither Léon, Dayan, Ghesu, nor Schaul appears to disclose explicitly the further limitations of the claim.

However, Shi discloses that "generating the respective action embedding vector in the embedding space for each action in the predetermined set of actions comprises: processing a representation of the current state of the environment using a[] … recurrent neural network, in accordance with a hidden state of the … recurrent neural network, to generate the [output] vectors and to update the hidden state of the … recurrent neural network (LSTM characterized by the following equations, where it is an input gate [representation], ft is a forget gate, ct is a cell status, ot is an output gate [output vector], and ht is a hidden state, which is updated at each time step by virtue of the changes to ot and ct at each time step [current hidden state = ht-1]: it = σ(Wxi xt + Whi ht-1 + Wci ∘ ct-1 + bi); ft = σ(Wxf xt + Whf ht-1 + Wcf ∘ ct-1 + bf); ct = ft ∘ ct-1 + it ∘ tanh(Wxc xt + Whc ht-1 + bc); ot = σ(Wxo xt + Who ht-1 + Wco ∘ ct + bo); ht = ot ∘ tanh(ct) – Shi, sec. 2.2)."

Shi and the instant application both relate to neural networks and are analogous. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Léon, Dayan, Ghesu, and Schaul to process a representation of the input using a recurrent neural network to generate an output, as disclosed by Shi, and an ordinary artisan could reasonably have expected to do so successfully.
Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over Léon in view of Dayan, Ghesu, and Schaul and further in view of Mei et al. (US 20150356199) (“Mei”). Regarding claim 8, the rejection of claim 7 is incorporated. Léon further discloses a “goal vector”, as shown above in the rejection of claim 1. Schaul discloses a “goal embedding vector”, as shown in the rejection of claim 4. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Léon, Dayan, and Ghesu to employ a goal embedding vector, as disclosed by Schaul, for substantially the same reasons as given in the rejection of claim 7. Neither Léon, Dayan, Schaul, nor Ghesu appears to disclose explicitly the further limitations of the claim. However, Mei discloses that “the dimensionality of [one] … vector is at least ten times higher than the dimensionality of [another] … vector (click-through-based cross-view learning techniques can reduce feature dimension by several orders of magnitude (e.g., from thousands to tens) – Mei, paragraph 10).” Mei and the instant application both relate to machine learning and are analogous. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Léon, Dayan, Schaul, and Ghesu to reduce the dimensionality of the data by at least ten times, as disclosed by Mei, and an ordinary artisan could reasonably have expected to do so successfully. One motivation for doing so would be to produce memory savings. See Mei, paragraph 10. Claims 9-11 are rejected under 35 U.S.C. 103 as being unpatentable over Léon in view of Dayan and Ghesu and further in view of Kulkarni et al., “Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation,” in Advances in Neural Info. Processing Systems 3675-3683 (2016) (“Kulkarni”). Regarding claim 9, neither Léon, Dayan, nor Ghesu appears to disclose explicitly the further limitations of the claim. However, Kulkarni discloses that “determining the respective reward for each time step of the plurality of time steps comprises, for one or more time steps: receiving an external reward for the time step as a result of the agent performing selected actions (objective function for a controller is to maximize cumulative intrinsic reward and the objective of a meta-controller is to optimize the cumulative extrinsic reward [external reward], where the cumulative extrinsic reward is a function of the agent being in a state after taking an action [i.e., as a result of the agent performing selected actions], the cumulative intrinsic reward is a function of a goal g, and the discounting in the cumulative extrinsic reward is over sequences of goals; objective of the agent is to maximize the extrinsic reward function over long periods of time – Kulkarni, pp. 3-4, sec.
3, up through “Temporal Abstractions”); and determining the reward for the time step based at least in part on the external reward for the time step (objective function for a controller is to maximize cumulative intrinsic reward and the objective of a meta-controller is to optimize the cumulative extrinsic reward [external reward], where the cumulative extrinsic reward is a function of the agent being in a state after taking an action, the cumulative intrinsic reward is a function of a goal g, and the discounting in the cumulative extrinsic reward is over sequences of goals; objective of the agent is to maximize the extrinsic reward function over long periods of time [i.e., the cumulative extrinsic reward [reward for the time step] is based in part on the instant extrinsic reward [external reward for the time step]] – Kulkarni, pp. 3-4, sec. 3, up through “Temporal Abstractions”).” Kulkarni and the instant application both relate to machine learning and are analogous. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Léon, Dayan, and Ghesu to employ a reward for a time step based on an external reward, as disclosed by Kulkarni, and an ordinary artisan could reasonably have expected to do so successfully. Doing so would allow the system to take the optimal action for a given time step, thereby helping the agent achieve the goal. See Kulkarni, pp. 3-4, sec. 3. Regarding claim 10, neither Léon, Dayan, nor Ghesu appears to disclose explicitly the further limitations of the claim. However, Kulkarni discloses that “training the worker neural network subsystem on the rewards using reinforcement learning techniques comprises: training the worker neural network subsystem to generate action scores that maximize a time discounted combination of rewards (objective function for a controller is to maximize cumulative intrinsic reward and the objective of a meta-controller is to optimize the cumulative extrinsic reward, where the cumulative extrinsic reward is a function of the agent being in a state after taking an action, the cumulative intrinsic reward is a function of a goal g, and the discounting in the cumulative extrinsic reward is over sequences of goals; objective of the agent is to maximize the extrinsic reward function over long periods of time; policy over actions and policy over goals are produced by estimating action-value functions [action scores] – Kulkarni, pp. 3-4, sec. 3, up through “Temporal Abstractions”; see also Fig. 1).” It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Léon, Dayan, and Ghesu to generate a combination of reward functions, one of which is extrinsic and another of which is intrinsic, as disclosed by Kulkarni, and an ordinary artisan could reasonably have expected to do so successfully. One motivation for doing so would be to allow the agent to explore behavior for its own sake, thereby helping the agent solve tasks posed by the environment. See Kulkarni, abstract.
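To make the extrinsic/intrinsic split in the claim 9 and claim 10 mappings concrete, here is a minimal sketch of the two reward streams and of a time-discounted combination. Everything in it is an assumption made for illustration: the equality test for goal attainment, the mixing weight alpha, and the toy episode are stand-ins, not Kulkarni's actual formulation.

```python
def discounted_return(rewards, gamma=0.99):
    """Time-discounted combination of per-step rewards (cf. claim 10)."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def intrinsic_reward(state, goal):
    """Internal-critic-style signal: positive reward iff the goal is reached."""
    return 1.0 if state == goal else 0.0

def step_reward(external_reward, state, goal, alpha=0.5):
    """Reward for a time step based at least in part on the external reward
    (cf. claim 9); alpha is an assumed mixing weight."""
    return external_reward + alpha * intrinsic_reward(state, goal)

# A three-step toy episode: (external reward, state reached, active goal).
episode = [(0.0, "room", "door"), (0.0, "door", "door"), (1.0, "exit", "door")]
rewards = [step_reward(r, s, g) for (r, s, g) in episode]
print(discounted_return(rewards))
```

In Kulkarni's actual scheme the two returns drive two different learners (the controller maximizes the intrinsic return, the meta-controller the extrinsic return); the literal sum above is only a way to mirror the claim's “based at least in part” language in a few lines.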
Regarding claim 11, neither Léon, Dayan, nor Ghesu appears to disclose explicitly the further limitations of the claim. However, Kulkarni discloses that “the operations performed by the training subsystem further comprise: training the manager neural network subsystem to generate goal vectors that result in action scores that encourage selection of actions that increase the external rewards received as a result of the agent performing the selected actions (meta-controller receives a state and chooses a goal in the set of all possible goals; goal remains in place for the next few time steps until either it is achieved or a terminal state is reached – Kulkarni, p. 3, paragraph titled “Temporal Abstractions”; meta-controller looks at the raw states and produces a policy over goals by estimating an action-value function to maximize expected future extrinsic reward, controller takes in states and the current goal and produces a policy over actions by estimating a second action-value function [action score] to solve the predicted goal by maximizing expected future intrinsic reward; internal critic provides a positive reward to the controller iff the goal is reached – id. at Fig. 1 caption [so that an action will be chosen if it results in a positive, i.e., increased, reward]).” It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Léon, Dayan, and Ghesu to generate goal vectors that encourage the selection of actions that move the agent closer to the goal, as disclosed by Kulkarni, and an ordinary artisan could reasonably have expected to do so successfully. One motivation for doing so would be to help the agent solve tasks posed by the environment. See Kulkarni, abstract. Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Léon in view of Dayan and Ghesu and further in view of Bichler (US 20170200078) (“Bichler”). Regarding claim 12, neither Léon, Dayan, nor Ghesu appears to disclose explicitly the further limitations of the claim. However, Bichler discloses that “generating the latent representation, in the latent space, of the current state of the environment at the time step comprises: processing an observation characterizing the current state of the environment using a convolutional neural network (convolutional neural networks are feedforward neural networks that allow for a learning of intermediate representations of objects that are smaller and can be generalized for similar objects – Bichler, paragraph 7).” Bichler and the instant application both relate to neural networks and are analogous. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Léon, Dayan, and Ghesu to generate an intermediate representation using a convolutional neural network, as disclosed by Bichler, and an ordinary artisan could reasonably have expected to do so successfully. One motivation for doing so would be to facilitate object recognition by representing objects in a way that can be generalized for similar objects. See Bichler, paragraph 7.
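Since the claim 12 mapping turns on producing a latent representation from an observation with a convolutional network, a minimal single-filter sketch may help fix ideas. The observation size, the kernel, the ReLU, and the flattening step are all assumptions of the sketch rather than anything taken from Bichler or the claims.

```python
import numpy as np

def conv2d_valid(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Single-channel 'valid' convolution (cross-correlation, as in CNNs)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

rng = np.random.default_rng(0)
observation = rng.normal(size=(8, 8))   # observation characterizing the current state
kernel = rng.normal(size=(3, 3))        # one learned filter, randomly initialized here
feature_map = np.maximum(conv2d_valid(observation, kernel), 0.0)  # ReLU nonlinearity
latent = feature_map.reshape(-1)        # flattened latent representation of the state
```

A practical encoder would stack many filters and layers; one filter is enough to show how a smaller, shareable intermediate representation falls out of the observation.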
Claims 13, 16, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Léon in view of Dayan, Ghesu, and Heess and further in view of Koutník et al., “A Clockwork RNN,” in arXiv preprint arXiv:1402.3511 (2014) (“Koutník”), Williams et al., “End-to-end LSTM-based Dialog Control Optimized with Supervised and Reinforcement Learning,” in arXiv preprint arXiv:1606.01269 (2016) (“Williams”), and Yu et al., “Multi-Scale Context Aggregation by Dilated Convolutions,” in ICLR 2016 (2016) (“Yu”). Regarding claim 13, neither Léon, Dayan, Ghesu, nor Heess appears to disclose explicitly the further limitations of the claim. However, Koutník discloses that “the … neural network is configured to maintain an internal state that is partitioned into r sub-states, wherein r is an integer greater than one (neurons in hidden layer of clockwork RNN are partitioned into g [r] modules of size k, each of which is assigned a clock period T_n – Koutník, section 3, first paragraph [the use of the plural “modules” suggests that g > 1]), and … the … neural network is configured to, at each time step in the plurality of time steps: receive a network input for the time step (input weight matrix W_I is partitioned into g block-rows – Koutník, section 3, bottom of right-hand column, especially equation 3); select a sub-state from the r sub-states (at each CW-RNN time step t, only the output of modules i [sub-state] that satisfy (t MOD T_i) = 0 are executed – Koutník, sec. 3, last full paragraph on p. 3); and process current values of the selected sub-state and the network input for the time step using a[] … neural network to update the current values of the selected sub-state and to generate a network output for the time step in accordance with current values of a set of … network parameters (at each forward-pass time step, only the block-rows of the input weight matrix W_I and of W_H, a block-upper-triangular hidden weight matrix [network parameters], that correspond to the executed modules are used for evaluation, and the corresponding parts of an output vector y_H are updated; the standard RNN state update is governed by y_H(t) = f_H(W_H · y(t-1) + W_I · x(t)), and, where W_O is the output weight matrix, the output is governed by y_O(t) = f_O(W_O · y_H(t)) [so in the CW-RNN, only the state and output corresponding to module i are updated] – Koutník, sec. 3, pp. 3-4).” Koutník and the instant application both relate to neural networks and are analogous. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Léon, Dayan, Ghesu, and Heess to divide the states of the network into sub-states and manipulate the sub-states, as disclosed by Koutník, and an ordinary artisan could reasonably have expected to do so successfully. Doing so would allow the system to capture long-term dependencies by allowing different parts of the network to run at different speeds. See Koutník, sec. 1, second paragraph. Neither Léon, Dayan, Koutník, Ghesu, nor Heess appears to disclose explicitly the further limitations of the claim. However, Williams discloses that “the goal recurrent neural network is a … long short-term memory (LSTM) neural network (in a model for end-to-end learning of task-oriented dialog systems, the main component is an LSTM, which maps from raw dialog history directly to a distribution over system actions – Williams, abstract; in task-oriented dialog systems, state tracking typically consists of tracking the user’s goal – id. at sec.
3.1, first paragraph).” Williams and the instant application both relate to neural networks and are analogous. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Léon, Dayan, Ghesu, Heess, and Koutník to use an LSTM network as the goal RNN, as disclosed by Williams, and an ordinary artisan could reasonably have expected to do so successfully. One motivation for doing so would be to infer a representation of the history automatically. See Williams, abstract. Neither Léon, Dayan, Koutník, Ghesu, Heess, nor Williams appears to disclose explicitly the further limitations of the claim. However, Yu teaches a “dilated … neural network (module uses dilated convolutions to systematically aggregate multi-scale contextual information without losing resolution – Yu, abstract)”. Yu and the instant application both relate to neural networks and are analogous. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Léon, Dayan, Koutník, Williams, Ghesu, and Heess to make the neural network a dilated network, as disclosed by Yu, and an ordinary artisan could reasonably have expected to do so successfully. One motivation for doing so would be to aggregate information systematically without losing resolution. See Yu, abstract. Regarding claim 16, neither Léon, Dayan, Williams, Yu, Ghesu, nor Heess appears to disclose explicitly the further limitations of the claim. However, Koutník discloses that “the time steps in the plurality of time steps are indexed starting from 1 for the first time step in the plurality of time steps to T for the last time step in the plurality of time steps (each of the modules is assigned a clock period T_n ∈ {T_1, …, T_g} – Koutník, sec. 3, first paragraph [g of Koutník = T of the claim]), wherein each sub-state is assigned an index ranging from 1 to r (neurons of the hidden layer are partitioned into g modules of size k – Koutník, sec. 3, first paragraph [g = r]), and wherein selecting a sub-state from the r sub-states comprises: selecting the sub-state having an index that is equal to the index of the time step modulo r (at each CW-RNN time step t, only the output of modules i that satisfy (t MOD T_i) = 0 are executed – Koutník, sec. 3).” It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Léon, Dayan, Williams, Yu, Ghesu, and Heess to select a sub-state having an index equal to the index of the time step modulo an integer, as disclosed by Koutník, and an ordinary artisan could reasonably have expected to do so successfully. Doing so would allow the system to capture long-term dependencies by allowing different parts of the network to run at different speeds. See Koutník, sec. 1, second paragraph.
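The sub-state bookkeeping mapped to Koutník in claims 13 and 16 can be pictured with a short sketch: the internal state is partitioned into r sub-states, and at step t only the sub-state indexed t modulo r is read, updated, and written back. The tanh update below is a stand-in for the LSTM update recited in claim 18, and all sizes, names, and initializations are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
r, sub_dim, in_dim = 4, 8, 5                 # r > 1 sub-states; sizes are assumed
sub_states = np.zeros((r, sub_dim))          # internal state partitioned into r sub-states
W_h = rng.normal(scale=0.1, size=(sub_dim, sub_dim))
W_i = rng.normal(scale=0.1, size=(sub_dim, in_dim))

def step(t, x_t):
    """Select the sub-state with index t modulo r (cf. claim 16), use it as the
    working internal state, update only it, and emit the network output."""
    i = t % r
    h = np.tanh(W_h @ sub_states[i] + W_i @ x_t)  # stand-in for the LSTM update of claim 18
    sub_states[i] = h                             # only the selected sub-state is written back
    return h

outputs = [step(t, rng.normal(size=in_dim)) for t in range(1, 9)]  # steps indexed from 1
pooled = sum(outputs[-3:])  # summing recent outputs, in the spirit of claims 14-15
```

Note the design difference the rejection glosses over: Koutník's CW-RNN executes module i whenever (t MOD T_i) = 0 with distinct clock periods, whereas the claim's modulo-r indexing is a strict round-robin; the sketch follows the round-robin reading of the claim.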
Regarding claim 18, neither Léon, Dayan, Williams, Yu, Ghesu, nor Heess appears to disclose explicitly the further limitations of the claim. However, Koutník discloses that “processing current values of the selected sub-state and the network input for the time step using an LSTM neural network to update the current values of the selected sub-state and to generate a network output for the time step in accordance with current values of a set of LSTM network parameters comprises: setting an internal state of the LSTM neural network to the current values of the selected sub-state for the processing of the network input at the time step (at each forward-pass time step, only the block-rows of the hidden layer weight matrix W_H that correspond to the executed modules are used for evaluation, such that each block-row W_H^i = W_H^i for (t MOD T_i) = 0, and 0 otherwise [so that, when the internal state y(t-1) is multiplied by W_H, only the values corresponding to the sub-state are considered] – Koutník, section 3, pp. 3-4).” It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Léon, Dayan, Williams, Yu, Ghesu, and Heess to set an internal state of the network to the current values of the sub-state, as disclosed by Koutník, and an ordinary artisan could reasonably have expected to do so successfully. Doing so would allow the system to capture long-term dependencies by allowing different parts of the network to run at different speeds. See Koutník, sec. 1, second paragraph. Claims 14 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Léon in view of Dayan, Ghesu, Heess, Koutník, Williams, and Yu and further in view of Dai. Regarding claim 14, neither Léon, Dayan, Koutník, Yu, Williams, Heess, nor Ghesu appears to disclose explicitly the further limitations of the claim. However, Dai discloses that “the dilated LSTM neural network is further configured to, for each of the time steps: pool the network output for the time step and the network outputs for up to a predetermined number of preceding time steps to generate a final network output for the time step (LSTM model trained with another video dataset is adopted and the average [pooled] output from all the time-steps of the last LSTM layers is used as the feature [final output] – Dai, sec. 1.1, last paragraph before “Conventional features”).” Dai and the instant application both relate to neural networks and are analogous. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Léon, Dayan, Koutník, Williams, Yu, Heess, and Ghesu to pool the network outputs of previous time steps, as disclosed by Dai, and an ordinary artisan could reasonably have expected to do so successfully. Doing so would reduce computational load and allow the system to generate an output under time constraints. See Dai, sec. 1.1, last paragraph before “Conventional features”. Regarding claim 17, neither Léon, Dayan, Koutník, Williams, Yu, Ghesu, nor Heess appears to disclose explicitly the further limitations of the claim. However, Dai discloses that “the LSTM neural network comprises a plurality of LSTM layers (LSTM model trained with another video dataset is adopted and the average output from all the time-steps of the last LSTM layers is used as the feature – Dai, sec.
1.1, last paragraph before “Conventional features” [the plural “layers” implies that the network has multiple layers]).” It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Léon, Dayan, Koutník, Williams, Yu, Ghesu, and Heess to introduce multiple LSTM layers, as disclosed by Dai, and an ordinary artisan could reasonably have expected to do so successfully. Doing so would allow the system to capture time dependencies more robustly than systems that use only a single LSTM layer. See Dai, sec. 1.1, last paragraph before “Conventional features”. Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over Léon in view of Dayan, Koutník, Williams, Yu, Heess, Ghesu, and Dai and further in view of Cox (US 20180247199) (“Cox”). Regarding claim 15, neither Léon, Dayan, Koutník, Williams, Yu, Dai, Heess, nor Ghesu appears to disclose explicitly the further limitations of the claim. However, Cox discloses that “pooling the network outputs comprises summing the network outputs (output sequence at each time step may be linearly summed to obtain a first sum – Cox, paragraph 86).” Cox and the instant application both relate to neural networks and are analogous. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Léon, Dayan, Williams, Koutník, Yu, Dai, Heess, and Ghesu to sum the network outputs, as disclosed by Cox, and an ordinary artisan could reasonably have expected to do so successfully. Doing so would reduce computational burden by allowing the outputs of multiple time steps to be combined into a single output. See Cox, paragraph 86.
Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969). A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The filing of a terminal disclaimer by itself is not a complete reply to a nonstatutory double patenting (NSDP) rejection. A complete reply requires that the terminal disclaimer be accompanied by a reply requesting reconsideration of the prior Office action. Even where the NSDP rejection is provisional, the reply must be complete. See MPEP § 804, subsection I.B.1. For a reply to a non-final Office action, see 37 CFR 1.111(a). For a reply to a final Office action, see 37 CFR 1.113(c). A request for reconsideration, while not provided for in 37 CFR 1.113(c), may be filed after final for consideration. See MPEP §§ 706.07(e) and 714.13. The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The actual filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/apply/applying-online/eterminal-disclaimer. Claims 1-16 and 18-22 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1-16 and 19-20 of U.S. Patent No. 11,537,887. Although the claims at issue are not identical, they are not patentably distinct from each other because, with the exception of minor textual differences and different organization of claim dependencies, all of the limitations of the instant application are found in their counterparts in the reference patent. A claim comparison chart follows.
Instant Application | Reference Patent
1.
A system for selecting actions to be performed by an agent that interacts with an environment by performing actions from a predetermined set of actions, the system comprising one or more computers and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to implement: a manager neural network subsystem that is configured to, at each of a plurality of time steps: generate a latent representation, in a latent space, of a current state of the environment at the time step; generate, based at least in part on the latent representation of the current state of the environment at the time step, a goal vector that defines, in the latent state space, an objective to be accomplished as a result of actions performed by the agent in the environment; and a worker neural network subsystem that is configured to, at each of the plurality of time steps: generate a respective action score for each action in the predetermined set of actions based at least in part on the goal vector for the time step; and select an action from the predetermined set of actions to be performed by the agent at the time step using the action scores; and a training subsystem that is configured to perform operations comprising: determining a respective reward for each time step of the plurality of time steps, comprising, for one or more time steps: determining the reward for the time step based at least in part on a difference in direction between: (i) a vector representing a change in the latent representation of the state of the environment from a preceding time step to the time step, and (ii) the goal vector for the preceding time step; and training the worker neural network subsystem on the rewards using reinforcement learning techniques. 2. The system of claim 1, wherein at each of the plurality of time steps, the manager neural network subsystem is further configured to: update the goal vector for the time step by pooling the goal vector for the time step with goals vectors for one or more preceding time steps. 1. 
A system for selecting actions to be performed by an agent that interacts with an environment by performing actions from a predetermined set of actions, the system comprising one or more computers and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to implement: a manager neural network subsystem that is configured to, at each of a plurality of time steps: generate a latent representation, in a latent space, of a current state of the environment at the time step; generate, based at least in part on the latent representation of the current state of the environment at the time step, an initial goal vector that defines, in the latent state space, an objective to be accomplished as a result of actions performed by the agent in the environment; and pool the initial goal vector for the time step and initial goal vectors for one or more preceding time steps to generate a final goal vector for the time step, comprising: generating the final goal vector for the time step by element-wise combining the initial goal vector for the time step and the initial goal vectors for one or more preceding time steps; a worker neural network subsystem that is configured to, at each of the plurality of time steps: generate a respective action score for each action in the predetermined set of actions based at least in part on the final goal vector for the time step; and select an action from the predetermined set of actions to be performed by the agent at the time step using the action scores. 16. The system of claim 1, further comprising a training subsystem that is configured to perform operations comprising: determining a respective reward for each time step of the plurality of time steps, comprising, for one or more time steps: determining the reward for the time step based at least in part on a difference in direction between: (i) a vector representing a change in the latent representation of the state of the environment from a preceding time step to the time step, and (ii) the initial goal vector for the preceding time step; and training the worker neural network subsystem on the rewards using reinforcement learning techniques. 3. The system of claim 1, wherein at each of the plurality of time steps, generating the goal vector, comprises: processing the latent representation using a goal recurrent neural network, wherein the goal recurrent neural network is configured to receive the latent representation and to process the latent representation in accordance with a hidden state of the goal recurrent neural network to generate the goal vector and to update the hidden state of the goal recurrent neural network. 2. The system of claim 1, wherein generating the initial goal vector, comprises: processing the latent representation using a goal recurrent neural network, wherein the goal recurrent neural network is configured to receive the latent representation and to process the latent representation in accordance with a hidden state of the goal recurrent neural network to generate the initial goal vector and to update the hidden state of the goal recurrent neural network. 4. 
The system of claim 1, wherein at each of the plurality of time steps, generating the respective action score for each action in the predetermined set of actions comprises: generating a respective action embedding vector in an embedding space for each action in the predetermined set of actions; projecting the goal vector for the time step to the embedding space to generate a goal embedding vector; and modulating the respective action embedding vector for each action by the goal embedding vector to generate the respective action score for each action in the predetermined set of actions. 8. The system of claim 1, wherein generating the respective action score for each action in the predetermined set of actions comprises: generating a respective action embedding vector in an embedding space for each action in the predetermined set of actions; projecting the final goal vector for the time step to the embedding space to generate a goal embedding vector; and modulating the respective action embedding vector for each action by the goal embedding vector to generate the respective action score for each action in the predetermined set of actions. 5. The system of claim 1, wherein selecting the action comprises selecting the action having a highest action score. 12. The system of claim 1, wherein selecting the action comprises selecting the action having a highest action score. 6. The system of claim 4, wherein generating the respective action embedding vector in the embedding space for each action in the predetermined set of actions comprises: processing a representation of the current state of the environment using an action score recurrent neural network, in accordance with a hidden state of the action score recurrent neural network, to generate the action embedding vectors and to update the hidden state of the action score recurrent neural network. 9. The system of claim 8, wherein generating the respective action embedding vector in the embedding space for each action in the predetermined set of actions comprises: processing a representation of the current state of the environment using an action score recurrent neural network, in accordance with a hidden state of the action score recurrent neural network, to generate the action embedding vectors and to update the hidden state of the action score recurrent neural network. 7. The system of claim 4, wherein the goal vector has a higher dimensionality than the goal embedding vector. 10. The system of claim 8, wherein the final goal vector has a higher dimensionality than the goal embedding vector. 8. The system of claim 7, wherein the dimensionality of the goal vector is at least ten times higher than the dimensionality of the goal embedding vector. 11. The system of claim 10, wherein the dimensionality of the final goal vector is at least ten times higher than the dimensionality of the goal embedding vector. 9. The system of claim 1, wherein determining the respective reward for each time step of the plurality of time steps comprises, for one or more time steps: receiving an external reward for the time step as a result of the agent performing selected actions; and determining the reward for the time step based at least in part on the external reward for the time step. 10. The system of claim 1, wherein training the worker neural network subsystem on the rewards using reinforcement learning techniques comprises: training the worker neural network subsystem to generate action scores that maximize a time discounted combination of rewards. 13. 
The system of claim 1, wherein the worker neural network subsystem has been trained to generate action scores that maximize a time discounted combination of rewards, wherein each reward is a combination of an external reward received as a result of the agent performing selected actions and an intrinsic reward dependent upon goal vectors generated by the manager neural network subsystem. 11. The system of claim 9, wherein the operations performed by the training subsystem further comprise: training the manager neural network subsystem to generate goal vectors that result in action scores that encourage selection of actions that increase the external rewards received as a result of the agent performing the selected actions. 14. The system of claim 13, wherein the manager neural network subsystem has been trained to generate initial goal vectors that result in action scores that encourage selection of actions that increase the external rewards received as a result of the agent performing the selected actions. 12. The system of claim 1, wherein generating the latent representation, in the latent space, of the current state of the environment at the time step comprises: processing an observation characterizing the current state of the environment using a convolutional neural network. 15. The system of claim 1, wherein generating the latent representation, in the latent space, of the current state of the environment at the time step comprises: processing an observation characterizing the current state of the environment using a convolutional neural network. 13. The system of claim 3, wherein the goal recurrent neural network is a dilated long short-term memory (LSTM) neural network, wherein the dilated LSTM neural network is configured to maintain an internal state that is partitioned into r sub-states, wherein r is an integer greater than one, and wherein the dilated LSTM neural network is configured to, at each time step in the plurality of time steps: receive a network input for the time step; select a sub-state from the r sub-states; and process current values of the selected sub-state and the network input for the time step using an LSTM neural network to update the current values of the selected sub-state and to generate a network output for the time step in accordance with current values of a set of LSTM network parameters. 3. The system of claim 2, wherein the goal recurrent neural network is a dilated long short-term memory (LSTM) neural network, wherein the dilated LSTM neural network is configured to maintain an internal state that is partitioned into r sub-states, wherein r is an integer greater than one, and wherein the dilated LSTM neural network is configured to, at each time step in the plurality of time steps: receive a network input for the time step; select a sub-state from the r sub-states; and process current values of the selected sub-state and the network input for the time step using an LSTM neural network to update the current values of the selected sub-state and to generate a network output for the time step in accordance with current values of a set of LSTM network parameters. 14. The system of claim 13, wherein the dilated LSTM neural network is further configured to, for each of the time steps: pool the network output for the time step and the network outputs for up to a predetermined number of preceding time steps to generate a final network output for the time step. 4.
The system of claim 3, wherein the dilated LSTM neural network is further configured to, for each of the time steps: pool the network output for the time step and the network outputs for up to a predetermined number of preceding time steps to generate a final network output for the time step. 15. The system of claim 14, wherein pooling the network outputs comprises summing the network outputs. 5. The system of claim 4, wherein pooling the network outputs comprises summing the network outputs. 16. The system of claim 13, wherein the time steps in the plurality of time steps are indexed starting from 1 for the first time step in the plurality of time steps to T for the last time step in the plurality of time steps, wherein each sub-state is assigned an index ranging from 1 to r, and wherein selecting a sub-state from the r sub-states comprises: selecting the sub-state having an index that is equal to the index of the time step modulo r. 6. The system of claim 3, wherein the time steps in the plurality of time steps are indexed starting from 1 for the first time step in the plurality of time steps to T for the last time step in the plurality of time steps, wherein each sub-state is assigned an index ranging from 1 to r, and wherein selecting a sub-state from the r sub-states comprises: selecting the sub-state having an index that is equal to the index of the time step modulo r. 18. The system of claim 13, wherein processing current values of the selected sub-state and the network input for the time step using an LSTM neural network to update the current values of the selected sub-state and to generate a network output for the time step in accordance with current values of a set of LSTM network parameters comprises: setting an internal state of the LSTM neural network to the current values of the selected sub-state for the processing of the network input at the time step. 7. The system of claim 3, wherein processing current values of the selected sub-state and the network input for the time step using an LSTM neural network to update the current values of the selected sub-state and to generate a network output for the time step in accordance with current values of a set of LSTM network parameters comprises: setting an internal state of the LSTM neural network to the current values of the selected sub-state for the processing of the network input at the time step. 19 (22). 
One or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations for selecting actions to be performed by an agent that interacts with an environment by performing actions from a predetermined set of actions, the operations comprising: at each of a plurality of time steps: generating, by a manager neural network system, a latent representation, in a latent space, of a current state of the environment at the time step; generating, by the manager neural network system, based at least in part on the latent representation of the current state of the environment at the time step, a goal vector that defines, in the latent state space, an objective to be accomplished as a result of actions performed by the agent in the environment; generating, by a worker neural network system, a respective action score for each action in the predetermined set of actions based at least in part on the goal vector for the time step; and selecting, by a worker neural network system, an action from the predetermined set of actions to be performed by the agent at the time step using the action scores; and training the manager neural network system and the worker neural network system, comprising: determining a respective reward for each time step of the plurality of time steps, comprising, for one or more time steps: determining the reward for the time step based at least in part on a difference in direction between: (i) a vector representing a change in the latent representation of the state of the environment from a preceding time step to the time step, and (ii) the goal vector for the preceding time step; and training the worker neural network system on the rewards using reinforcement learning techniques. 19. One or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations for selecting actions to be performed by an agent that interacts with an environment by performing actions from a predetermined set of actions, the operations comprising, at each of a plurality of time steps: generating a latent representation, in a latent space, of a current state of the environment at the time step; generating, based at least in part on the latent representation of the current state of the environment at the time step, an initial goal vector that defines, in the latent state space, an objective to be accomplished as a result of actions performed by the agent in the environment; pooling the initial goal vector for the time step and initial goal vectors for one or more preceding time steps to generate a final goal vector for the time step, comprising: generating the final goal vector for the time step by element-wise combining the initial goal vector for the time step and the initial goal vectors for one or more preceding time steps; generating a respective action score for each action in the predetermined set of actions based at least in part on the final goal vector for the time step; and selecting an action from the predetermined set of actions to be performed by the agent at the time step using the action scores. 16. 
The system of claim 1, further comprising a training subsystem that is configured to perform operations comprising: determining a respective reward for each time step of the plurality of time steps, comprising, for one or more time steps: determining the reward for the time step based at least in part on a difference in direction between: (i) a vector representing a change in the latent representation of the state of the environment from a preceding time step to the time step, and (ii) the initial goal vector for the preceding time step; and training the worker neural network subsystem on the rewards using reinforcement learning techniques. 20 (21). A method performed by one or more data processing apparatus for selecting actions to be performed by an agent that interacts with an environment by performing actions from a predetermined set of actions, the method comprising: at each of a plurality of time steps: generating, by a manager neural network system, a latent representation, in a latent space, of a current state of the environment at the time step; generating, by the manager neural network system, based at least in part on the latent representation of the current state of the environment at the time step, a goal vector that defines, in the latent state space, an objective to be accomplished as a result of actions performed by the agent in the environment; generating, by a worker neural network system, a respective action score for each action in the predetermined set of actions based at least in part on the goal vector for the time step; and selecting, by a worker neural network system, an action from the predetermined set of actions to be performed by the agent at the time step using the action scores; and training the manager neural network system and the worker neural network system, comprising: determining a respective reward for each time step of the plurality of time steps, comprising, for one or more time steps: determining the reward for the time step based at least in part on a difference in direction between: (i) a vector representing a change in the latent representation of the state of the environment from a preceding time step to the time step, and (ii) the goal vector for the preceding time step; and training the worker neural network system on the rewards using reinforcement learning techniques. 20. 
A method performed by one or more data processing apparatus for selecting actions to be performed by an agent that interacts with an environment by performing actions from a predetermined set of actions, the method comprising, at each of a plurality of time steps: generating a latent representation, in a latent space, of a current state of the environment at the time step; generating, based at least in part on the latent representation of the current state of the environment at the time step, an initial goal vector that defines, in the latent state space, an objective to be accomplished as a result of actions performed by the agent in the environment; pooling the initial goal vector for the time step and initial goal vectors for one or more preceding time steps to generate a final goal vector for the time step, comprising: generating the final goal vector for the time step by element-wise combining the initial goal vector for the time step and the initial goal vectors for one or more preceding time steps; generating a respective action score for each action in the predetermined set of actions based at least in part on the final goal vector for the time step; and selecting an action from the predetermined set of actions to be performed by the agent at the time step using the action scores. 16. The system of claim 1, further comprising a training subsystem that is configured to perform operations comprising: determining a respective reward for each time step of the plurality of time steps, comprising, for one or more time steps: determining the reward for the time step based at least in part on a difference in direction between: (i) a vector representing a change in the latent representation of the state of the environment from a preceding time step to the time step, and (ii) the initial goal vector for the preceding time step; and training the worker neural network subsystem on the rewards using reinforcement learning techniques. Claims 1, 3-8, and 10-22 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1-4, 6-12, 15-18, and 20 of U.S. Patent No. 10,679,126 (“reference patent 2”) in view of Ghesu. A claim comparison chart follows, followed by an analysis.
Instant Application | Reference Patent 2
1.
A system for selecting actions to be performed by an agent that interacts with an environment by performing actions from a predetermined set of actions, the system comprising one or more computers and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to implement: a manager neural network subsystem that is configured to, at each of a plurality of time steps: generate a latent representation, in a latent space, of a current state of the environment at the time step; generate, based at least in part on the latent representation of the current state of the environment at the time step, an initial goal vector that defines, in the latent state space, an objective to be accomplished as a result of actions performed by the agent in the environment; and a worker neural network subsystem that is configured to, at each of the plurality of time steps: generate a respective action score for each action in the predetermined set of actions based at least in part on the final goal vector for the time step; and select an action from the predetermined set of actions to be performed by the agent at the time step using the action scores; and a training subsystem that is configured to perform operations comprising: determining a respective reward for each time step of the plurality of time steps, comprising, for one or more time steps: determining the reward for the time step based at least in part on a difference in direction between: (i) a vector representing a change in the latent representation of the state of the environment from a preceding time step to the time step, and (ii) the goal vector for the preceding time step; and training the worker neural network subsystem on the rewards using reinforcement learning techniques. 3. The system of claim 1, wherein generating the initial goal vector, comprises: processing the latent representation using a goal recurrent neural network, wherein the goal recurrent neural network is configured to receive the latent representation and to process the latent representation in accordance with a hidden state of the goal recurrent neural network to generate the initial goal vector and to update the hidden state of the goal recurrent neural network. 4. The system of claim 1, wherein at each of the plurality of time steps, generating the respective action score for each action in the predetermined set of actions comprises: generating a respective action embedding vector in an embedding space for each action in the predetermined set of actions; projecting the final goal vector for the time step to the embedding space to generate a goal embedding vector; and modulating the respective action embedding vector for each action by the goal embedding vector to generate the respective action score for each action in the predetermined set of actions. 1. 
A system for selecting actions to be performed by an agent that interacts with an environment by performing actions from a predetermined set of actions, the system comprising one or more computers and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to implement: a manager neural network subsystem that is configured to, at each of a plurality of time steps: receive an intermediate representation of a current state of the environment at the time step, map the intermediate representation to a latent representation of the current state in a latent state space, process the latent representation using a goal recurrent neural network, wherein the goal recurrent neural network is configured to receive the latent representation and to process the latent representation in accordance with a current hidden state of the goal recurrent neural network to generate an initial goal vector in a goal space for the time step and to update an internal state of the goal recurrent neural network, wherein the initial goal vector defines, in the latent state space, an objective to be accomplished as a result of actions performed by the agent in the environment, and pool the initial goal vector for the time step and initial goal vectors for one or more preceding time steps to generate a final goal vector for the time step; a worker neural network subsystem that is configured to, at each of the plurality of time steps: receive the intermediate representation of the current state of the environment at the time step, map the intermediate representation to a respective action embedding vector in an embedding space for each action in the predetermined set of actions, project the final goal vector for the time step from the goal space to the embedding space to generate a goal embedding vector, and modulate the respective action embedding vector for each action by the goal embedding vector to generate a respective action score for each action in the predetermined set of actions; and an action selection subsystem, wherein the action selection subsystem is configured to, at each of the plurality of time steps: receive an observation characterizing the current state of the environment at the time step, generate the intermediate representation from the observation, provide the intermediate representation as input to the manager neural network to generate the final goal vector for the time step, provide the intermediate representation and the final goal vector as input to the worker neural network to generate the action scores, and select an action from the predetermined set of actions to be performed by the agent in response to the observation using the action scores. 5. The system of claim 1, wherein selecting the action comprises selecting the action having a highest action score. 2. The system of claim 1, wherein selecting the action comprises selecting the action having a highest action score. 6.
The system of claim 4, wherein generating the respective action embedding vector in the embedding space for each action in the predetermined set of actions comprises: processing a representation of the current state of the environment using an action score recurrent neural network, in accordance with a hidden state of the action score recurrent neural network, to generate the action embedding vectors and to update the hidden state of the action score recurrent neural network. 4. The system of claim 1, wherein mapping the intermediate representation to a respective action embedding vector in an embedding space for each action in the predetermined set of actions comprises: processing the intermediate representation using an action score recurrent neural network, wherein the action score recurrent neural network is configured to receive the intermediate representation and to process the intermediate representation in accordance with a current hidden state of the action score recurrent neural network to generate the action embedding vectors and to update the hidden state of the action score neural network. 7. The system of claim 4, wherein the goal vector has a higher dimensionality than the goal embedding vector. 6. The system of claim 1, wherein the goal space has a higher dimensionality than the embedding space. 8. The system of claim 7, wherein the dimensionality of the goal vector is at least ten times higher than the dimensionality of the goal embedding vector. 7. The system of claim 6, wherein the dimensionality of the goal space is at least ten times higher than the dimensionality of the embedding space. 10. The system of claim 1, wherein training the worker neural network on the rewards using reinforcement learning techniques comprises: training the worker neural network subsystem to generate action scores that maximize a time discounted combination of rewards. 8. The system of claim 1, wherein the worker neural network subsystem has been trained to generate action scores that maximize a time discounted combination of rewards, wherein each reward is a combination of an external reward received as a result of the agent performing the selected action and an intrinsic reward dependent upon the goal vectors generated by the manager neural network subsystem. 11. The system of claim 9, wherein the operations performed by the training subsystem further comprise: training the manager neural network subsystem to generate goal vectors that result in action scores that encourage selection of actions that increase the external rewards received as a result of the agent performing the selected actions. 9. The system of claim 8, wherein the manager neural network subsystem has been trained to generate initial goal vectors that result in action scores that encourage selection of actions that increase the external rewards received as a result of the agent performing the selected actions. 12. The system of claim 1, wherein generating the latent representation, in the latent space, of the current state of the environment at the time step comprises: processing an observation characterizing the current state of the environment using a convolutional neural network. 3. The system of claim 1, wherein generating the intermediate representation from the observation comprises processing the observation using a convolutional neural network. 13. 
Instant claim 13: The system of claim 3, wherein the goal recurrent neural network is a dilated long short-term memory (LSTM) neural network, wherein the dilated LSTM neural network is configured to maintain an internal state that is partitioned into r sub-states, wherein r is an integer greater than one, and wherein the dilated LSTM neural network is configured to, at each time step in the plurality of time steps: receive a network input for the time step; select a sub-state from the r sub-states; and process current values of the selected sub-state and the network input for the time step using an LSTM neural network to update the current values of the selected sub-state and to generate a network output for the time step in accordance with current values of a set of LSTM network parameters.

Reference patent claim 10: The system of claim 1, wherein the goal recurrent neural network is a dilated long short-term memory (LSTM) neural network, wherein the dilated LSTM neural network is configured to maintain an internal state that is partitioned into r sub-states, wherein r is an integer greater than one, and wherein the dilated LSTM neural network is configured to, at each time step in the plurality of time steps: receive a network input for the time step; select a sub-state from the r sub-states; and process current values of the selected sub-state and the network input for the time step using an LSTM neural network to update the current values of the selected sub-state and to generate a network output for the time step in accordance with current values of a set of LSTM network parameters.

Instant claim 14: The system of claim 13, wherein the dilated LSTM neural network is further configured to, for each of the time steps: pool the network output for the time step and the network outputs for up to a predetermined number of preceding time steps to generate a final network output for the time step.

Reference patent claim 11: The system of claim 10, wherein the dilated LSTM neural network is further configured to, for each of the time steps: pool the network output for the time step and the network outputs for up to a predetermined number of preceding time steps to generate a final network output for the time step.

Instant claim 15: The system of claim 14, wherein pooling the network outputs comprises summing the network outputs.

Reference patent claim 12: The system of claim 11, wherein pooling the network outputs comprises summing the network outputs.

Instant claim 16: The system of claim 13, wherein the time steps in the plurality of time steps are indexed starting from 1 for the first time step in the plurality of time steps to T for the last time step in the plurality of time steps, wherein each sub-state is assigned an index ranging from 1 to r, and wherein selecting a sub-state from the r sub-states comprises: selecting the sub-state having an index that is equal to the index of the time step modulo r.

Reference patent claim 15: The system of claim 10, wherein the time steps in the plurality of time steps are indexed starting from 1 for the first time step in the plurality of time steps to T for the last time step in the plurality of time steps, wherein each sub-state is assigned an index ranging from 1 to r, and wherein selecting a sub-state from the r sub-states comprises: selecting the sub-state having an index that is equal to the index of the time step modulo r.

Instant claim 17: The system of claim 13, wherein the LSTM neural network comprises a plurality of LSTM layers.

Reference patent claim 16: The system of claim 10, wherein the LSTM neural network comprises a plurality of LSTM layers.
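The dilated LSTM recited in the rows above admits a compact illustration: the internal state is partitioned into r sub-states, the sub-state indexed by the time step modulo r is swapped in as the LSTM's state and updated, and the recent per-step outputs are pooled by summing. In the sketch below the class name, gate layout, and pool_window parameter are invented, and a single shared set of LSTM parameters is assumed; this is a sketch of the claimed mechanism, not the applicant's implementation.

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

class DilatedLSTM:
    """Maintains r (h, c) sub-states; step t updates only sub-state t % r,
    then sums the outputs of up to pool_window preceding steps with the
    current output to produce the final network output."""

    def __init__(self, input_dim: int, hidden_dim: int, r: int, pool_window: int):
        rng = np.random.default_rng(0)
        self.r, self.pool_window = r, pool_window
        # One shared set of LSTM parameters for the i, f, o, g gates.
        self.W = rng.normal(0.0, 0.1, size=(4 * hidden_dim, input_dim + hidden_dim))
        self.b = np.zeros(4 * hidden_dim)
        self.sub_h = np.zeros((r, hidden_dim))   # r hidden sub-states
        self.sub_c = np.zeros((r, hidden_dim))   # r cell sub-states
        self.outputs: list = []                  # recent per-step outputs

    def step(self, t: int, x: np.ndarray) -> np.ndarray:
        k = t % self.r                           # select sub-state by time index modulo r
        h, c = self.sub_h[k], self.sub_c[k]      # set LSTM state to the selected sub-state
        i, f, o, g = np.split(self.W @ np.concatenate([x, h]) + self.b, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        self.sub_h[k], self.sub_c[k] = h, c      # write the updated sub-state back
        self.outputs = (self.outputs + [h])[-(self.pool_window + 1):]
        return np.sum(self.outputs, axis=0)      # pooled (summed) final output

# Toy usage: r = 10 sub-states, pooling over up to 5 preceding outputs.
lstm = DilatedLSTM(input_dim=8, hidden_dim=16, r=10, pool_window=5)
outs = [lstm.step(t, np.ones(8)) for t in range(25)]
```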
Instant claim 18: The system of claim 13, wherein processing current values of the selected sub-state and the network input for the time step using an LSTM neural network to update the current values of the selected sub-state and to generate a network output for the time step in accordance with current values of a set of LSTM network parameters comprises: setting an internal state of the LSTM neural network to the current values of the selected sub-state for the processing of the network input at the time step.

Reference patent claim 17: The system of claim 10, wherein processing current values of the selected sub-state and the network input for the time step using an LSTM neural network to update the current values of the selected sub-state and to generate a network output for the time step in accordance with current values of a set of LSTM network parameters comprises: setting an internal state of the LSTM neural network to the current values of the selected sub-state for the processing of the network input at the time step.

Instant claim 19 (22): One or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations for selecting actions to be performed by an agent that interacts with an environment by performing actions from a predetermined set of actions, the operations comprising: at each of a plurality of time steps: generating, by a manager neural network system, a latent representation, in a latent space, of a current state of the environment at the time step; generating, by the manager neural network system, based at least in part on the latent representation of the current state of the environment at the time step, a goal vector that defines, in the latent state space, an objective to be accomplished as a result of actions performed by the agent in the environment; generating, by a worker neural network system, a respective action score for each action in the predetermined set of actions based at least in part on the final goal vector for the time step; and selecting, by a worker neural network system, an action from the predetermined set of actions to be performed by the agent at the time step using the action scores; and training the manager neural network system and the worker neural network system, comprising: determining a respective reward for each time step of the plurality of time steps, comprising, for one or more time steps: determining the reward for the time step based at least in part on a difference in direction between: (i) a vector representing a change in the latent representation of the state of the environment from a preceding time step to the time step, and (ii) the goal vector for the preceding time step; and training the worker neural network system on the rewards using reinforcement learning techniques.
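The training limitation in claim 19 (22) turns on a "difference in direction" between the change in the latent representation and the previous goal vector. Cosine similarity is a standard way to score directional agreement, and the sketch below assumes it; the function name and the epsilon guard are invented, and the record may define the comparison differently.

```python
import numpy as np

def intrinsic_reward(s_prev: np.ndarray, s_curr: np.ndarray,
                     goal_prev: np.ndarray, eps: float = 1e-8) -> float:
    """Reward is high when the latent state moved in the direction that the
    previous goal vector pointed (cosine of the angle between the two)."""
    delta = s_curr - s_prev                      # change in latent representation
    denom = np.linalg.norm(delta) * np.linalg.norm(goal_prev) + eps
    return float(np.dot(delta, goal_prev) / denom)

# Toy usage: the state moves exactly along the goal direction -> reward ~ 1.0.
s_prev, s_curr = np.zeros(16), np.ones(16)
reward = intrinsic_reward(s_prev, s_curr, goal_prev=np.ones(16))
```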
Reference patent claim 18: One or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations for selecting actions to be performed by an agent that interacts with an environment by performing actions from a predetermined set of actions, the operations comprising, at each of a plurality of time steps: receiving an observation characterizing a current state of the environment at the time step; generating an intermediate representation of the current state of the environment at the time step from the observation; mapping the intermediate representation to a latent representation of the current state in a latent state space; processing the latent representation using a goal recurrent neural network, wherein the goal recurrent neural network is configured to receive the latent representation and to process the latent representation in accordance with a current hidden state of the goal recurrent neural network to generate an initial goal vector in a goal space for the time step and to update an internal state of the goal recurrent neural network, wherein the initial goal vector defines, in the latent state space, an objective to be accomplished as a result of actions performed by the agent in the environment; pooling the initial goal vector for the time step and initial goal vectors for one or more preceding time steps to generate a final goal vector for the time step; mapping the intermediate representation to a respective action embedding vector in an embedding space for each action in the predetermined set of actions; projecting the final goal vector for the time step from the goal space to the embedding space to generate a goal embedding vector; modulating the respective action embedding vector for each action by the goal embedding vector to generate a respective action score for each action in the predetermined set of actions; and selecting an action from the predetermined set of actions to be performed by the agent in response to the observation using the action scores.
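Taken together, reference claim 18 above walks the full per-step pipeline: observation, intermediate representation, latent state, goal generation and pooling, projection, modulation, and selection. The end-to-end sketch below traces one time step with linear stand-ins for the claimed convolutional and recurrent networks; every name and dimension is assumed (the goal and embedding sizes are merely chosen to respect the at-least-ten-times dimensionality recitation earlier in the chart).

```python
import numpy as np

OBS_DIM, INTER_DIM, LATENT_DIM = 32, 24, 16
GOAL_DIM, EMBED_DIM, N_ACTIONS = 80, 8, 4   # GOAL_DIM = 10 * EMBED_DIM

rng = np.random.default_rng(1)
W_inter = rng.normal(0.0, 0.1, size=(INTER_DIM, OBS_DIM))       # stand-in for the CNN
W_latent = rng.normal(0.0, 0.1, size=(LATENT_DIM, INTER_DIM))
W_goal = rng.normal(0.0, 0.1, size=(GOAL_DIM, LATENT_DIM))      # stand-in for the goal RNN
W_embed = rng.normal(0.0, 0.1, size=(N_ACTIONS, EMBED_DIM, INTER_DIM))
W_proj = rng.normal(0.0, 0.1, size=(EMBED_DIM, GOAL_DIM))       # goal -> embedding space

def select_action(observation: np.ndarray, goal_history: list) -> int:
    inter = np.tanh(W_inter @ observation)         # intermediate representation
    latent = np.tanh(W_latent @ inter)             # latent state representation
    goal_history.append(np.tanh(W_goal @ latent))  # manager's initial goal vector
    final_goal = np.sum(goal_history, axis=0)      # pooled final goal vector
    embeddings = np.tanh(W_embed @ inter)          # per-action embedding vectors
    goal_embedding = W_proj @ final_goal           # projected goal embedding
    scores = embeddings @ goal_embedding           # modulation -> action scores
    return int(np.argmax(scores))                  # select the highest-scoring action

# Toy usage: run three time steps, letting the goal history accumulate.
history: list = []
actions = [select_action(rng.normal(size=OBS_DIM), history) for _ in range(3)]
```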
Instant claim 20 (21): A method performed by one or more data processing apparatus for selecting actions to be performed by an agent that interacts with an environment by performing actions from a predetermined set of actions, the method comprising: at each of a plurality of time steps: generating, by a manager neural network system, a latent representation, in a latent space, of a current state of the environment at the time step; generating, by the manager neural network system, based at least in part on the latent representation of the current state of the environment at the time step, a goal vector that defines, in the latent state space, an objective to be accomplished as a result of actions performed by the agent in the environment; generating, by a worker neural network system, a respective action score for each action in the predetermined set of actions based at least in part on the goal vector for the time step; and selecting, by [the] worker neural network system, an action from the predetermined set of actions to be performed by the agent at the time step using the action scores; and training the manager neural network system and the worker neural network system, comprising: determining a respective reward for each time step of the plurality of time steps, comprising, for one or more time steps: determining the reward for the time step based at least in part on a difference in direction between: (i) a vector representing a change in the latent representation of the state of the environment from a preceding time step to the time step, and (ii) the goal vector for the preceding time step; and training the worker neural network system on the rewards using reinforcement learning techniques.

Reference patent claim 20: A method performed by one or more data processing apparatus for selecting actions to be performed by an agent that interacts with an environment by performing actions from a predetermined set of actions, the method comprising, at each of a plurality of time steps: receiving an observation characterizing a current state of the environment at the time step; generating an intermediate representation of the current state of the environment at the time step from the observation; mapping the intermediate representation to a latent representation of the current state in a latent state space; processing the latent representation using a goal recurrent neural network, wherein the goal recurrent neural network is configured to receive the latent representation and to process the latent representation in accordance with a current hidden state of the goal recurrent neural network to generate an initial goal vector in a goal space for the time step and to update an internal state of the goal recurrent neural network, wherein the initial goal vector defines, in the latent state space, an objective to be accomplished as a result of actions performed by the agent in the environment; pooling the initial goal vector for the time step and initial goal vectors for one or more preceding time steps to generate a final goal vector for the time step; mapping the intermediate representation to a respective action embedding vector in an embedding space for each action in the predetermined set of actions; projecting the final goal vector for the time step from the goal space to the embedding space to generate a goal embedding vector; modulating the respective action embedding vector for each action by the goal embedding vector to generate a respective action score for each action in the predetermined set of actions; and selecting an action from the predetermined set of actions to be performed by the agent in response to the observation using the action scores.
As can be seen from the foregoing, reference patent 2 contains all the limitations contained in their counterpart instant claims with the exception of the functions of the training subsystem. These functions are disclosed by Ghesu, as shown above in the rejection under 35 U.S.C. § 103. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified reference patent 2 to determine the reward based on a difference between directions of vectors, as disclosed by Ghesu, for substantially the same reasons as given in the rejection of claim 1 under § 103.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to RYAN C VAUGHN, whose telephone number is (571) 272-4849. The examiner can normally be reached M-R 7:50a-5:50p ET.

Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Kamran Afshar, can be reached at 571-272-7796. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/RYAN C VAUGHN/
Primary Examiner, Art Unit 2125

¹ The English translation of the Chinese office action submitted with the IDS of February 6, 2026, while entertaining, is nonsensical. To the extent that Applicant wishes for Examiner to take this action into consideration in more than a perfunctory fashion, Applicant is advised to submit a more intelligible translation.

² The provisional application filed February 24, 2017 does not appear to disclose this limitation. Therefore, this claim is only entitled to the filing date of the PCT application, i.e., February 19, 2018.

Prosecution Timeline

Nov 30, 2022
Application Filed
Apr 06, 2026
Non-Final Rejection — §101, §103, §112, §DP (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602448
PROGRESSIVE NEURAL ORDINARY DIFFERENTIAL EQUATIONS
2y 5m to grant · Granted Apr 14, 2026
Patent 12602610
CLASSIFICATION BASED ON IMBALANCED DATASET
2y 5m to grant · Granted Apr 14, 2026
Patent 12561583
Systems and Methods for Machine Learning in Hyperbolic Space
2y 5m to grant · Granted Feb 24, 2026
Patent 12541703
MULTITASKING SCHEME FOR QUANTUM COMPUTERS
2y 5m to grant · Granted Feb 03, 2026
Patent 12511526
METHOD FOR PREDICTING A MOLECULAR STRUCTURE
2y 5m to grant · Granted Dec 30, 2025
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

1-2
Expected OA Rounds
62%
Grant Probability
81%
With Interview (+19.4%)
3y 9m
Median Time to Grant
Low
PTA Risk
Based on 235 resolved cases by this examiner. Grant probability derived from career allow rate.
