DETAILED ACTION
1. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
2. This communication is in response to the Applicant’s submission filed 22 July 2022, where:
Claims 1-10 have been cancelled via preliminary amendment filed 22 July 2022.
Claims 13, 15, 17-20, 23, and 24 have been amended via preliminary amendment filed 22 July 2022.
New claims 25-30 are presented for examination via preliminary amendment filed 22 July 2022.
Claims 11-30 are pending.
Claims 11-30 are rejected.
Foreign priority is claimed to GR20200100037, filed 28 January 2020. A certified copy of this paper has been filed on 22 July 2022. Accordingly, receipt is acknowledged of certified copies of papers required by 37 CFR 1.55.
Information Disclosure Statement
3. Information disclosure statements were submitted on 13 December 2023, 23 January 2024, 26 August 2024, 25 March 2025, 02 June 2025, and 02 September 2025. The submissions comply with the provisions of 37 CFR 1.97. Accordingly, the Examiner considered the information disclosure statements.
Claim Rejections - 35 U.S.C. § 112
4. The following is a quotation of 35 U.S.C. § 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
5. Claims 12, 17, 18, and 19 are rejected under 35 U.S.C. § 112(b) or 35 U.S.C. § 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. § 112, the Applicant), regards as the invention.
Claim 12, line 4, recites “the immediate reward value.” There is insufficient antecedent basis for this limitation in the claim.
Claim 17, line 5, recites “the predicted immediate reward.” There is insufficient antecedent basis for this limitation in the claim.
Claim 18, line 3, recites “the hidden states.” There is insufficient antecedent basis for this limitation in the claim.
Claim 19, lines 1 & 2, recites “the actual return.” There is insufficient antecedent basis for this limitation in the claim.
Claim 19, line 2, recites “the subsequent state.” There is insufficient antecedent basis for this limitation in the claim.
Claim Rejections - 35 U.S.C. § 101
6. 35 U.S.C. § 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
7. Claims 11-30 are rejected under 35 U.S.C. § 101 because the claimed invention is directed to an abstract idea without significantly more.
Claim 11 recites a method, which is a process, and thus one of the statutory categories of patentable subject matter. (35 U.S.C. § 101).
However, under Step 2A Prong One, the claim recites the limitations of “[(b)] performing a plurality of planning iterations to generate plan data that indicates a respective value to performing the task of the agent performing each of the set of actions in the environment and starting from the current environment state,” and “[(c)] selecting, from the set of actions, an action to be performed by the agent in response to the current observation based on the generated plan data.” The activities of “[(b)] performing . . . planning iterations,” and “[(c)] selecting” can practically be performed in the human mind, including, for example, observations, evaluations, judgments, and opinions, and accordingly, are a mental process, (MPEP § 2106.04(a)(2) sub III), which is one of the groupings of abstract ideas. (MPEP § 2106.04(a)(2)).
The claim recites more details or specifics to the abstract idea of “[(b)] performing a plurality of planning iterations,” where “performing each planning iteration comprises: [(b.1)] selecting a sequence of actions to be performed by the agent starting from the current environment state based on outputs generated by: [(b.1.1)] (i) a dynamics model . . . ; and [(b.1.2)] (ii) a prediction model . . . ,” and accordingly, is merely more specific to the abstract idea. Moreover, the claim recites “[(b.1)] selecting . . . based on outputs generated by: [(b.1.1)] a dynamics model . . . ; and [(b.1.2)] a prediction model . . . ,” where the plain meaning of such “outputs” refers to predictions or decisions made by a machine learning model based on input data. The broadest reasonable interpretation of such “outputs” covers predictive activities by a human user, (see Specification at p. 18, ll. 22-24 (“agent that is being controlled by a human user”)), and can practically be performed in the human mind, including, for example, observations, evaluations, judgments, and opinions, and accordingly, are a mental process, (MPEP § 2106.04(a)(2) sub III), which is one of the groupings of abstract ideas. (MPEP § 2106.04(a)(2)). Accordingly, claim 11 is directed to an abstract idea.
Under Step 2A Prong Two, the claim as a whole is not integrated into a practical application, because the additional elements recited in the claim beyond the identified judicial exception include “a dynamics model,” and “a prediction model,” which are described at a high level of generality, and accordingly, are generic computer components used to implement the abstract idea, (MPEP § 2106.05(f); Specification at p. 11, ll. 31-33 (“The representation, dynamics, and prediction models can each be implemented as a respective neural network with any appropriate neural network architecture that enables it to perform its described function.”)), and do not serve to integrate the abstract idea into a practical application. Also, the claim recites “[(a)] receiving a current observation characterizing a current environment state of the environment,” which is the pre-processing insignificant extra-solution activity of data gathering, (MPEP § 2106.05(g)), that does not serve to integrate the abstract idea into a practical application. Thus, claim 11 is directed to an abstract idea.
Finally, under Step 2B, the additional elements, taken alone or in combination, do not represent significantly more than the abstract idea itself. The claim includes the elements of “a dynamics model,” and “a prediction model,” which are described at a high level of generality, and accordingly, are generic computer components used to implement the abstract idea, (MPEP § 2106.05(f); Specification at p. 11, ll. 31-33 (“The representation, dynamics, and prediction models can each be implemented as a respective neural network with any appropriate neural network architecture that enables it to perform its described function.”)), and do not amount to significantly more than the abstract idea. Also, the claim recites “[(a)] receiving a current observation characterizing a current environment state of the environment,” which is the well-understood, routine, and conventional activity of receiving data over a network, (MPEP § 2106.05(d) sub II.i), that does not amount to significantly more than the abstract idea. Therefore, claim 11 is subject-matter ineligible.
Claim 23 recites a system, which is a product, and thus one of the statutory categories of patentable subject matter. (35 U.S.C. § 101).
However, under Step 2A Prong One, the claim recites the limitations of “[(b)] performing a plurality of planning iterations to generate plan data that indicates a respective value to performing the task of the agent performing each of the set of actions in the environment and starting from the current environment state,” and “[(c)] selecting, from the set of actions, an action to be performed by the agent in response to the current observation based on the generated plan data.” The activities of “[(b)] performing . . . planning iterations,” and “[(c)] selecting” can practically be performed in the human mind, including, for example, observations, evaluations, judgments, and opinions, and accordingly, are a mental process, (MPEP § 2106.04(a)(2) sub III), which is one of the groupings of abstract ideas. (MPEP § 2106.04(a)(2)).
The claim recites more details or specifics to the abstract idea of “[(b)] performing a plurality of planning iterations,” where “performing each planning iteration comprises: [(b.1)] selecting a sequence of actions to be performed by the agent starting from the current environment state based on outputs generated by: [(b.1.1)] (i) a dynamics model . . . ; and [(b.1.2)] (ii) a prediction model . . . ,” and accordingly, is merely more specific to the abstract idea. Moreover, the claim recites “[(b.1)] selecting . . . based on outputs generated by: [(b.1.1)] a dynamics model . . . ; and [(b.1.2)] a prediction model . . . ,” where the plain meaning of the claimed “outputs” refers to predictions or decisions made by a machine learning model based on input data. The broadest reasonable interpretation of the claimed “outputs” covers predictive activities by a human user, (see, e.g., Specification at p. 18, ll. 22-24 (“agent that is being controlled by a human user”)), and can practically be performed in the human mind, including, for example, observations, evaluations, judgments, and opinions, and accordingly, are a mental process, (MPEP § 2106.04(a)(2) sub III), which is one of the groupings of abstract ideas. (MPEP § 2106.04(a)(2)). Accordingly, claim 23 is directed to an abstract idea.
Under Step 2A Prong Two, the claim as a whole is not integrated into a practical application, because the additional elements recited in the claim beyond the identified judicial exception include “[a] system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform operations,” which is the use of generic computer components (one or more computers, one or more storage devices) to execute instructions to implement the abstract idea, (MPEP § 2106.05(f)), and does not serve to integrate the abstract idea into a practical application. Also, the claim recites additional elements of “a dynamics model,” and “a prediction model,” which are described at a high level of generality, and accordingly, are generic computer components used to implement the abstract idea, (MPEP § 2106.05(f); see also Specification at p. 11, ll. 31-33 (“The representation, dynamics, and prediction models can each be implemented as a respective neural network with any appropriate neural network architecture that enables it to perform its described function.”)), and do not serve to integrate the abstract idea into a practical application. Also, the claim recites “[(a)] receiving a current observation characterizing a current environment state of the environment,” which is the pre-processing insignificant extra-solution activity of data gathering, (MPEP § 2106.05(g)), that does not serve to integrate the abstract idea into a practical application. Thus, claim 23 is directed to an abstract idea.
Finally, under Step 2B, the additional elements, taken alone or in combination, do not represent significantly more than the abstract idea itself. The claim includes “[a] system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform operations,” which is the use of generic computer components (one or more computers, one or more storage devices) to execute instructions to implement the abstract idea, (MPEP § 2106.05(f)), and does not amount to significantly more than the abstract idea. Also, the claim recites additional elements of “a dynamics model,” and “a prediction model,” which are described at a high level of generality, and accordingly, are generic computer components used to implement the abstract idea, (MPEP § 2106.05(f); see also Specification at p. 11, ll. 31-33 (“The representation, dynamics, and prediction models can each be implemented as a respective neural network with any appropriate neural network architecture that enables it to perform its described function.”)), and do not amount to significantly more than the abstract idea. Also, the claim recites “[(a)] receiving a current observation characterizing a current environment state of the environment,” which is the well-understood, routine, and conventional activity of receiving data over a network, (MPEP § 2106.05(d) sub II.i), that does not amount to significantly more than the abstract idea. Therefore, claim 23 is subject-matter ineligible.
Claim 24 recites a “one or more computer storage media,” which is a product, and thus one of the statutory categories of patentable subject matter. (35 U.S.C. § 101).
However, under Step 2A Prong One, the claim recites the limitations of “[(b)] performing a plurality of planning iterations to generate plan data that indicates a respective value to performing the task of the agent performing each of the set of actions in the environment and starting from the current environment state,” and “[(c)] selecting, from the set of actions, an action to be performed by the agent in response to the current observation based on the generated plan data.” The activities of “[(b)] performing . . . planning iterations,” and “[(c)] selecting” can practically be performed in the human mind, including, for example, observations, evaluations, judgments, and opinions, and accordingly, are a mental process, (MPEP § 2106.04(a)(2) sub III), which is one of the groupings of abstract ideas. (MPEP § 2106.04(a)(2)).
The claim recites more details or specifics to the abstract idea of “[(b)] performing a plurality of planning iterations,” where “performing each planning iteration comprises: [(b.1)] selecting a sequence of actions to be performed by the agent starting from the current environment state based on outputs generated by: [(b.1.1)] (i) a dynamics model . . . ; and [(b.1.2)] (ii) a prediction model . . . ,” and accordingly, is merely more specific to the abstract idea. Moreover, the claim recites “[(b.1)] selecting . . . based on outputs generated by: [(b.1.1)] a dynamics model . . . ; and [(b.1.2)] a prediction model . . . ,” where the plain meaning of the claimed “outputs” refers to predictions or decisions made by a machine learning model based on input data. The broadest reasonable interpretation of the claimed “outputs” covers predictive activities by a human user, (see, e.g., Specification at p. 18, ll. 22-24 (“agent that is being controlled by a human user”)), and can practically be performed in the human mind, including, for example, observations, evaluations, judgments, and opinions, and accordingly, are a mental process, (MPEP § 2106.04(a)(2) sub III), which is one of the groupings of abstract ideas. (MPEP § 2106.04(a)(2)). Accordingly, claim 24 is directed to an abstract idea.
Under Step 2A Prong Two, the claim as a whole is not integrated into a practical application, because the additional elements recited in the claim beyond the identified judicial exception include “[o]ne or more computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform the operations,” which is the use of generic computer components (one or more computers, one or more computer storage media) to execute instructions to implement the abstract idea, (MPEP § 2106.05(f)), and does not serve to integrate the abstract idea into a practical application. Also, the claim recites additional elements of “a dynamics model,” and “a prediction model,” which are described at a high level of generality, and accordingly, are generic computer components used to implement the abstract idea, (MPEP § 2106.05(f); see also Specification at p. 11, ll. 31-33 (“The representation, dynamics, and prediction models can each be implemented as a respective neural network with any appropriate neural network architecture that enables it to perform its described function.”)), and do not serve to integrate the abstract idea into a practical application. Also, the claim recites “[(a)] receiving a current observation characterizing a current environment state of the environment,” which is the pre-processing insignificant extra-solution activity of data gathering, (MPEP § 2106.05(g)), that does not serve to integrate the abstract idea into a practical application. Thus, claim 24 is directed to an abstract idea.
Finally, under Step 2B, the additional elements, taken alone or in combination, do not represent significantly more than the abstract idea itself. The claim includes “[o]ne or more computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform the operations,” which is the use of generic computer components (one or more computers, one or more computer storage media) to execute instructions to implement the abstract idea, (MPEP § 2106.05(f)), and does not amount to significantly more than the abstract idea. Also, the claim recites additional elements of “a dynamics model,” and “a prediction model,” which are described at a high level of generality, and accordingly, are generic computer components used to implement the abstract idea, (MPEP § 2106.05(f); see also Specification at p. 11, ll. 31-33 (“The representation, dynamics, and prediction models can each be implemented as a respective neural network with any appropriate neural network architecture that enables it to perform its described function.”)), and do not amount to significantly more than the abstract idea. Also, the claim recites “[(a)] receiving a current observation characterizing a current environment state of the environment,” which is the well-understood, routine, and conventional activity of receiving data over a network, (MPEP § 2106.05(d) sub II.i), that does not amount to significantly more than the abstract idea. Therefore, claim 24 is subject-matter ineligible.
Claim 12 depends directly or indirectly from claim 11. Claim 25 depends directly or indirectly from claim 23. The claims recite more details or specifics to the abstract idea of “[(b.1.1)] the dynamics model . . . output,” which includes “[(b.1.1.4)] a predicted immediate reward value that represents an immediate reward that would be received if the agent performed the input action when the environment is in the input environment state,” and accordingly, are merely more specific to the abstract idea. The claims also provide more details or specifics to the abstract idea of “[(b.1.1.4)] a predicted immediate reward value,” “[(b.1.1.4.1)] wherein the immediate reward value is a numerical value that represents a progress in completing the task as a result of performing the input action when the environment is in the input environment state,” and accordingly, are merely more specific to the abstract idea. The abstract idea of these claims is not integrated into a practical application, (see MPEP § 2106.05(d)), nor do the claims amount to significantly more than the abstract idea, (MPEP § 2106.05(d)), because the claims recite no more than the abstract idea. Therefore, claims 12 and 25 are subject-matter ineligible.
Claim 13 depends directly or indirectly from claim 11. Claim 26 depends directly or indirectly from claim 23. The claims recite the limitation of “[(b.1)] wherein selecting the sequence of actions further comprises selecting the sequence of action based on: [(b.1.3)] outputs generated by a representation model that is configured to receive a representation input comprising the current observation and to generate as output a hidden state corresponding to the current state of the environment,” where the plain meaning of the claimed “outputs” refers to predictions or decisions made by a machine learning model based on input data. The broadest reasonable interpretation of the claimed “outputs” covers predictive activities by a human user, (see, e.g., Specification at p. 18, ll. 22-24 (“agent that is being controlled by a human user”)), and can practically be performed in the human mind, including, for example, observations, evaluations, judgments, and opinions, and accordingly, are a mental process, (MPEP § 2106.04(a)(2) sub III), which is one of the groupings of abstract ideas. (MPEP § 2106.04(a)(2)).
Under Step 2A Prong Two and Step 2B, the claims recite the additional element of “a representation model,” which is described at a high level of generality, and accordingly, is a generic computer component used to implement the abstract idea, (MPEP § 2106.05(f); see also Specification at p. 11, ll. 31-33 (“The representation, dynamics, and prediction models can each be implemented as a respective neural network with any appropriate neural network architecture that enables it to perform its described function.”)), that does not serve to integrate the abstract idea into a practical application and does not amount to significantly more than the abstract idea. Therefore, claims 13 and 26 are subject-matter ineligible.
Claim 14 depends directly or indirectly from claim 11. Claim 27 depends directly or indirectly from claim 23. The claims recite more details or specifics of the abstract idea of “[(b.1.3)] outputs generated by a representation model,” where “[(b.1.3)] the representation input [to the representation model] further comprises [(b.1.3.1)] one or more previous observations characterizing one or more previous states that the environment transitioned into prior to the current state,” and accordingly, are merely more specific to the abstract idea. Therefore, claims 14 and 27 are subject-matter ineligible.
Claim 15 depends directly or indirectly from claim 11. Claim 28 depends directly or indirectly from claim 23. The claims recite more details or specifics to the additional elements of “the [(b.1.3)] representation model, [(b.1.1)] the dynamics model, and [(b.1.2)] the prediction model,” in which the models “are jointly trained end-to-end on sampled trajectories from a set of trajectory data,” and accordingly, are merely more specific to the additional elements. Therefore, claims 15 and 28 are subject-matter ineligible.
Claim 16 depends directly or indirectly from claim 11. Claim 29 depends directly or indirectly from claim 23. The claims recite more details or specifics to the additional elements of “the [(b.1.3)] representation model, [(b.1.1)] the dynamics model, and [(b.1.2)] the prediction model,” in which the models “are jointly trained end-to-end on an objective that measures, for each of a plurality of particular observations,” and accordingly, are merely more specific to the additional elements. With respect to the “objective that measures,” the abstract idea of “measures” includes “(i) a policy error between the predicted policy output for the subsequent state generated conditioned on the particular observation and an actual policy that was used to select an action in response to the observation, and (ii) a value error between the value predicted for the subsequent state generated conditioned on the particular observation and an actual return received starting from the subsequent state,” which is merely more specific to the abstract idea. The abstract idea of these claims is not integrated into a practical application, (see MPEP § 2106.05(d)), nor do the claims amount to significantly more than the abstract idea, (MPEP § 2106.05(d)), because the claims recite no more than the abstract idea. Therefore, claims 16 and 29 are subject-matter ineligible.
Claim 17 depends directly or indirectly from claim 11. The claim further recites more details or specifics to the abstract idea of “measures,” “wherein the objective also measures, for each of the plurality of particular observations: . . . a reward error between the predicted immediate reward for the subsequent state generated conditioned on the particular observation and an actual immediate reward corresponding to the subsequent state,” and accordingly, is merely more specific to the abstract idea. The abstract idea of the claim is not integrated into a practical application, (see MPEP § 2106.05(d)), nor does the claim amount to significantly more than the abstract idea, (MPEP § 2106.05(d)), because the claim recites no more than the abstract idea. Therefore, claim 17 is subject-matter ineligible.
Claim 18 depends directly or indirectly from claim 11. The claim further recites more details or specifics to the additional elements “[(b.1.1)] the dynamics model and [(b.1.3)] the representation model,” in which they “are not trained to model semantics of the environment through the hidden states,” and accordingly, is merely more specific to the additional elements. Therefore, claim 18 is subject-matter ineligible.
Claim 19 depends directly or indirectly from claim 11. The claim recites more details or specifics of the abstract idea of a “measure,” “wherein the actual return starting from the subsequent state is a bootstrapped n-step return,” and accordingly, is merely more specific to the abstract idea. The abstract idea of this claim is not integrated into a practical application, (see MPEP § 2106.05(d)), nor does it amount to significantly more than the abstract idea, (MPEP § 2106.05(d)), because the claim recites no more than the abstract idea. Therefore, claim 19 is subject-matter ineligible.
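For context on the quoted claim term, a bootstrapped n-step return is conventionally written as follows; this is the standard reinforcement-learning formulation, offered for illustration only, and is not language drawn from the claims or the cited references:

    G_t^(n) = r_{t+1} + γ·r_{t+2} + . . . + γ^{n-1}·r_{t+n} + γ^n·V(s_{t+n})

where r_{t+k} denotes the reward received k steps after time t, γ ∈ (0, 1] is a discount factor, and V(s_{t+n}) is a predicted value that “bootstraps” the return in place of the actual rewards beyond step n.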
Claim 20 depends directly or indirectly from claim 11. The claim recites more details or specifics to the abstract idea of “[(c)] selecting, from the set of actions, an action,” which comprises “[(c.1)] selecting the action using a markov decision process (MDP) planning algorithm,” and accordingly, is merely more specific to the abstract idea. The abstract idea of this claim is not integrated into a practical application, (see MPEP § 2106.05(d)), nor does it amount to significantly more than the abstract idea, (MPEP § 2106.05(d)), because the claim recites no more than the abstract idea. Therefore, claim 20 is subject-matter ineligible.
Claim 21 depends directly or indirectly from claim 11. Claim 30 depends directly or indirectly from claim 23. The claims recite more details or specifics of the abstract idea of “[(b.1)] selecting the sequence of actions,” and “[(c)] selecting the action to be performed” that “are performed using a monte carlo tree search (MCTS) algorithm,” and accordingly, are merely more specific to the abstract idea. The abstract idea of these claims is not integrated into a practical application, (see MPEP § 2106.05(d)), nor do the claims amount to significantly more than the abstract idea, (MPEP § 2106.05(d)), because the claims recite no more than the abstract idea. Therefore, claims 21 and 30 are subject-matter ineligible.
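For context on the quoted “monte carlo tree search (MCTS)” language, a common in-tree action-selection rule used during MCTS planning iterations is the pUCT criterion; the formulation below is a standard one offered for illustration only, not language from the claims or the cited art:

    a* = argmax_a [ Q(s, a) + c · P(s, a) · √(Σ_b N(s, b)) / (1 + N(s, a)) ]

where Q(s, a) is the estimated value of taking action a in state s, P(s, a) is a prior score for the action (e.g., from a predicted policy output), N(s, a) is the visit count of the action, and c is an exploration constant.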
Claim 22 depends directly or indirectly from claim 11. The claim recites more details or specifics to the abstract idea of “[(c)] selecting, from the set of actions, an action,” which comprises “[(c.2)] determining, from the sequences of actions in the plan data, a sequence of actions that has a maximum associated value output,” and “[(c.3)] selecting, as the action to be performed by the agent in response to the current observation, the first action in the determined sequence of actions,” and accordingly, is merely more specific to the abstract idea. The abstract idea of this claim is not integrated into a practical application, (see MPEP § 2106.05(d)), nor does it amount to significantly more than the abstract idea, (MPEP § 2106.05(d)), because the claim recites no more than the abstract idea. Therefore, claim 22 is subject-matter ineligible.
Claim Rejections – 35 U.S.C. § 103
8. The following is a quotation of 35 U.S.C. § 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
9. The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. § 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
10. This application currently names joint inventors. In considering patentability of the claims the Examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the Examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
11. Claims 11, 12, 23, 24, and 25 are rejected under 35 U.S.C. § 103 as being unpatentable over Xie et al., “Improvisation through Physical Understanding: Using Novel Objects as Tools with Visual Foresight,” arXiv (2019) [hereinafter Xie] in view of US Published Application 20170213150 to Arel et al. [hereinafter Arel].
Regarding claims 11, 23, and 24, Xie teaches [a] method for selecting, from a set of actions, actions to be performed by an agent interacting with an environment to cause the agent to perform a task (Xie, Abstract), [a] system comprising one or more computers and one or more storage devices (Xie, abstract, teaches “[m]achine learning techniques have enabled robots to learn narrow, yet complex tasks and also perform broad, yet simple skills with a wide variety of objects [(that is, a “robot” inherently is a system comprising one or more computers and one or more storage devices)]”) of claim 23, and [o]ne or more computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform (Xie, abstract, teaches “[m]achine learning techniques have enabled robots to learn narrow, yet complex tasks and also perform broad, yet simple skills with a wide variety of objects [(that is, a “robot” inherently includes [o]ne or more computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform)]”), comprising:
[(a)] receiving a current observation characterizing a current environment state of the environment (Xie, left column of p. 3, “III. Capabilities for Improvisational Tool Use,” first full paragraph, teaches “autonomously collecting data of diverse object interactions, training predictive models of low-level sensory observations (i.e. action-conditioned video prediction [17]) [(that is, “low-level sensory observations” is receiving a current observation characterizing a current environment state of the environment)], and using these models to make plans to achieve goals involving tools”);
[(b)] performing a plurality of planning iterations to generate plan data (Xie, left column of p. 2, “I. Introduction,” first full paragraph, teaches “[o]ur method uses video prediction to reason about potential robot actions, constructing plans to manipulate novel objects on the fly, in less than a second [(that is, to generate a plan)]”; Xie, left column of p. 6, “B. Test-Time Control, 2) Planning with Demonstration Guidance,” first paragraph, teaches that “[p]lanning with [guided visual foresight (GVF)] at test time is illustrated in Figure 2 (right) and Algorithm 1. The user first specifies the task by clicking on the pixels that shall be moved and the corresponding goal-pixels. The planner searches for actions using the cross entropy method (CEM) [23], a common iterative sampling-based optimization procedure [(that is, “the planner” is performing a plurality of planning iterations to generate plan data)]”); that indicates a respective value to performing the task of the agent performing each of the set of actions in the environment and starting from the current environment state (Xie, left column of p. 4, “IV. Demonstration-Guided Visual Planning,” first & second paragraphs, teaches that, “[f]or example, the user might specify that three pieces of trash need to be moved to a location within a dustpan. Then, the current observation is passed to the action proposal model [(that is, starting from the current environment state)], which returns a sampling distribution that is used to sample a certain number of action sequences. . . . We feed each of the sampled action sequences into the video prediction model to predict their outcome as a video. We then rank these predictions using a cost function [(that is, “cost” indicates a respective value to performing the task)] determined by the human-specified goal, and refine the best samples further. Lastly, the robot recomputes action plans after several control cycles. . . . [D]emonstrations can be effectively used to guide the planning process towards tool-related behaviors, while the predictive model is used to fully construct and refine a sequence of actions for completing the task [(that is, of the agent performing each of the set of actions in the environment and starting from the current environment state)]”),
[(b.1)] wherein performing each planning iteration comprises:
[(b.1.1)] selecting a sequence of actions to be performed by the agent starting from the current environment state based on outputs (Xie, right column of p. 3, “IV. Demonstration-Guided Visual Planning,” second paragraph, teaches “[w]e use [kinesthetic demonstration] data to train an action proposal model to obtain a distribution over action sequences [(that is, “action sequences” is selecting a sequence of actions to be performed by the agent)] conditioned on the initial image [(that is, starting from the current environment state)] based on actions taken by the demonstrator”) generated by:
[(b.1.1.1)] (i) a dynamics model (Xie, right column of p. 5, “4) Predictive Model Training,” first paragraph, teaches that “[t]he forward pass of the dynamics model is summarized in the following two equations: [image: media_image1.png (the two forward-pass equations of the dynamics model)]. The model is trained with stochastic gradient descent using an ℓ2 image reconstruction loss” [(that is, a dynamics model)]) that is configured to receive as input
[(b.1.1.1.1)] a) a hidden state corresponding to an input environment state and b) an input action from the set of actions (Xie, left column at p. 11, “A. Video Prediction Model Implementation Details,” first paragraph, teaches that “[a]t every time-step an action a_t [(that is, an input action from the set of actions)] is passed into the model along with the hidden state h_t, [(that is, to receive as input a hidden state corresponding to an input environment state)] producing a new state h_{t+1} and a flow field F̂_{t+1←t} which is used to transform the image via bi-linear sampling”) and
[(b.1.1.1.2)] to generate as output at least a hidden state corresponding to a predicted next environment state that the environment would transition into if the agent performed the input action when the environment is in the input environment state (Xie, left column at p. 11, “A. Video Prediction Model Implementation Details,” first paragraph, teaches that “[a]t every time-step an action a_t is passed into the model along with the hidden state h_t, producing a new state h_{t+1} and a flow field F̂_{t+1←t}” [(that is, “a new state h_{t+1}” is a dynamics model . . . to generate as output at least a hidden state corresponding to a predicted next environment state that the environment would transition into if the agent performed the input action when the environment is in the input environment state)]); and
[(b.1.1.2)] (ii) a prediction model (Xie, left column at p. 6, “2) Planning with Demonstration Guidance,” first paragraph, teaches “After rolling out the video prediction model f_γ [(that is, a prediction model)] using Equation 7 [image: media_image2.png (Equation 7)] we obtain M different predicted probability distributions P̂_{1:H}^m, which are ranked using the cost function c”) that is configured to
[(b.1.1.2.1)] receive as input the hidden state corresponding to the predicted next environment state (Xie, right column of p. 5, “4) Predictive Model Training,” first paragraph, teaches “[t]he model, which is implemented as a recurrent convolutional neural network f_γ parameterized by γ, has a hidden state h_t and takes in a previous image and an action at each step of the rollout”; Xie, left column of p. 11, “VII. Appendix, A. Video Prediction Model Implementation Details,” first paragraph, teaches “[a]t every time-step an action a_t is passed into the model along with the hidden state h_t, producing a new state h_{t+1} [(that is, as input the hidden state)] and a flow field F̂_{t+1←t} [(that is, a prediction model that is configured to receive as input the hidden state corresponding to the predicted next environment state)] which is used to transform the image via bi-linear sampling”) and to
[(b.1.1.2.2)] generate as output
[(b.1.1.2.2.1)] a) a predicted policy output that defines a score distribution over the set of actions (Xie, Algorithm 1 Guided Visual Foresight (test time), teaches a guided visual foresight (GVF) model [Examiner annotations in dashed-line text boxes]:
[image: media_image3.png (Xie, Algorithm 1 Guided Visual Foresight (test time), with Examiner annotations in dashed-line text boxes)]
Xie, left column of p. 6, “2) Planning with Demonstration Guidance,” first paragraph, teaches “After rolling out the video prediction model f_γ using Equation 7 we obtain M different predicted probability distributions P̂_{1:H}^m, which are ranked using the cost function c. We then fit a Gaussian distribution to the best k action samples (see line 10). In later CEM iterations, actions are sampled from the fitted Gaussians (line 7) [(that is, a) a predicted policy output that defines a score distribution over the set of actions)]”) and
* * *
and [(c)] selecting, from the set of actions, an action to be performed by the agent in response to the current observation based on the generated plan data (Xie, Fig. 2, teaches an action-conditioned video prediction model [Examiner annotations in dashed-line text boxes]:
[image: media_image4.png (Xie, Fig. 2, with Examiner annotations in dashed-line text boxes)]
Xie, Fig. 2 caption, teaches “[o]ur guided visual foresight (GVF) approach, at training time (left) and test time (right). Our method incorporates demonstrations and autonomous data collection to learn a video prediction model and action proposal model that enable the robot to solve both a diverse range of goals that require tool use [(that is, selecting, from the set of actions, an action to be performed by the agent in response to the current observation based on the generated plan data)]. We incorporate the action proposal model both for training data for the video prediction model and for improving the sampling-based planner at test time. The test time procedure is further detailed in Algorithm 1 [shown above]”).
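For illustration of the iterative sampling-based cross entropy method (CEM) planning procedure quoted above from Xie, the following is a minimal Python sketch; the function names predict_outcomes and cost and all parameter values are hypothetical placeholders for exposition and are not drawn from Xie:

import numpy as np

def cem_plan(mean, std, predict_outcomes, cost, iters=3, samples=200, elite=20):
    # Iterative sampling-based planner: sample action sequences around the
    # current Gaussian, rank them by predicted cost, and refit the Gaussian
    # to the best (elite) samples, as in CEM.
    for _ in range(iters):
        actions = np.random.normal(mean, std, size=(samples,) + mean.shape)
        predictions = predict_outcomes(actions)             # roll out the learned model
        order = np.argsort([cost(p) for p in predictions])  # e.g., pixel distance to goal
        best = actions[order[:elite]]
        mean, std = best.mean(axis=0), best.std(axis=0) + 1e-6
    return mean  # refined action sequence; the agent executes its first action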
Though Xie teaches a pixel distance cost function that evaluates how far a designated pixel is from the goal pixels, Xie, however, does not explicitly teach –
* * *
[(b.1.1.2) (ii) a prediction model that is configured to
* * *
(b.1.1.2.2) generate as output]
* * *
[(b.1.1.2.2.2)] b) a value output that represents a value of the environment being in the predicted next environment state to performing the task; and
* * *
But Arel teaches -
* * *
[(b.1.1.2) (ii) a prediction model that is configured to
* * *
(b.1.1.2.2) generate as output]
* * *
[(b.1.1.2.2.2)] b) a value output that represents a value of the environment being in the predicted next environment state to performing the task (Arel ¶ 0083 “the system instead uses state value supervised learning models that are configured to receive a state representation representing a given state and to generate a state value estimate [(that is, a value output)] that is an estimate of the long-term value of the environment having transitioned into the given state, e.g., of the return received starting from the environment being in the state [(that is, a value output that represents a value of the environment being in the predicted next environment state to performing the task)]. For example, the system can use these state value supervised learning models [(that is, a prediction model)] in conjunction with a transition model [(that is, a dynamics model)] that receives a state and an action as input and predicts a state that is most likely to be the state the environment transitions into as a result of the actor performing the action in response to the given state representation to select the action to be performed by the agent”); and
* * *
Xie and Arel are from the same or similar field of endeavor. Xie teaches training a model with both a visual and physical understanding of multi-object interactions by combining diverse demonstration data with self-supervised interaction data to build generalizable models and the demonstration data to guide a model-based RL planner to solve complex tasks. Arel teaches a reinforcement learning system that selects actions to be performed by the agent in order to maximize a return, which is a function of immediate rewards.
Thus, it would have been obvious to a person having ordinary skill in the art as of the effective filing date of the Applicant’s invention to modify Xie pertaining to a model-based RL planner with the immediate rewards expectation of Arel.
The motivation to do so is because “supervised learning models can be trained in a scalable manner to effectively select actions in response to new state representations without adversely affecting their performance when the environment is in other states.” (Arel ¶ 0007).
Regarding claims 12 and 25, the combination of Xie and Arel teaches all of the limitations of claims 11 and 23, as described above in detail.
Arel teaches -
[(b.1.1.1.3)] wherein the dynamics model (see above, where Arel ¶ 0083 teaches a transition model [(that is, a dynamics model)]) also generates as output a predicted immediate reward value that represents an immediate reward that would be received if the agent performed the input action when the environment is in the input environment state (Arel ¶ 0019 teaches “[g]enerally, the reward is a numeric value that is received from the environment as it transitions into a given state and is a function of the state of the environment. While the agent is interacting with the environment, the reinforcement learning system selects actions to be performed by the agent in order to maximize the expected return. Generally, the expected return is a function of the rewards anticipated to be received over time in response to future actions performed by the agent. That is, the return is a function of future rewards [(that is, “future rewards” are a predicted immediate reward value)] received starting from the immediate reward received in response to the agent performing the selected action. For example, possible definitions of return that the reinforcement learning system attempts to maximize may include a sum of the future rewards, a discounted sum of the future rewards, or an average of the future rewards [(that is, output a predicted immediate reward value that represents an immediate reward that would be received if the agent performed the input action when the environment is in the input environment state)]”),
[(b.1.1.1.3.1)] wherein the immediate reward value is a numerical value that represents a progress in completing the task as a result of performing the input action when the environment is in the input environment state (Arel ¶ 0019 teaches “[g]enerally, the reward is a numeric value that is received from the environment as it transitions into a given state and is a function of the state of the environment. While the agent is interacting with the environment, the reinforcement learning system selects actions to be performed by the agent in order to maximize the expected return. Generally, the expected return is a function of the rewards anticipated to be received over time in response to future actions performed by the agent [(that is, “rewards anticipated to be received over time” is a numerical value that represents a progress in completing the task)]. That is, the return is a function of future rewards received starting from the immediate reward received in response to the agent performing the selected action [(that is, as a result of performing the input action when the environment is in the input environment state)]”).
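To illustrate the arrangement the rejection reads onto the combination, namely a dynamics model that takes a hidden state and an action and outputs both a next hidden state and a predicted immediate reward, the following is a minimal, hypothetical PyTorch-style sketch; the architecture, dimensions, and names are assumptions for exposition and do not represent the models of Xie, Arel, or the Applicant:

import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    # Maps (hidden state, action) -> (next hidden state, predicted immediate reward).
    def __init__(self, hidden_dim, action_dim):
        super().__init__()
        self.core = nn.Sequential(
            nn.Linear(hidden_dim + action_dim, hidden_dim), nn.ReLU())
        self.next_state_head = nn.Linear(hidden_dim, hidden_dim)
        self.reward_head = nn.Linear(hidden_dim, 1)  # scalar immediate reward

    def forward(self, hidden_state, action):
        x = self.core(torch.cat([hidden_state, action], dim=-1))
        return self.next_state_head(x), self.reward_head(x)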
12. Claims 13-16 and 26-29 are rejected under 35 U.S.C. § 103 as being unpatentable over Xie et al., “Improvisation through Physical Understanding: Using Novel Objects as Tools with Visual Foresight,” arXiv (2019) [hereinafter Xie] in view of US Published Application 20170213150 to Arel et al. [hereinafter Arel] and Liang et al., “VMAV-C: A Deep Attention-based Reinforcement Learning Algorithm for Model-based Control,” arXiv (2018) [hereinafter Liang].
Regarding claims 13 and 26, the combination of Xie and Arel teaches all of the limitations of claims 11 and 23, as described above in detail.
Though Xie and Arel teach a model receiving a state representation to generate an output for a state-action pair, the combination of Xie and Arel, however, does not explicitly teach –
[(b.1)] wherein selecting the sequence of actions further comprises selecting the sequence of action based on:
[(b.1.3)] outputs generated by a representation model that is configured to receive a representation input comprising the current observation and to generate as output a hidden state corresponding to the current state of the environment.
But Liang teaches -
[(b.1)] wherein selecting the sequence of actions further comprises selecting the sequence of action based on:
[(b.1.3)] outputs generated by a representation model that is configured to receive a representation input comprising the current observation and to generate as output a hidden state corresponding to the current state of the environment (Liang, Fig. 5, teaches an attention-based value function representation model [Examiner annotations in dashed-line text boxes]:
[image: media_image5.png (Liang, Fig. 5, attention-based value function representation model, with Examiner annotations in dashed-line text boxes)]
Liang at p. 10, “Involvement of [Attention-based Value function (AVF)],” first paragraph, teaches “given hidden information of some t-length sequence H = [h_1, h_2, . . ., h_t], the context vector v for predicting time step serves as the embedding information for historical sequence and is computed as the weighted sum of hidden information in such time step v = Σ_{i=1}^{t} α_i h_i [(that is, the “output v” includes “hidden-state h_i,” which is to generate as output a hidden state corresponding to the current state of the environment)]”).
Xie, Arel, and Liang are from the same or similar field of endeavor. Xie teaches training a model with both a visual and physical understanding of multi-object interactions by combining diverse demonstration data with self-supervised interaction data to build generalizable models and the demonstration data to guide a model-based RL planner to solve complex tasks. Arel teaches a reinforcement learning system that selects actions to be performed by the agent in order to maximize a return, which is a function of immediate rewards. Liang teaches a model-based reinforcement learning algorithm with attention mechanism embedded.
Thus, it would have been obvious to a person having ordinary skill in the art as of the effective filing date of the Applicant’s invention to modify the combination of Xie and Arel pertaining to a model-based RL planner implementing an Attention-based Value Function (AVF) model of Liang.
The motivation to do so is because in “[universal complex tasks, the bottlenecks encountered includes] the low efficiency of data utilization in model-free reinforcement algorithms . . . . In contrast, the model-based reinforcement learning algorithms can reveal underlying dynamics in learning environments and seldom suffer the data utilization problem. To address the problem, a model-based reinforcement learning algorithm with attention mechanism embedded is proposed as an extension of World Models [in which an] agent can learn optimal policies through less interactions with actual environment, and final experiments demonstrate the effectiveness of our model in control problem.” (Liang, Abstract).
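For illustration of the attention-weighted context vector v = Σ_{i=1}^{t} α_i h_i quoted from Liang above, the following is a minimal Python sketch; the particular weight parameterization (a learned scoring vector followed by a softmax) is an assumption for exposition, not Liang's exact network:

import torch
import torch.nn.functional as F

def attention_context(hidden_states, score_vector):
    # hidden_states: tensor of shape (t, d) holding h_1 .. h_t;
    # score_vector: learned parameter of shape (d,) used to score each step.
    scores = hidden_states @ score_vector            # (t,) raw attention scores
    alpha = F.softmax(scores, dim=0)                 # normalized weights alpha_i
    return (alpha.unsqueeze(1) * hidden_states).sum(dim=0)  # v = sum_i alpha_i h_i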
Regarding claims 14 and 27, the combination of Xie, Arel, and Liang teaches all of the limitations of claims 13 and 26, as described above in detail.
Liang teaches -
[(b.1.3)] wherein the representation input further comprises
[(b.1.3.1)] one or more previous observations characterizing one or more previous states that the environment transitioned into prior to the current state (Liang at p. 10, “Involvement of AVF,” second paragraph, teaches “[i]n Critic network of [actor-critic] algorithm, hidden information of each time illustrated in Fig 5 comes from MDN-RNN route layer. And historical information in former n time steps is utilized for current state value estimation. . . . For the state value function estimation, it is required to combine both context information 𝑐𝑡 derived from previously hidden information {ht−1, ht−2, . . , ht−n} and current state information [(that is, one or more previous observations characterizing one or more previous states that the environment transitioned into prior to the current state)] as
[image: media_image6.png (Liang's equation for the state value estimate as a function of z_t and c_t)]
Where z_t is the latent representation of state in time step t, c_t is the context vector with attention, [. , .] is the concatenation of vectors and {W, b, W_v, b_v} is the set of parameters to learn in attention-based value neural network. Fig 5 [as earlier shown above] reveals the learning process of state value v.”)
Regarding claims 15 and 28, the combination of Xie, Arel, and Liang teaches all of the limitations of claims 13 and 26, as described above in detail.
Liang teaches -
wherein [(b.1.3)] the representation model, [(b.1.1)] the dynamics model, and [(b.1.2)] the prediction model are jointly trained end-to-end on sampled trajectories from a set of trajectory data (Liang at p. 12, “4.3.1 Pretraining Details,” first paragraph, teaches “The purpose of VAE [(that is, the dynamics model)], MDN-RNN [(that is, the prediction model)], AVF [(that is, the representation model)] and Controller is to learn representations of states and the dynamic transitions in the environment at the same time, but massive parameters and complexity of network structures make it tough and time-consuming to train VMAV-C. Hence, synchronously pretraining VMAV [(that is, are jointly trained end-to-end)] is the required step in our experiments. To achieve this aim, we collect 2000 episodes with random policy strategy through a series of interactions with actual environment as {episode = {(xt, at, xt+1, rt+1, dt+1)}} in Step 0. These rollouts/screenshots of the environment serve as the training dataset for VAE, and we assume the sampling has approximately encompassed the dynamic information of the environment, especially the state representations and concerning transitions [(that is, jointly trained end-to-end on sampled trajectories from a set of trajectory data)]”).
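For illustration of what “jointly trained end-to-end on sampled trajectories” is read to require, namely a single objective whose gradients update all of the models together, the following is a minimal, hypothetical Python sketch; the module shapes, data, and losses are placeholders for exposition, not the training setup of Liang or the Applicant:

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-ins for the three models (dimensions chosen arbitrarily).
representation = nn.Linear(16, 32)   # observation -> hidden state
dynamics = nn.Linear(32 + 4, 32)     # (hidden state, action) -> next hidden state
prediction = nn.Linear(32, 6)        # hidden state -> 5 policy logits + 1 value

# One optimizer over the union of all parameters: a single loss updates
# every model, which is the sense of "jointly trained end-to-end."
optimizer = torch.optim.Adam(
    [p for m in (representation, dynamics, prediction) for p in m.parameters()])

obs = torch.randn(8, 16)                   # batch of observations from sampled trajectories
act = torch.randn(8, 4)                    # actions taken at those steps
target_policy = torch.randint(0, 5, (8,))  # actual actions/policies observed
target_value = torch.randn(8, 1)           # actual returns observed

hidden = representation(obs)
next_hidden = dynamics(torch.cat([hidden, act], dim=-1))
out = prediction(next_hidden)
policy_logits, value = out[:, :5], out[:, 5:]
loss = F.cross_entropy(policy_logits, target_policy) + F.mse_loss(value, target_value)
optimizer.zero_grad()
loss.backward()   # gradients flow through all three models jointly
optimizer.step()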
Regarding claims 16 and 29, the combination of Xie, Arel, and Liang teaches all of the limitations of claims 15 and 28, as described above in detail.
Liang teaches
wherein [(b.1.3)] the representation model, [(b.1.1)] the dynamics model, and [(b.1.2)] the prediction model are jointly trained end-to-end (Liang at p. 12, “4.3.1 Pretraining Details,” first paragraph, teaches “The purpose of VAE [(that is, the dynamics model)], MDN-RNN [(that is, the prediction model)], AVF [(that is, the representation model)] and Controller is to learn representations of states and the dynamic transitions in the environment at the same time, but massive parameters and complexity of network structures make it tough and time-consuming to train VMAV-C. Hence, synchronously pretraining VMAV [(that is, are jointly trained end-to-end)] is the required step in our experiments”) on an objective that measures (Liang at p. 4, “3. Background,” first paragraph, teaches “The main goal of reinforcement learning is to capture some policy to maximize the cumulative rewards, which means selecting proper action given some states [(that is, the “policy π” is jointly trained end-to-end on an objective that measures)]”), for each of a plurality of particular observations:
for each of one or more subsequent states that follow the state represented by the particular observation in the trajectory:
(i) a policy error between the predicted policy output for the subsequent state generated conditioned on the particular observation and an actual policy that was used to select an action in response to the observation (Liang at p. 5, “Proximal Policy Optimization (PPO),” second paragraph, teaches “in which Kullback-Leibler (KL) divergence between old policy and updated policy is considered in objective function [(that is, “divergence” is a policy error)], and the KL divergence in each state point can be bounded as well”; Liang at p. 5, “Proximal Policy Optimization (PPO),” second paragraph, teaches a “surrogate loss function in original TRPO can be formulated as [image: media_image7.png (the TRPO surrogate loss function)] Where π is some stochastic policy, π_{θ_old} is the parameters in policy in last time [(that is, an actual policy that was used to select an action in response to the observation, and π_θ is the predicted policy output for the subsequent state generated conditioned on the particular observation)], and A_t estimates the advantage function of performing a_t conditioned on the state s_t at time step t”), and
* * *
Arel teaches –
* * *
for each of one or more subsequent states that follow the state represented by the particular observation in the trajectory:
* * *
(ii) a value error between the value predicted for the subsequent state generated conditioned on the particular observation and an actual return received starting from the subsequent state (Arel ¶ 0059 teaches “[t]he system can measure the performance of the supervised learning models as of a given sequence representation [(that is, “a given sequence,” is starting from the subsequent state)] based on estimation errors between the value function estimates and the actual returns for the state representations before the given state representation in the sequence. The estimation error may be the difference between the value function estimate and the actual return, the square of the difference between the value function estimate and the actual return, or any other appropriate machine learning error measure for the models [(that is, for each of one or more subsequent states that follow the state represented by the particular observation in the trajectory: . . . (ii) a value error between the value predicted for the subsequent state generated conditioned on the particular observation and an actual return received starting from the subsequent state)]”).
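One conventional way of writing an objective that measures both a policy error and a value error at each of one or more subsequent states, offered for illustration only and not asserted to be the Applicant's or the references' exact objective, is:

    L(θ) = Σ_{k=1}^{K} [ ℓ^p(π_{t+k}, p_{t+k}) + ℓ^v(z_{t+k}, v_{t+k}) ]

where p_{t+k} and v_{t+k} are the policy output and value predicted for the k-th subsequent state conditioned on the observation at time t, π_{t+k} is the actual policy used to select the action at that state, z_{t+k} is the actual return received starting from that state, and ℓ^p, ℓ^v are suitable error measures (e.g., a cross-entropy or KL divergence for the policy error and a squared difference for the value error).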
13. Claim 17 is rejected under 35 U.S.C. § 103 as being unpatentable over Xie et al., “Improvisation through Physical Understanding: Using Novel Objects as Tools with Visual Foresight,” arXiv (2019) [hereinafter Xie] in view of US Published Application 20170213150 to Arel et al. [hereinafter Arel], Liang et al., “VMAV-C: A Deep Attention-based Reinforcement Learning Algorithm for Model-based Control,” arXiv (2018) [hereinafter Liang], and Mongillo et al., “The Misbehavior of Reinforcement Learning,” IEEE (2014) [hereinafter Mongillo].
Regarding claim 17, the combination of Xie, Arel, and Liang teaches all of the limitations of claim 16, as described above in detail.
Though Xie, Arel, and Liang teach reinforcement learning having measures on predictive performance, the combination of Xie, Arel, and Liang, however, does not explicitly teach –
wherein the objective also measures, for each of the plurality of particular observations:
for each of the one or more subsequent states that follow the state represented by the particular observation in the trajectory:
a reward error between the predicted immediate reward for the subsequent state generated conditioned on the particular observation and an actual immediate reward corresponding to the subsequent state.
But Mongillo teaches –
wherein the objective also measures, for each of the plurality of particular observations:
for each of the one or more subsequent states that follow the state represented by the particular observation in the trajectory:
a reward error between the predicted immediate reward for the subsequent state generated conditioned on the particular observation and an actual immediate reward corresponding to the subsequent state (Mongillo, right column of p. 530, “A. Value-Based Learning,” first paragraph, teaches “[w]e consider the sequence of states, actions, and rewards of an agent interacting with an environment. In each cycle, the agent, being in state s and taking action a, updates its estimate of the corresponding state–action value function Q(a,s) according to
[media_image8.png: the state–action value update, in standard notation Q(a, s) ← Q(a, s) + η · δ]
where η > 0 is the learning rate, and δ ≡ r + Q(a’, s’) − Q(a, s) is the reward prediction error (RPE) [(that is, a reward error)], with a’ and s’ being the next action and the next state, respectively [(that is, the predicted immediate reward for the subsequent state generated conditioned on the particular observation)], and r is the obtained reward in the cycle [(that is, “Q(a,s)” and the “obtained reward” r are an actual immediate reward corresponding to the subsequent state)]. The RPE is a basic quantity that plays a central role in all value-based methods. Roughly speaking, it is a measure of how good the agent is at predicting the consequences of its behavior”).
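For illustration only, the quoted update and its reward prediction error may be sketched in Python as follows, assuming a tabular Q stored as a dictionary keyed by (action, state) pairs; the identifiers are hypothetical.

def sarsa_update(Q, s, a, r, s_next, a_next, eta=0.1):
    """Apply Q(a, s) <- Q(a, s) + eta * delta, as quoted from Mongillo."""
    delta = r + Q[(a_next, s_next)] - Q[(a, s)]  # reward prediction error (RPE)
    Q[(a, s)] += eta * delta
    return delta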
Xie, Arel, Liang, and Mongillo are from the same or similar field of endeavor. Xie teaches training a model with both a visual and physical understanding of multi-object interactions, combining diverse demonstration data with self-supervised interaction data to build generalizable models and using the demonstration data to guide a model-based RL planner to solve complex tasks. Arel teaches a reinforcement learning system that selects actions to be performed by the agent in order to maximize a return, which is a function of immediate rewards. Liang teaches a model-based reinforcement learning algorithm with an attention mechanism embedded. Mongillo teaches that reinforcement learning is a collection of methods devised to find the optimal policy, a (possibly stochastic) mapping from observations to actions that realizes the goal of the agent, which may receive only partial information due to hidden state.
Thus, it would have been obvious to a person having ordinary skill in the art as of the effective filing date of the Applicant’s invention to modify the combination of Xie, Arel, and Liang, which pertains to a model-based RL planner implementing an Attention-based Value Function (AVF) model, to incorporate the reward prediction error of Mongillo.
The motivation to do so is that “[reward prediction error (RPE)] is a basic quantity that plays a central role in all value-based methods. Roughly speaking, it is a measure of how good the agent is at predicting the consequences of its behavior”. (Mongillo, right column of p. 530, “A. Value-Based Learning,” first paragraph).
14. Claim 18 is rejected under 35 U.S.C. § 103 as being unpatentable over Xie et al., “Improvisation through Physical Understanding: Using Novel Objects as Tools with Visual Foresight,” arXiv (2019) [hereinafter Xie] in view of US Published Application 20170213150 to Arel et al. [hereinafter Arel], Liang et al., “VMAV-C: A Deep Attention-based Reinforcement Learning Algorithm for Model-based Control,” arXiv (2018) [hereinafter Liang], and Ming et al., “Understanding Hidden Memories of Recurrent Neural Networks,” IEEE (2017) [hereinafter Ming].
Regarding claim 18, the combination of Xie, Arel, and Liang teaches all of the limitations of claim 15, as described above in detail.
Though Xie, Arel, and Liang teach joint training with each of the system models, the combination of Xie, Arel, and Liang do not explicitly teach -
wherein [(b.1.1)] the dynamics model and [(b.1.3)] the representation model are not trained to model semantics of the environment through the hidden states.
But Ming teaches -
wherein [(b.1.1)] the dynamics model and [(b.1.3)] the representation model are not trained to model semantics of the environment through the hidden states (Ming, left column at p. 14, “1. Introduction,” first full paragraph, teaches “we found that semantic information in hidden states is highly distributed, i.e., each input word will generally result in changes [in] almost every hidden state unit”; Ming, right column of p. 14, “2.1 Understanding RNNs,” fourth paragraph, teaches “two attention-based models [(that is, the dynamics model and the representation model)] in image captioning [(that is, “captioning” is semantics)], which revealed the reasons behind the effectiveness of their models. Although the attention mechanism can benefit the interpretation without extra effort, it requires jointly training different models or modifying the original model, which limits its application in general RNN models [(that is, without “jointly training,” the models are not trained to model semantics of the environment through the hidden states)]”).
Xie, Arel, Liang, and Ming are from the same or similar field of endeavor. Xie teaches training a model with both a visual and physical understanding of multi-object interactions, combining diverse demonstration data with self-supervised interaction data to build generalizable models and using the demonstration data to guide a model-based RL planner to solve complex tasks. Arel teaches a reinforcement learning system that selects actions to be performed by the agent in order to maximize a return, which is a function of immediate rewards. Liang teaches a model-based reinforcement learning algorithm with an attention mechanism embedded. Ming teaches a glyph-based sequence visualization based on aggregate information to analyze the behavior of an RNN’s hidden states at the sentence level.
Thus, it would have been obvious to a person having ordinary skill in the art as of the effective filing date of the Applicant’s invention to modify the combination of Xie, Arel, and Liang, which pertains to a model-based RL planner implementing an Attention-based Value Function (AVF) model, to incorporate the glyph-based, sentence-level analysis of Ming.
The motivation to do so is that “a glyph-based sequence visualization based on aggregate information to analyze the behavior of an RNN’s hidden state at the sentence-level [provides] usability and effectiveness . . . demonstrated through case studies and reviews from domain experts.” (Ming, Abstract).
15. Claim 19 is rejected under 35 U.S.C. § 103 as being unpatentable over Xie et al., “Improvisation through Physical Understanding: Using Novel Objects as Tools with Visual Foresight,” arXiv (2019) [hereinafter Xie] in view of US Published Application 20170213150 to Arel et al. [hereinafter Arel], Liang et al., “VMAV-C: A Deep Attention-based Reinforcement Learning Algorithm for Model-based Control,” arXiv (2018) [hereinafter Liang], and François-Lavet et al., “An Introduction to Deep Reinforcement Learning,” arXiv (2019) [hereinafter François].
Regarding claim 19, the combination of Xie, Arel, and Liang teaches all of the limitations of claim 15, as described above in detail.
Though Xie, Arel, and Liang teach the use of a return function relating to the agent performing selected actions in response to a state representation, the combination of Xie, Arel, and Liang does not explicitly teach –
wherein the actual return starting from the subsequent state is a bootstrapped n-step return.
But François teaches -
wherein the actual return starting from the subsequent state is a bootstrapped n-step return (François at p. 33, “To Bootstrap or not to Bootstrap?,” first paragraph, teaches “Bootstrapping has both advantages and disadvantages. On the negative side, using pure bootstrapping methods (such as in DQN) are prone to instabilities when combined with function approximation because they make recursive use of their own value estimate at the next time-step. On the contrary, methods such as n-step Q-learning rely less on their own value estimate [(that is, return)] because the estimate used is decayed by γ^n for the nth step backup [(that is, the actual return starting from the subsequent state is a bootstrapped n-step return)]”).
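For illustration only, a bootstrapped n-step return consistent with the quoted passage may be sketched in Python as follows, where the bootstrapping estimate is decayed by γ^n; the function and argument names are hypothetical.

def n_step_return(rewards, V, state_after_n, n, gamma=0.99):
    """G = r_0 + gamma*r_1 + ... + gamma**(n-1)*r_(n-1) + gamma**n * V(s_n)."""
    G = sum(gamma ** i * rewards[i] for i in range(n))  # n observed rewards
    return G + gamma ** n * V(state_after_n)            # bootstrap, decayed by gamma**n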
Xie, Arel, Liang, and François are from the same or similar field of endeavor. Xie teaches training a model with both a visual and physical understanding of multi-object interactions, combining diverse demonstration data with self-supervised interaction data to build generalizable models and using the demonstration data to guide a model-based RL planner to solve complex tasks. Arel teaches a reinforcement learning system that selects actions to be performed by the agent in order to maximize a return, which is a function of immediate rewards. Liang teaches a model-based reinforcement learning algorithm with an attention mechanism embedded. François teaches the use of a target value to update Q-network parameters based on the immediate reward and the following steps in the return.
Thus, it would have been obvious to a person having ordinary skill in the art as of the effective filing date of the Applicant’s invention to modify the combination of Xie, Arel, and Liang, which pertains to a model-based RL planner implementing an Attention-based Value Function (AVF) model, to incorporate the bootstrapped return of François.
The motivation to do so is that “[b]ootstrapping also has advantages [of allowing] learning from off-policy samples.” (François at p. 33, “To Bootstrap or not to Bootstrap?,” second paragraph).
16. Claims 20-22 and 30 are rejected under 35 U.S.C. § 103 as being unpatentable over Xie et al., “Improvisation through Physical Understanding: Using Novel Objects as Tools with Visual Foresight,” arXiv (2019) [hereinafter Xie] in view of US Published Application 20170213150 to Arel et al. [hereinafter Arel], and François-Lavet et al., “An Introduction to Deep Reinforcement Learning,” arXiv (2019) [hereinafter François].
Regarding claim 20, the combination of Xie and Arel teaches all of the limitations of claim 11, as described above in detail.
Though Xie and Arel teach the use of hidden states in reinforcement learning, the combination of Xie and Arel does not explicitly teach –
[(c)] wherein selecting, from the set of actions, an action to be performed by the agent in response to the current observation based on the generated plan data comprises
[(c.1)] selecting the action using a markov decision process (MDP) planning algorithm.
But François teaches -
[(c)] wherein selecting, from the set of actions, an action to be performed by the agent in response to the current observation based on the generated plan data comprises
[(c.1)] selecting the action using a markov decision process (MDP) planning algorithm (François, Fig. 10.1, teaches a partially observable Markov decision process (POMDP), where an agent must make action decisions based on observations that do not fully reveal the underlying state (that is, a hidden state) [Examiner annotations in dashed-line text boxes]:
[media_image9.png: François, Fig. 10.1, illustration of a POMDP]
François, Fig. 10.1 caption, teaches the figure is an “[i]llustration of a POMDP. The actual dynamics of the POMDP is depicted in dark while the information that the agent can use to select the action at each step [(that is, selecting the action)] is the whole history Ht depicted in blue. [(that is, selecting the action using a Markov decision process (MDP) planning algorithm)]”).
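For illustration only, action selection over a known MDP model may be sketched in Python as a one-step lookahead; the transition model P, reward model R, and value estimate V are hypothetical stand-ins and are not drawn from the cited references.

def select_action_mdp(state, actions, P, R, V, gamma=0.99):
    """Pick the action maximizing expected reward plus discounted next-state value.

    P(state, action) is assumed to return a dict {next_state: probability}.
    """
    def q(a):
        return R(state, a) + gamma * sum(p * V(s2) for s2, p in P(state, a).items())
    return max(actions, key=q)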
Xie, Arel, and François are from the same or similar field of endeavor. Xie teaches training a model with both a visual and physical understanding of multi-object interactions, combining diverse demonstration data with self-supervised interaction data to build generalizable models and using the demonstration data to guide a model-based RL planner to solve complex tasks. Arel teaches a reinforcement learning system that selects actions to be performed by the agent in order to maximize a return, which is a function of immediate rewards. François teaches the use of an MDP to select agent actions.
Thus, it would have been obvious to a person having ordinary skill in the art as of the effective filing date of the Applicant’s invention to modify the combination of Xie and Arel, which pertains to a model-based RL planner, to incorporate the MDP-based action selection of François.
The motivation to do so is that “deep RL opens up many new applications in domains such as healthcare, robotics, smart grids, finance, and many more. This manuscript provides an introduction to deep reinforcement learning models, algorithms and techniques. Particular focus is on the aspects related to generalization and how deep RL can be used for practical applications.” (François, Abstract).
Regarding claim 21, the combination of Xie, Arel, and François teaches all of the limitations of claim 20, as described above in detail.
François teaches -
wherein selecting the sequence of actions for each planning iteration and selecting the action to be performed by the agent are performed using a monte carlo tree search (MCTS) algorithm (François at p. 47, “6.1.1 Lookahead search,” first and second paragraphs, teaches “[a] lookahead search in an MDP iteratively builds a decision tree where the current state is the root node. . . . Monte-Carlo tree search (MCTS) techniques (Browne et al., 2012) are popular approaches to lookahead search [(that is, wherein selecting the sequence of actions . . . and selecting the action to be performed by the agent are performed using a monte carlo tree search (MCTS) algorithm)]”).
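For illustration only, an MCTS-style lookahead search in the spirit of the quoted passage may be sketched in Python as follows; the simulator step(state, action) -> (next_state, reward), the UCB constant, and the random rollout policy are hypothetical choices, and this toy sketch is not the algorithm of the cited references.

import math
import random

class Node:
    def __init__(self, state):
        self.state = state
        self.children = {}  # action -> Node
        self.visits = 0
        self.total_return = 0.0

def ucb_choice(node, c=1.4):
    # Upper-confidence-bound selection over the node's expanded actions.
    def ucb(a):
        child = node.children[a]
        mean = child.total_return / (child.visits + 1e-8)
        return mean + c * math.sqrt(math.log(node.visits + 1) / (child.visits + 1e-8))
    return max(node.children, key=ucb)

def mcts(root_state, step, actions, iterations=100, depth=10, gamma=0.99):
    root = Node(root_state)
    for _ in range(iterations):
        node, path = root, [root]
        # Selection: descend through fully expanded nodes via UCB.
        while node.children and len(node.children) == len(actions):
            node = node.children[ucb_choice(node)]
            path.append(node)
        # Expansion: add one untried action (its immediate reward is
        # ignored here for brevity).
        untried = [a for a in actions if a not in node.children]
        if untried:
            a = random.choice(untried)
            next_state, _ = step(node.state, a)
            node.children[a] = Node(next_state)
            node = node.children[a]
            path.append(node)
        # Rollout: estimate the leaf's value with random actions.
        ret, state = 0.0, node.state
        for d in range(depth):
            state, r = step(state, random.choice(actions))
            ret += gamma ** d * r
        # Backup: propagate the sampled return along the visited path.
        for n in path:
            n.visits += 1
            n.total_return += ret
    return ucb_choice(root, c=0.0)  # greedy action at the root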
Regarding claim 22, the combination of Xie, Arel, and François teaches all of the limitations of claim 20, as described above in detail.
Arel teaches -
wherein [(c)] selecting, from the set of actions, an action to be performed by the agent in response to the current observation based on the generated plan data comprises:
[(c.2)] determining, from the sequences of actions in the plan data, a sequence of actions that has a maximum associated value output (Arel ¶ 0019 teaches “While the agent is interacting with the environment, the reinforcement learning system selects actions to be performed by the agent in order to maximize the expected return. Generally, the expected return is a function of the rewards anticipated to be received over time in response to future actions performed by the agent. That is, the return is a function of future rewards received starting from the immediate reward received in response to the agent performing the selected action [(that is, determining, from the sequences of actions in the plan data, a sequence of actions that has a maximum associated value output)]”); and
[(c.3)] selecting, as the action to be performed by the agent in response to the current observation, the first action in the determined sequence of actions (Arel, claim 1, teaches “selecting an action to be performed by a computer-implemented agent that interacts with an environment by performing actions selected from a set of actions [(that is, selecting, as the action performed by the agent in response to the current observation, the first action in the determined sequence of actions)]”).
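For illustration only, limitations (c.2) and (c.3), as mapped above, may be sketched in Python as follows, where the plan data is assumed to be a list of (action_sequence, value) pairs; the representation is hypothetical.

def select_first_action(plan_data):
    """(c.2) find the sequence with the maximum associated value output;
    (c.3) return the first action of that sequence."""
    best_sequence, _ = max(plan_data, key=lambda pair: pair[1])
    return best_sequence[0]

For example, select_first_action([(["left", "up"], 0.2), (["right", "up"], 0.9)]) returns "right".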
Regarding claim 30, the combination of Xie and Arel teaches all of the limitations of claim 23, as described above in detail.
Though Xie and Arel teach the use of hidden states in reinforcement learning, the combination of Xie and Arel does not explicitly teach –
wherein selecting the sequence of actions for each planning iteration and selecting the action to be performed by the agent are performed using a monte carlo tree search (MCTS) algorithm.
But François teaches -
wherein selecting the sequence of actions for each planning iteration and selecting the action to be performed by the agent are performed using a monte carlo tree search (MCTS) algorithm (François at p. 47, “6.1.1 Lookahead search,” first and second paragraphs, teaches “[a] lookahead search in an MDP iteratively builds a decision tree where the current state is the root node. . . . Monte-Carlo tree search (MCTS) techniques (Browne et al., 2012) are popular approaches to lookahead search [(that is, wherein selecting the sequence of actions . . . and selecting the action to be performed by the agent are performed using a monte carlo tree search (MCTS) algorithm)]”).
Xie, Arel, and François are from the same or similar field of endeavor. Xie teaches training a model with both a visual and physical understanding of multi-object interactions, combining diverse demonstration data with self-supervised interaction data to build generalizable models and using the demonstration data to guide a model-based RL planner to solve complex tasks. Arel teaches a reinforcement learning system that selects actions to be performed by the agent in order to maximize a return, which is a function of immediate rewards. François teaches the use of an MDP to select agent actions.
Thus, it would have been obvious to a person having ordinary skill in the art as of the effective filing date of the Applicant’s invention to modify the combination of Xie and Arel, which pertains to a model-based RL planner, to incorporate the MDP-based action selection of François.
The motivation to do so is that “deep RL opens up many new applications in domains such as healthcare, robotics, smart grids, finance, and many more. This manuscript provides an introduction to deep reinforcement learning models, algorithms and techniques. Particular focus is on the aspects related to generalization and how deep RL can be used for practical applications.” (François, Abstract).
Conclusion
17. The prior art made of record and not relied upon is considered pertinent to Applicant's disclosure:
(US Patent 9536191 to Arel et al.) teaches methods, systems, and apparatus, including computer programs encoded on computer storage media, for reinforcement learning using confidence scores, including determining a respective value function estimate that is an estimate of a return resulting from the agent performing the action in response to the current observation, determining a respective confidence score that is a measure of confidence that the respective value function estimate for the action is an accurate estimate of the return that will result from the agent performing the action in response to the current observation, adjusting the respective value function estimate for the action using the respective confidence score for the action to determine a respective adjusted value function estimate; and selecting an action to be performed by the agent in response to the current observation using the respective adjusted value function estimates.
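For illustration only, the confidence-score adjustment summarized above may be sketched in Python as follows; the multiplicative adjustment is one hypothetical choice, as the reference does not limit the adjustment to a particular form.

def select_with_confidence(value_estimates, confidence_scores):
    """Adjust each action's value estimate by its confidence score, then
    select the action with the best adjusted estimate."""
    adjusted = {a: value_estimates[a] * confidence_scores[a]  # hypothetical adjustment
                for a in value_estimates}
    return max(adjusted, key=adjusted.get)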
(Mnih et al., “Human-Level Control through Deep Reinforcement Learning,” Nature (2015)) teaches that while reinforcement learning agents have achieved some successes in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games.
18. Any inquiry concerning this communication or earlier communications from the Examiner should be directed to KEVIN L. SMITH whose telephone number is (571) 272-5964. Normally, the Examiner is available on Monday-Thursday 0730-1730.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, Applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the Examiner by telephone are unsuccessful, the Examiner’s supervisor, KAKALI CHAKI can be reached on 571-272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/K.L.S./
Examiner, Art Unit 2122
/KAKALI CHAKI/Supervisory Patent Examiner, Art Unit 2122