Prosecution Insights
Last updated: April 19, 2026
Application No. 18/334,112

JOINTLY LEARNING EXPLORATORY AND NON-EXPLORATORY ACTION SELECTION POLICIES

Non-Final OA: §101, §102, §103, §112
Filed
Jun 13, 2023
Examiner
LEE, MICHAEL CHRISTOPHER
Art Unit
2128
Tech Center
2100 — Computer Architecture & Software
Assignee
DeepMind Technologies Limited
OA Round
1 (Non-Final)
Grant Probability: 59% (Moderate)
OA Rounds: 1-2
Time to Grant: 3y 2m
Grant Probability With Interview: 86%

Examiner Intelligence

Career Allow Rate: 59% of resolved cases (80 granted / 136 resolved; +3.8% vs TC avg)
Interview Lift: +27.1% on resolved cases with interview (strong lift)
Avg Prosecution: 3y 2m typical timeline; 54 applications currently pending
Total Applications: 190 across all art units (career history)

Statute-Specific Performance

§101: 29.1% (-10.9% vs TC avg)
§103: 45.0% (+5.0% vs TC avg)
§102: 11.5% (-28.5% vs TC avg)
§112: 12.3% (-27.7% vs TC avg)
Comparison baseline: Tech Center average estimate • Based on career data from 136 resolved cases

Office Action

Rejections under §101, §102, §103, and §112
DETAILED ACTION

Notice of AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority

Regarding U.S. Patent Application Serial No. 16/881,180, filed May 22, 2020 and U.S. Provisional Patent Application Nos. 62/905,995, filed September 25, 2019, and 62/852,137, filed May 23, 2019, Applicant’s claim for the benefit of a prior-filed application under 35 U.S.C. 119(e) or under 35 U.S.C. 120 is acknowledged.

Information Disclosure Statement

The information disclosure statements submitted on 6/13/2023 and 7/29/2024 have been considered.

Preliminary Amendment

Applicant’s Preliminary Amendment provided on 10/6/2023 has been considered. Claims 1-20 are pending.

Claim Objections

Claim 1 is objected to because of the following informalities: In claim 1, the limitations at line 10 “the action selection network implements...” and line 12 “the plurality of action selection policies” should be further indented. Appropriate correction is required.

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b):

(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

Claims 1-18 are rejected under 35 U.S.C. 112(b) as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor regards as the invention. Claim 1 recites the limitation "the exploration importance factor" in line 8. There is insufficient antecedent basis for this limitation in the claim. For purposes of compact prosecution, this limitation will be interpreted as “an exploration importance factor”, consistent with claims 19 and 20.

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows:

Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-9 and 12-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.

Regarding Step 1 of the Alice/Mayo framework, Claims 1-18 are directed to a method (a process), Claim 19 is directed to a system (a machine), and Claim 20 is directed to one or more non-transitory computer-readable storage media (an article of manufacture), which each fall within one of the four statutory categories of inventions.

Regarding Claim 1

Step 2A, Prong 1 (Is the claim directed to a law of nature, a natural phenomenon or an abstract idea?). Claim 1 recites the following mental processes that, in each case under the broadest reasonable interpretation, cover performance of the limitation in the mind (including an observation, evaluation, judgment, opinion) or with the aid of pencil and paper but for the recitation of generic computer components (e.g., “computers”, “neural network”):

selecting an action to be performed by an agent interacting with an environment (under the broadest reasonable interpretation, this limitation can be performed mentally by a human, for example, a human can select an action to be performed by an agent interacting with an environment, such as an action directing a human agent to explore a particular quadrant of a room)

processing: (i) the observation, and (ii) the exploration importance factor, ... to generate an action selection output (under the broadest reasonable interpretation, this limitation can be performed mentally by a human, for example, a human can select an action to be performed by an agent interacting with an environment, such as an action directing a human agent to explore a particular quadrant of a room, where such selected action is based on an observation (e.g., an interesting object in the quadrant) and an exploration importance factor (e.g., the quadrant has not yet been explored, so the exploration importance factor is high and taken into consideration when determining whether to select this particular action))

... implements a plurality of action selection policies indexed by a set of possible exploration importance factors (under the broadest reasonable interpretation, this limitation can be performed mentally by a human, for example, a human can mentally implement 2 or more action selection policies indexed by a set of possible exploration importance factors, for example, a human can mentally decide between two possible actions, indexed by exploration importance factors of 0 and 1)

the plurality of action selection policies comprise an exploitative action selection policy and one or more exploratory action selection policies; (This limitation further describes the policies. Under the broadest reasonable interpretation, this limitation can be performed mentally by a human, for example, a human can mentally implement 2 or more action selection policies indexed by a set of possible exploration importance factors, for example, a human can mentally decide between exploring a quadrant (having exploration importance factor 1) and performing an exploitative action (such as touching an object, having an exploration importance factor 0))

selecting an action to be performed by the agent using the action selection output (under the broadest reasonable interpretation, this limitation can be performed mentally by a human, for example, a human can select an action to be performed by an agent interacting with an environment, such as an action directing a human agent to explore a particular quadrant of a room)

Step 2A, Prong 2 (Does the claim recite additional elements that integrate the judicial exception into a practical application?). The judicial exception is not integrated into a practical application. In particular, the claim recites the additional elements (e.g., “computers”, “neural network”) which are recited at a high-level of generality such that they amount to no more than mere instructions to apply the exception using a generic computer component (See MPEP 2106.05(f)).

Regarding the “A method performed by one or more computers for..., the method comprising:” limitation, such limitation is recited at a high-level of generality and amounts to no more than adding the words “apply it” (or an equivalent) with the judicial exception. In particular, the claim only recites the additional element of a computer. This additional element is recited at a high-level of generality and amounts to no more than mere instructions to apply the exception using a generic computer component (a computer). Accordingly, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea (See MPEP 2106.05(f)).

Regarding the “receiving an observation characterizing a current state of the environment” limitation, such additional element of a data gathering step is recited at a high level of generality and amounts to extra-solution activity of receiving data, i.e., pre-solution activity of gathering data for use in the claimed process (see MPEP 2106.05(g)).

Regarding the “... using an action selection neural network...” limitation, such limitation is recited at a high-level of generality and amounts to no more than adding the words “apply it” (or an equivalent) with the judicial exception. In particular, the claim only recites the additional element of a neural network. This additional element is recited at a high-level of generality and amounts to no more than mere instructions to apply the exception using a generic computer component (a neural network). Accordingly, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea (See MPEP 2106.05(f)).

Regarding the “the action selection neural network ...;” limitation, such limitation is recited at a high-level of generality and amounts to no more than adding the words “apply it” (or an equivalent) with the judicial exception. In particular, the claim only recites the additional element of a neural network. This additional element is recited at a high-level of generality and amounts to no more than mere instructions to apply the exception using a generic computer component (a neural network). Accordingly, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea (See MPEP 2106.05(f)).

Step 2B (Does the claim recite additional elements that amount to significantly more than the judicial exception?). In accordance with Step 2B, the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above, the additional elements (e.g., “computers”, “neural network”) are recited at a high-level of generality such that they amount to no more than mere instructions to apply the exception using a generic computer component (See MPEP 2106.05(f)).

Regarding the “A method performed by one or more computers for..., the method comprising:” limitation, such limitation is recited at a high-level of generality and amounts to no more than adding the words “apply it” (or an equivalent) with the judicial exception, because the limitation merely provides instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea. Accordingly, this additional element does not add significantly more than the judicial exception. (See MPEP 2106.05(f)).

Regarding the “receiving an observation characterizing a current state of the environment” limitation, as discussed above, the additional element of a data gathering step is recited at a high level of generality and amounts to extra-solution activity of receiving data, i.e., pre-solution activity of gathering data for use in the claimed process. The courts have found limitations directed to obtaining information electronically, recited at a high level of generality, to be well-understood, routine, and conventional (see MPEP 2106.05(d)(II), “receiving or transmitting data over a network”, “electronic record keeping,” and “storing and retrieving information in memory”).

Regarding the “...
using an action selection neural network...” limitation, such limitation is recited at a high-level of generality and amounts to no more than adding the words “apply it” (or an equivalent) with the judicial exception, because the limitation merely provides instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea. Accordingly, this additional element does not add significantly more than the judicial exception. (See MPEP 2106.05(f)). Regarding the “ the action selection neural network ...;” limitation, such limitation is recited at a high-level of generality and amounts to no more than adding the words “apply it” (or an equivalent) with the judicial exception, because the limitation merely provides instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea. Accordingly, this additional element does not add significantly more than the judicial exception. (See MPEP 2106.05(f)). Regarding Claim 2 Step 2A, Prong 1 wherein the exploration importance factor ... is a baseline exploration importance factor causing the agent to implement the exploitative action selection policy. (under the broadest reasonable interpretation, this limitation can be performed mentally by a human, for example, a human can mentally consider the exploration importance factor in the manner recited by this limitation, such that a baseline exploration importance factor means that the human mentally selects an exploitative action and not an exploration action) Step 2A, Prong 2 Regarding the “processed by the action selection neural network” limitation, such limitation is recited at a high-level of generality and amounts to no more than adding the words “apply it” (or an equivalent) with the judicial exception, because the limitation merely provides instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea. Accordingly, this additional element does not add significantly more than the judicial exception. (See MPEP 2106.05(f)). Step 2B Regarding the “processed by the action selection neural network” limitation, such limitation is recited at a high-level of generality and amounts to no more than adding the words “apply it” (or an equivalent) with the judicial exception, because the limitation merely provides instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea. Accordingly, this additional element does not add significantly more than the judicial exception. (See MPEP 2106.05(f)). Regarding Claim 3 Step 2A, Prong 1 wherein baseline exploration importance factor has value zero. (under the broadest reasonable interpretation, this limitation can be performed mentally by a human, for example, a human can mentally consider the exploration importance factor in the manner recited by this limitation, such that a baseline exploration importance factor of zero means that the human mentally selects an exploitative action and not an exploration action) Regarding Step 2A, Prong 2, the claim does not include any additional elements that integrate the judicial exception into a practical application and regarding Step 2B, there are no additional elements recited that amount to significantly more than the judicial exception. 
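To ground the technology at issue in claims 1-3, the arrangement the examiner characterizes above can be illustrated with a minimal sketch: a single action selection network that processes an observation together with an exploration importance factor, so that one set of weights implements a family of policies indexed by the possible factor values. This is illustrative only; the network shape, the factor values, and all names below are assumptions, not the applicant's implementation.

    import numpy as np

    BETAS = [0.0, 0.3, 1.0]                 # assumed discrete set of exploration importance factors
    N_OBS, N_ACTIONS, HIDDEN = 8, 4, 32     # assumed sizes for the sketch

    rng = np.random.default_rng(0)
    W1 = rng.normal(0.0, 0.1, (HIDDEN, N_OBS + 1))   # +1 input so the network is conditioned on beta
    W2 = rng.normal(0.0, 0.1, (N_ACTIONS, HIDDEN))

    def action_selection_output(obs, beta):
        """Process (observation, exploration importance factor) into per-action scores."""
        x = np.concatenate([obs, [beta]])   # condition the same weights on the factor value
        h = np.tanh(W1 @ x)
        return W2 @ h

    def select_action(obs, beta):
        return int(np.argmax(action_selection_output(obs, beta)))

    obs = rng.normal(size=N_OBS)
    exploit_action = select_action(obs, beta=BETAS[0])    # baseline factor: exploitative policy (claims 2-3)
    explore_action = select_action(obs, beta=BETAS[-1])   # larger factor: an exploratory policy

With a factor of zero the same network yields the exploitative policy, which is the behavior recited in claims 2 and 3.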
Regarding Claim 4 Step 2A, Prong 2 Regarding the “wherein the action selection policies have been trained to optimize an overall reward comprising a task reward and an exploration reward” limitation, such limitation is recited at a high-level of generality and amounts to no more than adding the words “apply it” (or an equivalent) with the judicial exception. In particular, the claim only recites the additional element of training a neural network to optimize a particular parameter. This additional element is recited at a high-level of generality and amounts to no more than mere instructions to apply the exception using a generic computer component (generic training of a neural network). Accordingly, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea (See MPEP 2106.05(f)). Step 2B Regarding the “wherein the action selection policies have been trained to optimize an overall reward comprising a task reward and an exploration reward” limitation, such limitation is recited at a high-level of generality and amounts to no more than adding the words “apply it” (or an equivalent) with the judicial exception, because the limitation merely provides instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea. Accordingly, this additional element does not add significantly more than the judicial exception. (See MPEP 2106.05(f)). Regarding Claim 5 Step 2A, Prong 2 Regarding the “wherein the exploitative action selection policy is an action selection policy that is trained to optimize an overall reward independent of the exploration reward” limitation, such limitation is recited at a high-level of generality and amounts to no more than adding the words “apply it” (or an equivalent) with the judicial exception. In particular, the claim only recites the additional element of training a neural network to optimize a particular parameter. This additional element is recited at a high-level of generality and amounts to no more than mere instructions to apply the exception using a generic computer component (generic training of a neural network). Accordingly, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea (See MPEP 2106.05(f)). Step 2B Regarding the “wherein the exploitative action selection policy is an action selection policy that is trained to optimize an overall reward independent of the exploration reward” limitation, such limitation is recited at a high-level of generality and amounts to no more than adding the words “apply it” (or an equivalent) with the judicial exception, because the limitation merely provides instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea. Accordingly, this additional element does not add significantly more than the judicial exception. (See MPEP 2106.05(f)). Regarding Claim 6 Step 2A, Prong 1 wherein the exploitative action selection policy causes the agent to exploit its current knowledge of the environment. 
(under the broadest reasonable interpretation, this limitation can be performed mentally by a human, for example, a human can select an exploitative action, such as picking up an object based on the agent’s current knowledge that there is a nearby object) Regarding Step 2A, Prong 2, the claim does not include any additional elements that integrate the judicial exception into a practical application and regarding Step 2B, there are no additional elements recited that amount to significantly more than the judicial exception. Regarding Claim 7 Step 2A, Prong 1 determining an exploration reward based on: (i) a subsequent observation characterizing a state of the environment after the agent performs the selected action and (ii) one or more prior observations characterizing states of the environment prior to the agent performing the selected action; (under the broadest reasonable interpretation, this limitation can be performed mentally by a human, for example, a human can determine an exploration reward based on the recited criteria) Step 2A, Prong 2 Regarding the “training the action selection neural network using a reinforcement learning technique based at least in part on the exploration reward” limitation, such limitation is recited at a high-level of generality and amounts to no more than adding the words “apply it” (or an equivalent) with the judicial exception. In particular, the claim only recites the additional element of training a neural network using generic reinforcement learning techniques to optimize a particular parameter. This additional element is recited at a high-level of generality and amounts to no more than mere instructions to apply the exception using a generic computer component (generic training of a neural network using generic reinforcement learning techniques). Accordingly, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea (See MPEP 2106.05(f)). Step 2B Regarding the “training the action selection neural network using a reinforcement learning technique based at least in part on the exploration reward” limitation, such limitation is recited at a high-level of generality and amounts to no more than adding the words “apply it” (or an equivalent) with the judicial exception, because the limitation merely provides instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea. Accordingly, this additional element does not add significantly more than the judicial exception. (See MPEP 2106.05(f)). Regarding Claim 8 Step 2A, Prong 1 prior to training the action selection neural network using the reinforcement learning technique based at least in part on the exploration reward: modifying the exploration reward based on the exploration importance factor. (under the broadest reasonable interpretation, this limitation can be performed mentally by a human, for example, a human can modify an exploration reward based on the exploration importance factor (e.g., multiplication) prior to training a neural network) Regarding Step 2A, Prong 2, the claim does not include any additional elements that integrate the judicial exception into a practical application and regarding Step 2B, there are no additional elements recited that amount to significantly more than the judicial exception. 
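Claims 8-10 describe the reward handling in simple arithmetic terms, as the claim 9 and claim 10 discussions that follow make concrete. A minimal sketch, with illustrative names and values rather than the application's actual code:

    def overall_reward(task_reward, exploration_reward, beta):
        # Claims 8-9 as characterized above: modify (here, scale) the exploration
        # reward by the exploration importance factor before it is used for training.
        modified_exploration_reward = beta * exploration_reward
        # Claim 10 as characterized below: combine the task reward and the modified
        # exploration reward into the overall training reward.
        return task_reward + modified_exploration_reward

    # beta = 0 reduces the signal to the task reward alone, i.e. the training
    # signal for the exploitative policy.
    assert overall_reward(task_reward=1.0, exploration_reward=0.5, beta=0.0) == 1.0
    assert overall_reward(task_reward=1.0, exploration_reward=0.5, beta=1.0) == 1.5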
Regarding Claim 9 Step 2A, Prong 1 wherein modifying the exploration reward based on the exploration importance factor comprises: scaling the exploration reward by the exploration importance factor. (under the broadest reasonable interpretation, this limitation can be performed mentally by a human, for example, a human can modify an exploration reward based on the exploration importance factor (e.g., multiplication) prior to training a neural network) Regarding Step 2A, Prong 2, the claim does not include any additional elements that integrate the judicial exception into a practical application and regarding Step 2B, there are no additional elements recited that amount to significantly more than the judicial exception. Regarding Claim 12 Step 2A, Prong 2 Regarding the “wherein the reinforcement learning technique is a Retrace double Q-learning technique with prioritized experience replay” limitation, such limitation amounts to no more than generally linking the use of a judicial exception to a particular technological environment or field of use (reinforcement learning using a particular reinforcement learning technique). As explained by the Supreme Court, a claim directed to a judicial exception cannot be made eligible "simply by having the applicant acquiesce to limiting the reach of the patent for the formula to a particular technological use." Diamond v. Diehr, 450 U.S. 175, 192 n.14, 209 USPQ 1, 10 n. 14 (1981). Thus, limitations that amount to merely indicating a field of use or technological environment in which to apply a judicial exception do not integrate a judicial exception into a practical application. Step 2B Regarding the “wherein the reinforcement learning technique is a Retrace double Q-learning technique with prioritized experience replay” limitation, such limitation amounts to no more than generally linking the use of a judicial exception to a particular technological environment or field of use as explained above, which does not amount to significantly more than the judicial exception. MPEP 2106.05(h). Regarding Claim 13 Step 2A, Prong 1 wherein the set of possible exploration importance factors is a discrete set. (under the broadest reasonable interpretation, this limitation can be performed mentally by a human, for example, a human can mentally consider a discrete set of possible exploration importance factors, e.g., (0,1)) Regarding Step 2A, Prong 2, the claim does not include any additional elements that integrate the judicial exception into a practical application and regarding Step 2B, there are no additional elements recited that amount to significantly more than the judicial exception. Regarding Claim 14 Step 2A, Prong 1 determining the exploration reward based on, for each prior observation, a similarity measure between the embedded representation of the prior observation and the embedded representation of the subsequent observation. 
(under the broadest reasonable interpretation, this limitation can be performed mentally by a human, for example, a human can mentally determine the exploration reward based on a similarity measure such as a cosine similarity between the embedded representations) Step 2A, Prong 2 Regarding the “providing the subsequent observation and the prior observations to an embedding neural network, wherein the embedding neural network is configured to process an input observation to generate an embedded representation of the input observation” limitation, such limitation is recited at a high-level of generality and amounts to no more than adding the words “apply it” (or an equivalent) with the judicial exception. In particular, the claim only recites the additional element of using a neural network to generate a particular type of embedding. This additional element is recited at a high-level of generality and amounts to no more than mere instructions to apply the exception using a generic computer component (a neural network to generate a particular type of embedding). Accordingly, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea (See MPEP 2106.05(f)). Step 2B Regarding the “providing the subsequent observation and the prior observations to an embedding neural network, wherein the embedding neural network is configured to process an input observation to generate an embedded representation of the input observation” limitation, such limitation is recited at a high-level of generality and amounts to no more than adding the words “apply it” (or an equivalent) with the judicial exception, because the limitation merely provides instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea. Accordingly, this additional element does not add significantly more than the judicial exception. (See MPEP 2106.05(f)). Regarding Claim 15 Step 2A, Prong 2 Regarding the “wherein the embedding neural network has been trained to process an input observation to generate an embedded representation of the input observation that characterizes aspects of the state characterized by the input observation that are controllable by the agent” limitation, such limitation is recited at a high-level of generality and amounts to no more than adding the words “apply it” (or an equivalent) with the judicial exception. In particular, the claim only recites the additional element of using a neural network to generate a particular type of embedding. This additional element is recited at a high-level of generality and amounts to no more than mere instructions to apply the exception using a generic computer component (a neural network to generate a particular type of embedding). Accordingly, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea (See MPEP 2106.05(f)). 
Step 2B Regarding the “wherein the embedding neural network has been trained to process an input observation to generate an embedded representation of the input observation that characterizes aspects of the state characterized by the input observation that are controllable by the agent” limitation, such limitation is recited at a high-level of generality and amounts to no more than adding the words “apply it” (or an equivalent) with the judicial exception, because the limitation merely provides instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea. Accordingly, this additional element does not add significantly more than the judicial exception. (See MPEP 2106.05(f)). Regarding Claim 16 Step 2A, Prong 2 Regarding the “wherein training the embedding neural network comprises jointly training the embedding neural network with an action prediction neural network” limitation, such limitation is recited at a high-level of generality and amounts to no more than adding the words “apply it” (or an equivalent) with the judicial exception. In particular, the claim only recites the additional element of jointly training neural networks. This additional element is recited at a high-level of generality and amounts to no more than mere instructions to apply the exception using a generic computer component (generic joint training). Accordingly, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea (See MPEP 2106.05(f)). Regarding the “wherein: the action prediction neural network is configured to generate a prediction for an action that caused the state of the environment to transition from a state characterized by a given observation at a given time step to a state characterized by a subsequent observation at a subsequent time step by processing respective embedded representations of the given observation and the subsequent observation that are generated using the embedding neural network” limitation, such limitation is recited at a high-level of generality and amounts to no more than adding the words “apply it” (or an equivalent) with the judicial exception. In particular, the claim only recites the additional element of a neural network configured to generate a particular output. This additional element is recited at a high-level of generality and amounts to no more than mere instructions to apply the exception using a generic computer component (a generic neural network). Accordingly, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea (See MPEP 2106.05(f)). Step 2B Regarding the “wherein training the embedding neural network comprises jointly training the embedding neural network with an action prediction neural network” limitation, such limitation is recited at a high-level of generality and amounts to no more than adding the words “apply it” (or an equivalent) with the judicial exception, because the limitation merely provides instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea. Accordingly, this additional element does not add significantly more than the judicial exception. (See MPEP 2106.05(f)). 
Regarding the “wherein: the action prediction neural network is configured to generate a prediction for an action that caused the state of the environment to transition from a state characterized by a given observation at a given time step to a state characterized by a subsequent observation at a subsequent time step by processing respective embedded representations of the given observation and the subsequent observation that are generated using the embedding neural network” limitation, such limitation is recited at a high-level of generality and amounts to no more than adding the words “apply it” (or an equivalent) with the judicial exception, because the limitation merely provides instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea. Accordingly, this additional element does not add significantly more than the judicial exception. (See MPEP 2106.05(f)). Regarding Claim 17 Step 2A, Prong 2 Regarding the “wherein training the embedding neural network comprises jointly training the embedding neural network with a state prediction neural network” limitation, such limitation is recited at a high-level of generality and amounts to no more than adding the words “apply it” (or an equivalent) with the judicial exception. In particular, the claim only recites the additional element of jointly training neural networks. This additional element is recited at a high-level of generality and amounts to no more than mere instructions to apply the exception using a generic computer component (generic joint training). Accordingly, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea (See MPEP 2106.05(f)). Regarding the “wherein: the state prediction neural network is configured to process an input comprising: (i) an embedded representation of a given observation that is generated using the embedding neural network and that characterizes the state of the environment at a given time step, and (ii) a representation of an action performed by the agent at the given time step, to generate an output characterizing a predicted subsequent state of the environment at a subsequent time step that follows the given time step” limitation, such limitation is recited at a high-level of generality and amounts to no more than adding the words “apply it” (or an equivalent) with the judicial exception. In particular, the claim only recites the additional element of a neural network configured to generate a particular output. This additional element is recited at a high-level of generality and amounts to no more than mere instructions to apply the exception using a generic computer component (a generic neural network). Accordingly, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea (See MPEP 2106.05(f)). Step 2B Regarding the “wherein training the embedding neural network comprises jointly training the embedding neural network with a state prediction neural network” limitation, such limitation is recited at a high-level of generality and amounts to no more than adding the words “apply it” (or an equivalent) with the judicial exception, because the limitation merely provides instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea. 
Accordingly, this additional element does not add significantly more than the judicial exception. (See MPEP 2106.05(f)). Regarding the “wherein: the state prediction neural network is configured to process an input comprising: (i) an embedded representation of a given observation that is generated using the embedding neural network and that characterizes the state of the environment at a given time step, and (ii) a representation of an action performed by the agent at the given time step, to generate an output characterizing a predicted subsequent state of the environment at a subsequent time step that follows the given time step” limitation, such limitation is recited at a high-level of generality and amounts to no more than adding the words “apply it” (or an equivalent) with the judicial exception, because the limitation merely provides instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea. Accordingly, this additional element does not add significantly more than the judicial exception. (See MPEP 2106.05(f)). Regarding Claim 18 Step 2A, Prong 1 wherein determining the exploration reward based on, for each prior observation, a similarity measure between the embedded representation of the prior observation and the embedded representation of the subsequent observation comprises (under the broadest reasonable interpretation, this limitation can be performed mentally by a human, for example, a human can mentally determine the exploration reward based on a similarity measure, such as a cosine similarity, between the respective embedded representations) determining a combined similarity measure based on the similarity measures between the embedded representation of the subsequent observation and the embedded representations of the prior observations; and (under the broadest reasonable interpretation, this limitation can be performed mentally by a human, for example, a human can mentally determine a similarity measure, such as a cosine similarity, between the respective embedded representations, and then add such cosine similarities together for each prior observation) determining the exploration reward based on an inverse of the combined similarity measure (under the broadest reasonable interpretation, this limitation can be performed mentally by a human, for example, a human can mentally take the inverse of the combined similarity measure, e.g., 1/similarity measure) Regarding Step 2A, Prong 2, the claim does not include any additional elements that integrate the judicial exception into a practical application and regarding Step 2B, there are no additional elements recited that amount to significantly more than the judicial exception. Regarding Claim 19 Step 2A, Prong 1 Claim 19 recites a system that corresponds to the method of claim 1, and therefore the analysis under Step 2A, Prong 1 with respect to claim 1 also applies to this claim 19. While claim 19 recites additional generic computing components (“computers”, “storage devices”, “neural network”, and “instructions”), such additional generic computing components do not change the analysis under Step 2A, Prong 1. Step 2A, Prong 2 Claim 19 recites a system that corresponds to the method of claim 1, and therefore the analysis under Step 2A, Prong 2 with respect to claim 1 also applies to this claim 19. 
While claim 19 recites additional generic computing components (“computers”, “storage devices”, “neural network”, and “instructions”), such additional generic computing components do not change the analysis under Step 2A, Prong 2. Step 2B Claim 19 recites a system that corresponds to the method of claim 1, and therefore the analysis under Step 2B with respect to claim 1 also applies to this claim 19. While claim 19 recites additional generic computing components (“computers”, “storage devices”, “neural network”, and “instructions”), such additional generic computing components do not change the analysis under Step 2B. Regarding Claim 20 Step 2A, Prong 1 Claim 20 recites one or more non-transitory computer storage media that corresponds to the method of claim 1, and therefore the analysis under Step 2A, Prong 1 with respect to claim 1 also applies to this claim 20. While claim 20 recites additional generic computing components (“computers”, “non-transitory computer storage media”, “neural network”, and “instructions”), such additional generic computing components do not change the analysis under Step 2A, Prong 1. Step 2A, Prong 2 Claim 20 recites one or more non-transitory computer storage media that corresponds to the method of claim 1, and therefore the analysis under Step 2A, Prong 2 with respect to claim 1 also applies to this claim 20. While claim 20 recites additional generic computing components (“computers”, “non-transitory computer storage media”, “neural network”, and “instructions”), such additional generic computing components do not change the analysis under Step 2A, Prong 2. Step 2B Claim 20 recites one or more non-transitory computer storage media that corresponds to the method of claim 1, and therefore the analysis under Step 2B with respect to claim 1 also applies to this claim 20. While claim 20 recites additional generic computing components (“computers”, “non-transitory computer storage media”, “neural network”, and “instructions”), such additional generic computing components do not change the analysis under Step 2B. Claims 10-11 are Subject Matter Eligible Regarding dependent claim 10, under Step 2A, Prong 2, the examiner finds that such claim sufficiently reflects an improvement to reinforcement learning with respect to neural network agents. MPEP 2106.04(d)(1). Para. 0037 of the instant specification explains that learning the “exploratory policies enables the system to continually train the action selection neural network even if the task rewards are sparse, e.g., rarely non-zero.” Para. 0039 of the instant specification explains that “[t]raining the action selection neural network using exploration rewards based on embeddings that characterize controllable aspects of the state of the environment can enable the agent to more effectively explore environments with complex state spaces and inherent randomness, e.g., real-world environments with distractor objects and variations in lighting.” This improvement is reflected by at least the “receiving a task reward that characterizes a progress of the agent towards accomplishing a task after the agent performs the selected action; and determining an overall reward based on the task reward and the modified exploration reward; wherein training the action selection neural network using the reinforcement learning technique based at least in part on the exploration reward comprises: training the action selection neural network using the reinforcement learning technique based on the overall reward” limitations. 
Therefore, claim 10 is found to be subject matter eligible. Claim 11 depends from claim 10 and is subject matter eligible for the same reasons explained with respect to claim 10.

Claim Rejections - 35 USC § 102

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless – (a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1-3, 6, and 19-20 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by US 20190156197 A1, hereinafter referenced as DASGUPTA.

Regarding Claim 1

DASGUPTA teaches:

A method performed by one or more computers (DASGUPTA, para. 0077: “An apparatus or method may be constituted by realizing the operation or processing of information in accordance with the usage of the computer 800”) for selecting an action to be performed by an agent interacting with an environment, the method comprising: (DASGUPTA, para. 0032: “FIG. 2 shows a diagram of the interaction between an apparatus 200 for adaptive exploration in deep reinforcement learning and an environment 232, according to an embodiment of the present invention. The diagram includes apparatus 200, which continuously observes a state 234 of environment 232, receives reward information concerning the actions performed in environment 232, feeds state 234 into a function approximator 220, and determines actions to be performed in environment 232 using an action selection policy 222.”; DASGUPTA, para. 0049: “At S442, a causing section, such as causing section 104, may cause the selected action to be performed. For example, the causing section may cause the action selected at S450 to be performed in the subsequent time frame of the action and observation sequence. Depending on the nature of the action and observation sequence, actions may be caused through instructions or physical interaction, such as in the case of a human or machine, in which case the actions may be performed by the human or the machine, or caused by issuing commands to be executed by the computer program, in which case the actions are performed by the computer program.” Examiner’s Note: Causing section 104 (depicted in Fig. 1) causes a human or machine to act, which corresponds to the recited “agent” that performs a selected action in environment 232)

receiving an observation characterizing a current state of the environment; (DASGUPTA, para. 0024: “Obtaining section 105 may obtain observations. For example, obtaining section 105 may be configured to obtain an observation of the subsequent time frame of action and observation sequence 130.” DASGUPTA, para. 0032: “The diagram includes apparatus 200, which continuously observes a state 234 of environment 232”; Examiner’s Note: Obtaining section 105 obtains observations of states of environment 232)

processing: (i) the observation, and (ii) the exploration importance factor, using an action selection neural network to generate an action selection output, wherein: (DASGUPTA, para. 0016: “Embodiments of the present invention may adaptively change the exploration term (ε) of an ε-greedy policy with dependence on the continuous performance of a function approximator, such as a deep neural network”; DASGUPTA, para. 0021: “Selecting section 102 may select actions from among a plurality of possible actions. For example, selecting section 102 may be configured to select an action according to an action selection policy with which to proceed from a current time frame of action and observation sequence 130 to a subsequent time frame of action and observation sequence 130.”; DASGUPTA, para. 0032: “FIG. 2 shows a diagram of the interaction between an apparatus 200 for adaptive exploration in deep reinforcement learning and an environment 232, according to an embodiment of the present invention. The diagram includes apparatus 200, which continuously observes a state 234 of environment 232, receives reward information concerning the actions performed in environment 232, feeds state 234 into a function approximator 220, and determines actions to be performed in environment 232 using an action selection policy 222. In some embodiments, the environment may be a computer program, such as a game, and as such environment 232 may be entirely digitally created. Such a computer program may be observed by receiving data output from the program, and actions may be caused by issuing commands to be executed by the computer program. Function approximator 220 may be a neural network, such as a deep neural network (DNN), including Deep Q-Networks (DQNs), or any other type of neural network capable of function approximation.”; DASGUPTA, paras. 0039-0040: [image: Equation (1) of DASGUPTA] DASGUPTA, para. 0037: “At S340, an inputting section, such as inputting section 101, may input a current time frame of an action and observation sequence, such as action and observation sequence 130, sequentially into a function approximator, such as function approximator 120, which includes a plurality of parameters, the action and observation sequence including a plurality of time frames, each time frame including action values and observation values.” DASGUPTA, para. 0053: “At S551, a selecting section, such as selecting section 102, may determine whether to select a random action or an action according to a probability function. For example, the selecting section may select one of a random action and a greedy action, wherein the random action has a probability of being selected equal to an exploration term. In this embodiment, a random number is generated between 0 and 1. If the random number is greater than the exploration term, which is also a number between 0 and 1, then the operation proceeds to S554. If the random number is not greater than the exploration term, then the operation proceeds to S552.” Examiner’s Note: Apparatus 200 (as shown in Fig. 2), which includes neural network function approximator 220, and that outputs an action, corresponds to the recited “action selection neural network” and uses the observation sequence and the exploration term (ε) (corresponding to recited “exploration importance factor”) to determine a selected output according to an action selection policy; the examiner notes that exploration term (ε) is utilized to select either a random action (corresponding to an “exploration” action) or a greedy action, and therefore adjusting the exploration term (ε) upwards (towards 1) would favor exploration, whereas setting the exploration term (ε) at the bottom (0) would ensure a greedy action)

the action selection neural network implements a plurality of action selection policies indexed by a set of possible exploration importance factors; and (DASGUPTA, paras. 0039-0040: [image: Equation (1) of DASGUPTA] DASGUPTA, para. 0053: “At S551, a selecting section, such as selecting section 102, may determine whether to select a random action or an action according to a probability function. For example, the selecting section may select one of a random action and a greedy action, wherein the random action has a probability of being selected equal to an exploration term. In this embodiment, a random number is generated between 0 and 1. If the random number is greater than the exploration term, which is also a number between 0 and 1, then the operation proceeds to S554. If the random number is not greater than the exploration term, then the operation proceeds to S552.” Examiner’s Note: Equation (1) of DASGUPTA shows an action selection policy, which is indexed according to the exploration term (ε), where each different possible value of exploration term (ε) from 0 to 1 corresponds to an index for a different version of the action selection policy)

the plurality of action selection policies comprise an exploitative action selection policy and one or more exploratory action selection policies; and (DASGUPTA, paras. 0039-0040: [image: Equation (1) of DASGUPTA] DASGUPTA, para. 0053: “At S551, a selecting section, such as selecting section 102, may determine whether to select a random action or an action according to a probability function. For example, the selecting section may select one of a random action and a greedy action, wherein the random action has a probability of being selected equal to an exploration term. In this embodiment, a random number is generated between 0 and 1. If the random number is greater than the exploration term, which is also a number between 0 and 1, then the operation proceeds to S554. If the random number is not greater than the exploration term, then the operation proceeds to S552.” Examiner’s Note: Equation (1) of DASGUPTA shows an action selection policy, where if a random number is below exploration term (ε), exploration happens (corresponding to recited “one or more exploratory action selection policies”) and if above exploration term (ε), a greedy action happens (corresponding to recited “exploitative action selection policy”))

selecting an action to be performed by the agent using the action selection output. (DASGUPTA, para. 0021: “Selecting section 102 may select actions from among a plurality of possible actions. For example, selecting section 102 may be configured to select an action according to an action selection policy with which to proceed from a current time frame of action and observation sequence 130 to a subsequent time frame of action and observation sequence 130.”; DASGUPTA, para. 0032: “FIG. 2 shows a diagram of the interaction between an apparatus 200 for adaptive exploration in deep reinforcement learning and an environment 232, according to an embodiment of the present invention. The diagram includes apparatus 200, which continuously observes a state 234 of environment 232, receives reward information concerning the actions performed in environment 232, feeds state 234 into a function approximator 220, and determines actions to be performed in environment 232 using an action selection policy 222.” DASGUPTA, para. 0049: “At S442, a causing section, such as causing section 104, may cause the selected action to be performed. For example, the causing section may cause the action selected at S450 to be performed in the subsequent time frame of the action and observation sequence. Depending on the nature of the action and observation sequence, actions may be caused through instructions or physical interaction, such as in the case of a human or machine, in which case the actions may be performed by the human or the machine, or caused by issuing commands to be executed by the computer program, in which case the actions are performed by the computer program.” Examiner’s Note: the causing section 104 causes a human or machine (the recited “agent”) to perform the selected action)
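For context, the ε-greedy mechanism the examiner maps onto claims 1-3 can be sketched as follows. This paraphrases the examiner's characterization of DASGUPTA's Equation (1) and selection step S551; it is not DASGUPTA's actual code, and the Q-values shown are illustrative.

    import random

    def epsilon_greedy(q_values, epsilon):
        # Exploratory branch: a random action is taken with probability equal
        # to the exploration term (the element mapped to the "exploration
        # importance factor").
        if random.random() < epsilon:
            return random.randrange(len(q_values))
        # Exploitative (greedy) branch: the action with the highest estimated value.
        return max(range(len(q_values)), key=lambda a: q_values[a])

    action = epsilon_greedy([0.1, 0.9, 0.4], epsilon=0.0)  # epsilon of 0 always yields the greedy action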
In this embodiment, a random number is generated between 0 and 1. If the random number is greater than the exploration term, which is also a number between 0 and 1, then the operation proceeds to S554. If the random number is not greater than the exploration term, then the operation proceeds to S552.” Examiner’s Note: The exploration term (ε) is between 0 to 1, where at the baseline (0) it means that only the greedy action (corresponding to the recited “exploitative action selection policy”) is implemented) Regarding Claim 6 DASGUPTA teaches the method of claim 2 as explained above. DASGUPTA further teaches: wherein the exploitative action selection policy causes the agent to exploit its current knowledge of the environment. (DASGUPTA, para. 0049: “At S442, a causing section, such as causing section 104, may cause the selected action to be performed. For example, the causing section may cause the action selected at S450 to be performed in the subsequent time frame of the action and observation sequence.”; DASGUPTA, para. 0053: “At S551, a selecting section, such as selecting section 102, may determine whether to select a random action or an action according to a probability function. For example, the selecting section may select one of a random action and a greedy action, wherein the random action has a probability of being selected equal to an exploration term. In this embodiment, a random number is generated between 0 and 1. If the random number is greater than the exploration term, which is also a number between 0 and 1, then the operation proceeds to S554. If the random number is not greater than the exploration term, then the operation proceeds to S552.”; DASGUPTA, para. 0056: “As operations S554 and S555 are iteratively performed, the selecting section evaluates each reward probability of a plurality of possible actions according to the probability function based on the value function to select the greedy action.”; Examiner’s Note: The greedy action (corresponding to the recited “exploitative” action) is performed based on the current time state, and after execution proceeds to the subsequent time state as explained by para. 0049) Regarding Claim 19 DASGUPTA teaches: A system comprising: one or more computers; and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform operations for (DASGUPTA, para. 0072: “FIG. 8 shows an exemplary hardware configuration of a computer configured for adaptive exploration in deep reinforcement learning, according to an embodiment of the present invention. A program that is installed in the computer 800 can cause the computer 800 to function as or perform operations associated with apparatuses of the embodiments of the present invention or one or more sections (including modules, components, elements, etc.) thereof, and/or cause the computer 800 to perform processes of the embodiments of the present invention or steps thereof. Such a program may be executed by the CPU 800-12 to cause the computer 800 to perform certain operations associated with some or all of the blocks of flowcharts and block diagrams described herein.”; DASGUPTA, para. 0077: A program is provided by computer readable media such as the DVD-ROM 800-01 or the IC card. 
Regarding Claim 19
DASGUPTA teaches: A system comprising: one or more computers; and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform operations for (DASGUPTA, para. 0072: “FIG. 8 shows an exemplary hardware configuration of a computer configured for adaptive exploration in deep reinforcement learning, according to an embodiment of the present invention. A program that is installed in the computer 800 can cause the computer 800 to function as or perform operations associated with apparatuses of the embodiments of the present invention or one or more sections (including modules, components, elements, etc.) thereof, and/or cause the computer 800 to perform processes of the embodiments of the present invention or steps thereof. Such a program may be executed by the CPU 800-12 to cause the computer 800 to perform certain operations associated with some or all of the blocks of flowcharts and block diagrams described herein.”; DASGUPTA, para. 0077: “A program is provided by computer readable media such as the DVD-ROM 800-01 or the IC card. The program is read from the computer readable media, installed into the hard disk drive 800-24, RAM 800-14, or ROM 800-30, which are also examples of computer readable media, and executed by the CPU 800-12.”) The remaining limitations in claim 19 correspond to the method of claim 1, and therefore claim 19 is rejected for the same reasons explained above with respect to claim 1.

Regarding Claim 20
DASGUPTA teaches: One or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations for (DASGUPTA, para. 0077: “A program is provided by computer readable media such as the DVD-ROM 800-01 or the IC card. The program is read from the computer readable media, installed into the hard disk drive 800-24, RAM 800-14, or ROM 800-30, which are also examples of computer readable media, and executed by the CPU 800-12.”) The remaining limitations in claim 20 correspond to the method of claim 1, and therefore claim 20 is rejected for the same reasons explained above with respect to claim 1.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 4-5 and 7-11 are rejected under 35 U.S.C. 103 as being unpatentable over DASGUPTA in view of Stadie, Bradly C. "Incentivizing exploration in reinforcement learning with deep predictive models." arXiv preprint arXiv:1507.00814 (2015), hereinafter referenced as STADIE.

Regarding Claim 4
DASGUPTA teaches the method of claim 1 as explained above. However, DASGUPTA fails to explicitly teach: wherein the action selection policies have been trained to optimize an overall reward comprising a task reward and an exploration reward. However, in a related field of endeavor (reinforcement learning with deep predictive models with respect to agents facing “the exploration versus exploitation tradeoff”, see p. 1, section 1), STADIE teaches and makes obvious: wherein the action selection policies have been trained to optimize an overall reward comprising a task reward and an exploration reward. (STADIE, p. 2, section 2: [image of STADIE Equation 1]; STADIE, p.
3, section 3, Algorithm 1: “line 17: return optimized policy π”; Examiner’s Note: STADIE teaches an overall reward RBONUS, as shown in Equation 1, where RBONUS comprises R(s,a) (corresponding to recited task reward) and βN(s,a) (corresponding to recited exploration reward), and where Algorithm 1 performs reinforcement learning using such RBONUS (see line 6) to return an optimized policy (see line 17); the DASGUPTA-STADIE combination now modifies the action selection policies of DASGUPTA to utilize the RBONUS reward function when considering reward information (see para. 0032 of DASGUPTA)) Before the effective filing date of the present application, it would have been obvious to one of ordinary skill in the art to combine the teachings of DASGUPTA with STADIE as explained above. As disclosed by STADIE, one of ordinary skill would have been motivated to do so in order to incentivize exploration of novel states using exploration bonuses, which “can produce rapid and effective exploration.” (p. 1, section 1). As further disclosed by STADIE, one of ordinary skill would be motivated to use the exploration bonuses of STADIE because using such techniques “can scale to large problems where Bayesian approaches to exploration become impractical.” (p. 2, section 1).

Regarding Claim 5
DASGUPTA and STADIE teach the method of claim 4 as explained above. However, DASGUPTA fails to explicitly teach: wherein the exploitative action selection policy is an action selection policy that is trained to optimize an overall reward independent of the exploration reward. However, in a related field of endeavor (reinforcement learning with deep predictive models with respect to agents facing “the exploration versus exploitation tradeoff”, see p. 1, section 1), STADIE teaches and makes obvious: wherein the exploitative action selection policy is an action selection policy that is trained to optimize an overall reward independent of the exploration reward. (STADIE, p. 3, section 3, Algorithm 1: “line 17: return optimized policy π”; Examiner’s Note: STADIE teaches Algorithm 1 that performs reinforcement learning using such a reward function (see line 6) to return an optimized policy (see line 17); the DASGUPTA-STADIE combination now modifies the exploitative action selection policies of DASGUPTA (see para. 0039), which is a strict alternative to exploration and is therefore entirely independent of the exploration bonus, to utilize the reward function R(s,a) of STADIE when considering reward information (see para. 0032 of DASGUPTA)) Before the effective filing date of the present application, it would have been obvious to one of ordinary skill in the art to combine the teachings of DASGUPTA with STADIE as explained above. As disclosed by STADIE, one of ordinary skill would have been motivated to do so in order to incentivize exploration of novel states using exploration bonuses, which “can produce rapid and effective exploration.” (p. 1, section 1). As further disclosed by STADIE, one of ordinary skill would be motivated to use the exploration bonuses of STADIE because using such techniques “can scale to large problems where Bayesian approaches to exploration become impractical.” (p. 2, section 1).

Regarding Claim 7
DASGUPTA teaches the method of claim 1 as explained above.
However, DASGUPTA fails to explicitly teach: determining an exploration reward based on: (i) a subsequent observation characterizing a state of the environment after the agent performs the selected action and (ii) one or more prior observations characterizing states of the environment prior to the agent performing the selected action; and training the action selection neural network using a reinforcement learning technique based at least in part on the exploration reward. However, in a related field of endeavor (reinforcement learning with deep predictive models with respect to agents facing “the exploration versus exploitation tradeoff”, see p. 1, section 1), STADIE teaches and makes obvious: determining an exploration reward based on: (i) a subsequent observation characterizing a state of the environment after the agent performs the selected action and (ii) one or more prior observations characterizing states of the environment prior to the agent performing the selected action; and (STADIE, p. 3, section 3, Algorithm 1: [image of STADIE Algorithm 1]; Examiner’s Note: As shown in lines 5-6, RBONUS is calculated based on st (corresponding to the current state, which is a state “prior to the agent performing the selected action”) and on a term that, as shown in line 4, is dependent on st+1 (corresponding to the next state, which is “after the agent performs the selected action”); the DASGUPTA-STADIE combination now modifies the reward information of DASGUPTA (see para. 0032 of DASGUPTA) to utilize the incentive rewards for exploring as in STADIE) training the action selection neural network using a reinforcement learning technique based at least in part on the exploration reward. (STADIE, p. 3, section 3: “In our work, we draw on recent developments in representation learning with neural networks...”; Examiner’s Note: STADIE teaches using reinforcement learning techniques with respect to training neural networks, and where Algorithm 1 shows that training is based at least in part on an exploration reward bonus; the DASGUPTA-STADIE combination now trains the action selection neural network of DASGUPTA to use reinforcement learning techniques using the exploration bonus reward of STADIE) Before the effective filing date of the present application, it would have been obvious to one of ordinary skill in the art to combine the teachings of DASGUPTA with STADIE as explained above. As disclosed by STADIE, one of ordinary skill would have been motivated to do so in order to incentivize exploration of novel states using exploration bonuses, which “can produce rapid and effective exploration.” (p. 1, section 1). As further disclosed by STADIE, one of ordinary skill would be motivated to use the exploration bonuses of STADIE because using such techniques “can scale to large problems where Bayesian approaches to exploration become impractical.” (p. 2, section 1).

Regarding Claim 8
DASGUPTA and STADIE teach the method of claim 7 as explained above. However, DASGUPTA fails to explicitly teach: prior to training the action selection neural network using the reinforcement learning technique based at least in part on the exploration reward: modifying the exploration reward based on the exploration importance factor. However, in a related field of endeavor (reinforcement learning with deep predictive models with respect to agents facing “the exploration versus exploitation tradeoff”, see p.
1, section 1), STADIE teaches and makes obvious: prior to training the action selection neural network using the reinforcement learning technique based at least in part on the exploration reward: modifying the exploration reward based on the exploration importance factor. (STADIE, p. 3, section 3, Algorithm 1: [image of STADIE Algorithm 1]; STADIE, p. 3, section 3: “In our work, we draw on recent developments in representation learning with neural networks...”; Examiner’s Note: prior to performing reinforcement learning as taught by STADIE, the DASGUPTA-STADIE combination now modifies the reward information of DASGUPTA (see para. 0032 of DASGUPTA) to utilize the incentive rewards for exploring as in STADIE, such that lines 5-6 of Algorithm 1 are modified to use the exploration term (ε) of DASGUPTA prior to running Algorithm 1 of STADIE) Before the effective filing date of the present application, it would have been obvious to one of ordinary skill in the art to combine the teachings of DASGUPTA with STADIE as explained above. As disclosed by STADIE, one of ordinary skill would have been motivated to do so in order to incentivize exploration of novel states using exploration bonuses, which “can produce rapid and effective exploration.” (p. 1, section 1). As further disclosed by STADIE, one of ordinary skill would be motivated to use the exploration bonuses of STADIE because using such techniques “can scale to large problems where Bayesian approaches to exploration become impractical.” (p. 2, section 1).

Regarding Claim 9
DASGUPTA and STADIE teach the method of claim 8 as explained above. However, DASGUPTA fails to explicitly teach: wherein modifying the exploration reward based on the exploration importance factor comprises: scaling the exploration reward by the exploration importance factor. However, in a related field of endeavor (reinforcement learning with deep predictive models with respect to agents facing “the exploration versus exploitation tradeoff”, see p. 1, section 1), STADIE teaches and makes obvious: wherein modifying the exploration reward based on the exploration importance factor comprises: scaling the exploration reward by the exploration importance factor. (STADIE, p. 3, section 3, Algorithm 1: [image of STADIE Algorithm 1]; STADIE, p. 3, section 3: “In our work, we draw on recent developments in representation learning with neural networks...”; Examiner’s Note: prior to performing reinforcement learning as taught by STADIE, the DASGUPTA-STADIE combination now modifies the reward information of DASGUPTA (see para. 0032 of DASGUPTA) to utilize the incentive rewards for exploring as in STADIE, such that lines 5-6 of Algorithm 1 are modified to replace (β) with the exploration term (ε) of DASGUPTA, which scales the exploration reward bonus portion) Before the effective filing date of the present application, it would have been obvious to one of ordinary skill in the art to combine the teachings of DASGUPTA with STADIE as explained above. As disclosed by STADIE, one of ordinary skill would have been motivated to do so in order to incentivize exploration of novel states using exploration bonuses, which “can produce rapid and effective exploration.” (p. 1, section 1). As further disclosed by STADIE, one of ordinary skill would be motivated to use the exploration bonuses of STADIE because using such techniques “can scale to large problems where Bayesian approaches to exploration become impractical.” (p. 2, section 1).
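The scaling asserted for claims 8-9 (and the summing asserted for claims 10-11 below) reduces to a one-line reward-shaping rule. The sketch below is an editorial illustration of that rule only, not code from STADIE or DASGUPTA; the names are illustrative, and the novelty bonus would come from STADIE's learned-model prediction error or a similar signal.

    def overall_reward(task_reward, exploration_bonus, exploration_factor):
        # task_reward        -- R(s, a), the task-progress reward
        # exploration_bonus  -- novelty signal (e.g., a prediction-error bonus)
        # exploration_factor -- scaling term in the role of beta / epsilon
        return task_reward + exploration_factor * exploration_bonus

    # A zero exploration factor leaves only the exploitative (task) objective.
    assert overall_reward(1.0, 0.5, 0.0) == 1.0
    assert overall_reward(1.0, 0.5, 0.5) == 1.25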
Regarding Claim 10
DASGUPTA and STADIE teach the method of claim 8 as explained above. However, DASGUPTA fails to explicitly teach: receiving a task reward that characterizes a progress of the agent towards accomplishing a task after the agent performs the selected action; and determining an overall reward based on the task reward and the modified exploration reward; wherein training the action selection neural network using the reinforcement learning technique based at least in part on the exploration reward comprises: training the action selection neural network using the reinforcement learning technique based on the overall reward. However, in a related field of endeavor (reinforcement learning with deep predictive models with respect to agents facing “the exploration versus exploitation tradeoff”, see p. 1, section 1), STADIE teaches and makes obvious: receiving a task reward that characterizes a progress of the agent towards accomplishing a task after the agent performs the selected action; and (STADIE, p. 2, section 2: [image of STADIE Equation 1]; STADIE, p. 3, section 3, Algorithm 1: “line 17: return optimized policy π”; Examiner’s Note: STADIE teaches an overall reward RBONUS, as shown in Equation 1, where RBONUS comprises R(s,a) (corresponding to recited task reward) and βN(s,a) (corresponding to recited exploration reward); the DASGUPTA-STADIE combination now modifies the action selection policies of DASGUPTA to utilize the RBONUS reward function when considering reward information (see para. 0032 of DASGUPTA)) determining an overall reward based on the task reward and the modified exploration reward; (STADIE, p. 2, section 2: [image of STADIE Equation 1]; STADIE, p. 3, section 3, Algorithm 1: “line 17: return optimized policy π”; Examiner’s Note: STADIE teaches an overall reward RBONUS, as shown in Equation 1, where RBONUS comprises R(s,a) (corresponding to recited task reward) and βN(s,a) (corresponding to recited exploration reward), and where Algorithm 1 performs reinforcement learning using such RBONUS (see line 6) to return an optimized policy (see line 17); the DASGUPTA-STADIE combination now modifies the action selection policies of DASGUPTA to utilize the RBONUS reward function, where the rewards are modified to include the exploratory reward of claim 8, when considering reward information (see para. 0032 of DASGUPTA)) wherein training the action selection neural network using the reinforcement learning technique based at least in part on the exploration reward comprises: training the action selection neural network using the reinforcement learning technique based on the overall reward. (STADIE, p. 3, section 3, Algorithm 1: [image of STADIE Algorithm 1]; Examiner’s Note: the DASGUPTA-STADIE combination now modifies the reward information of DASGUPTA (see para. 0032 of DASGUPTA) to utilize the overall rewards as in STADIE) Before the effective filing date of the present application, it would have been obvious to one of ordinary skill in the art to combine the teachings of DASGUPTA with STADIE as explained above. As disclosed by STADIE, one of ordinary skill would have been motivated to do so in order to incentivize exploration of novel states using exploration bonuses, which “can produce rapid and effective exploration.” (p. 1, section 1).
As further disclosed by STADIE, one of ordinary skill would be motivated to use the exploration bonuses of STADIE because using such techniques “can scale to large problems where Bayesian approaches to exploration become impractical.” (p. 2, section 1).

Regarding Claim 11
DASGUPTA and STADIE teach the method of claim 10 as explained above. However, DASGUPTA fails to explicitly teach: wherein determining the overall reward based on the task reward and the modified exploration reward comprises: determining the overall reward as a sum of the task reward and the modified exploration reward. However, in a related field of endeavor (reinforcement learning with deep predictive models with respect to agents facing “the exploration versus exploitation tradeoff”, see p. 1, section 1), STADIE teaches and makes obvious: wherein determining the overall reward based on the task reward and the modified exploration reward comprises: determining the overall reward as a sum of the task reward and the modified exploration reward. (STADIE, p. 2, section 2: [image of STADIE Equation 1]; STADIE, p. 3, section 3, Algorithm 1: “line 17: return optimized policy π”; Examiner’s Note: STADIE teaches an overall reward RBONUS, as shown in Equation 1, where RBONUS comprises R(s,a) (corresponding to recited task reward) and βN(s,a) (corresponding to recited exploration reward); the DASGUPTA-STADIE combination now modifies the action selection policies of DASGUPTA to utilize the RBONUS reward function when considering reward information (see para. 0032 of DASGUPTA), where the exploration reward is modified as explained in the rejections to claim 8) Before the effective filing date of the present application, it would have been obvious to one of ordinary skill in the art to combine the teachings of DASGUPTA with STADIE as explained above. As disclosed by STADIE, one of ordinary skill would have been motivated to do so in order to incentivize exploration of novel states using exploration bonuses, which “can produce rapid and effective exploration.” (p. 1, section 1). As further disclosed by STADIE, one of ordinary skill would be motivated to use the exploration bonuses of STADIE because using such techniques “can scale to large problems where Bayesian approaches to exploration become impractical.” (p. 2, section 1).

Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over DASGUPTA in view of STADIE and further in view of Janisch, Jaromír, et al. "Classification with Costly Features using Deep Reinforcement Learning." arXiv preprint arXiv:1711.07364 (2018), hereinafter referenced as JANISCH.

Regarding Claim 12
DASGUPTA and STADIE teach the method of claim 7 as explained above. However, DASGUPTA and STADIE fail to explicitly teach: wherein the reinforcement learning technique is a Retrace double Q-learning technique with prioritized experience replay. However, in a related field of endeavor (reinforcement learning), JANISCH teaches and makes obvious: wherein the reinforcement learning technique is a Retrace double Q-learning technique with prioritized experience replay. (JANISCH, p. 3, left column: “Retrace [Munos et al., 2016] is a method to efficiently utilize long traces of experience with truncated importance sampling. We store generated trajectories into an experience replay buffer [Lin, 1993] and utilize whole episode returns by recursively expanding eq. (2). The stored trajectories are off the current policy and a correction is needed. For a sequence ..., we implement Retrace together with Double Q-learning”; Examiner’s Note: the DASGUPTA-STADIE-JANISCH combination now uses the Retrace method, together with Double Q-learning, which uses an experience replay buffer, when implementing Algorithm 1 of STADIE which performs reinforcement learning) Before the effective filing date of the present application, it would have been obvious to one of ordinary skill in the art to combine the teachings of DASGUPTA with STADIE and JANISCH as explained above. As disclosed by JANISCH, one of ordinary skill would have been motivated to do so because the authors of JANISCH established that such combination of Retrace with double Q-Learning leads to “faster convergence” and can also “be easily parallelized across all episodes.” (p. 3, left column).
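Claim 12 turns on combining Retrace, Double Q-learning, and prioritized experience replay. The Python sketch below shows one common way those pieces fit together and is an editorial illustration under simplifying assumptions (a single-episode trajectory segment, a greedy target policy, and absolute-correction priorities), not JANISCH's exact formulation; all names are illustrative.

    import numpy as np

    def retrace_double_q_targets(q_online, q_target, trajectory, gamma=0.99, lam=0.95):
        # trajectory: list of (state, action, reward, next_state, behaviour_prob, done)
        # drawn from one episode segment stored in a replay buffer.
        # q_online / q_target: callables mapping a state to a vector of action values.
        targets, priorities = [], []
        delta_next, c_next = 0.0, 0.0   # Retrace correction and weight from step t+1
        for (s, a, r, s_next, mu, done) in reversed(trajectory):
            q_sa = float(q_online(s)[a])
            if done:
                bootstrap = 0.0
            else:
                a_star = int(np.argmax(q_online(s_next)))    # Double Q: online net picks...
                bootstrap = float(q_target(s_next)[a_star])  # ...target net evaluates.
            delta = r + gamma * bootstrap - q_sa              # one-step TD error
            correction = delta + gamma * c_next * delta_next  # Retrace recursion
            targets.append(q_sa + correction)
            priorities.append(abs(correction))                # replay priority (common choice)
            # Truncated importance weight used by the preceding step; the target
            # policy is taken to be greedy with respect to the online network.
            pi = 1.0 if a == int(np.argmax(q_online(s))) else 0.0
            c_next = lam * min(1.0, pi / max(mu, 1e-8))
            delta_next = correction
        targets.reverse()
        priorities.reverse()
        return targets, priorities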
Claim 13 is rejected under 35 U.S.C. 103 as being unpatentable over DASGUPTA in view of US 20200018860 A1, hereinafter referenced as MARMET.

Regarding Claim 13
DASGUPTA teaches the method of claim 1 as explained above. However, DASGUPTA fails to explicitly teach: wherein the set of possible exploration importance factors is a discrete set. However, in a related field of endeavor (analysis of environments, see para. 0003), MARMET teaches and makes obvious: wherein the set of possible exploration importance factors is a discrete set. (MARMET, para. 0034: “In an embodiment, user positions and/or G.N.S.S. satellite positions are quantized (or discretized or quantified): positions are restricted to discrete values rather than to a continuous set of values. This minimizes the amount of data to be manipulated and also averages or smoothes uncertainties in positioning.”; Examiner’s Note: the DASGUPTA-MARMET combination now discretizes the possible values for the exploration term (ε) of DASGUPTA using the teachings of MARMET) Before the effective filing date of the present application, it would have been obvious to one of ordinary skill in the art to combine the teachings of DASGUPTA with MARMET as explained above. As disclosed by MARMET, one of ordinary skill would have been motivated to do so in order to reduce the “amount of data to be manipulated” and also to average or smooth data uncertainties. (para. 0034).

Allowable Subject Matter
Claims 14-18 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims, provided that the rejections under 35 U.S.C. 101 are overcome. The following is a statement of reasons for the indication of allowable subject matter: Claim 14 would be considered allowable, provided that the rejections under 35 U.S.C. 101 are overcome, if rewritten in independent form including all of the limitations of the base claim and any intervening claims, because none of the references of record, either alone or in combination, fairly disclose or suggest the combination of limitations specified in claim 14, including at least: wherein determining the exploration reward comprises: providing the subsequent observation and the prior observations to an embedding neural network, wherein the embedding neural network is configured to process an input observation to generate an embedded representation of the input observation; and determining the exploration reward based on, for each prior observation, a similarity measure between the embedded representation of the prior observation and the embedded representation of the subsequent observation.
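For reference, the indicated allowable feature of claim 14 computes the exploration reward from similarity between embedded observations. The sketch below is an editorial illustration of that general idea only; the embedding function, the cosine similarity measure, and the inverse-similarity bonus are placeholders, and the application's actual formulation may differ.

    import numpy as np

    def exploration_reward(embed, prior_observations, subsequent_observation):
        # embed: embedding network mapping an observation to a vector.
        # The bonus is large when the new observation's embedding is unlike
        # the embeddings of observations seen earlier.
        z_new = embed(subsequent_observation)
        sims = []
        for obs in prior_observations:
            z_old = embed(obs)
            denom = np.linalg.norm(z_old) * np.linalg.norm(z_new) + 1e-8
            sims.append(float(np.dot(z_old, z_new) / denom))   # cosine similarity
        return 1.0 - max(sims) if sims else 1.0                # novelty bonus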
The closest prior art of record discloses:
DASGUPTA teaches reinforcement learning techniques where an action selection policy observes a state of an environment and determines an action to be performed. (para. 0032).
STADIE teaches a reinforcement learning technique that provides an incentive to explore novel states, where such incentive is reflected in a reward function used in the reinforcement learning. (p. 2, section 2, and p. 3, section 3).
US 20200364557 A1 (Ostrovski) teaches, in the context of reinforcement learning, having one or more neural network layers that are configured to generate an encoded representation of an observation. (para. 0072).
US 20190025803 A1 (Akitomi) teaches: “the reward sharing extraction 114 calculates a distance between variables corresponding to states of each reward function information 123 and searches for reward function information 123 in which the distance is equal to or less than a threshold value.” (para. 0168).

However, the examiner has found that the distinct feature of the Applicant's claimed invention over the prior art is the explicit claiming of the aforementioned limitations in combination with all the other limitations as specified in claim 14. Moreover, the examiner has found that one of ordinary skill would not have been motivated to combine the prior art of record in the manner recited without the hindsight aid of Applicant’s disclosure, and in particular, the “determining the exploration reward based on, for each prior observation, a similarity measure between the embedded representation of the prior observation and the embedded representation of the subsequent observation” as recited in claim 14. Claims 15-18 depend from claim 14, and would similarly be allowed if rewritten in independent form including all of the limitations of the base claim and any intervening claims, for the same reasons explained with respect to claim 14, provided that the rejections under 35 U.S.C. 101 are overcome.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Tang, Haoran, et al. "#Exploration: A study of count-based exploration for deep reinforcement learning." Advances in Neural Information Processing Systems 30 (2017). See equation (1) in section 2.2, where the reward function has an “exploration bonus”.
Xu, Zhi-Xiong, et al. "A study of count-based exploration and bonus for reinforcement learning." 2017 IEEE 2nd International Conference on Cloud Computing and Big Data Analysis (ICCCBDA). IEEE, 2017. See equation (2) on page 426, where the reward function has an “exploration bonus”.
US 20190371348 A1 (Mirzahasanloo): “After the decoder actor (policy) has also synchronously received the same reward, the critic (value) may calculate the temporal difference error, and accordingly update the policy, action-selection parameters, and the exploration/exploitation rules.” (para. 0142).

Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHAEL C LEE whose telephone number is (571)272-4933. The examiner can normally be reached M-F 12:00 pm - 8:00 pm ET. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Omar Fernandez Rivas can be reached at 571-272-2589. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /MICHAEL C. LEE/Examiner, Art Unit 2128

Prosecution Timeline

Jun 13, 2023
Application Filed
Jan 30, 2026
Non-Final Rejection — §101, §102, §103, §112 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12603081
METHOD AND SERVER FOR A TEXT-TO-SPEECH PROCESSING
2y 5m to grant · Granted Apr 14, 2026
Patent 12602605
QUANTUM COMPUTER ARCHITECTURE BASED ON MULTI-QUBIT GATES
2y 5m to grant · Granted Apr 14, 2026
Patent 12591915
METHODS AND SYSTEMS FOR DETERMINING RECOMMENDATIONS BASED ON REAL-TIME OPTIMIZATION OF MACHINE LEARNING MODELS
2y 5m to grant · Granted Mar 31, 2026
Patent 12585743
INTERFACE ACCESS PROCESSING METHOD, COMPUTER DEVICE AND STORAGE MEDIUM
2y 5m to grant · Granted Mar 24, 2026
Patent 12568935
AI-BASED LIVESTOCK MANAGEMENT SYSTEM AND LIVESTOCK MANAGEMENT METHOD THEREOF
2y 5m to grant · Granted Mar 10, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

1-2
Expected OA Rounds
59%
Grant Probability
86%
With Interview (+27.1%)
3y 2m
Median Time to Grant
Low
PTA Risk
Based on 136 resolved cases by this examiner. Grant probability derived from career allow rate.
