Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Objections
Claim 10 is objected to because of the following informalities: “The method of claim 9 when dependent on claim 6” should read “The method of claim 9”. Appropriate correction is required.
Claim 16 is objected to because of the following informalities: “determining an updated multi-objective policy based on combination of the separate probability distributions for the plurality of objectives” should read “determining an updated multi-objective policy based on the combination of the separate probability distributions for the plurality of objectives”. Appropriate correction is required.
Claims 17-18 are further objected to by virtue of their dependence, either directly or indirectly, on claim 16.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-19 and 21 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Regarding Claim 1,
Claim 1 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 1 is directed to a method for training a neural network system by reinforcement learning, which is directed to a process, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“determining a separate action-value function for each of the plurality of objectives based on the set of one or more trajectories, each separate action-value function determining an action value representing an estimated return according to the corresponding objective that would result from the agent performing a given action in response to a given state according to the previous policy”
“combining the separate action-value functions of the plurality of objectives”
“determining an updated multi-objective policy based on the combination of the separate action-value functions for the plurality of objectives”
As drafted, under their broadest reasonable interpretations, cover mental processes, i.e., concepts performed in the human mind (including an observation, evaluation, judgement, opinion). The above limitations in the context of this claim correspond to mental processes, e.g., evaluation and judgement with assistance of pen and paper. Additionally or alternatively, the limitations under their broadest reasonable interpretations cover mathematical concepts, i.e., mathematical relationships, mathematical formulas or equations, and mathematical calculations. The above limitations in the context of this claim additionally or alternatively encompass mathematical calculations.
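By way of illustration only, the recited sequence of determining separate per-objective action-value estimates, combining them, and deriving an updated policy can be carried out as a short calculation. The following sketch is hypothetical (a tabular setting, Monte Carlo averages of logged per-objective rewards standing in for estimated returns, and uniform preference weights are all assumptions; the names are illustrative and not taken from the claims):

```python
# Hypothetical, simplified sketch of the limitations discussed above.
import numpy as np

n_states, n_actions, n_objectives = 4, 3, 2
rng = np.random.default_rng(0)

# Logged trajectories: (state, action, per-objective rewards) tuples.
trajectories = [(rng.integers(n_states), rng.integers(n_actions),
                 rng.random(n_objectives)) for _ in range(200)]

# Determine a separate action-value estimate for each objective.
q = np.zeros((n_objectives, n_states, n_actions))
counts = np.zeros((n_states, n_actions))
for s, a, rewards in trajectories:
    counts[s, a] += 1
    q[:, s, a] += (rewards - q[:, s, a]) / counts[s, a]

# Combine the separate action-value functions (uniform preference weights assumed).
weights = np.full(n_objectives, 1.0 / n_objectives)
q_combined = np.tensordot(weights, q, axes=1)          # shape (n_states, n_actions)

# Determine an updated policy from the combination (softmax improvement step).
updated_policy = np.exp(q_combined)
updated_policy /= updated_policy.sum(axis=1, keepdims=True)
print(updated_policy)
```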
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that are mere instructions to apply an exception (See MPEP 2106.05(f)) and insignificant extra-solution activity (See MPEP 2106.05(g)).
The limitations:
“the neural network system being configured to receive an input observation characterizing a state of an environment interacted with by an agent and to select and output an action in accordance with a policy that aims to satisfy a plurality of objectives”
As drafted, are additional elements that amount to no more than mere instructions to apply an exception for the abstract ideas. See MPEP 2106.05(f).
The limitations:
“obtaining a set of one or more trajectories, each trajectory comprising a state of an environment, an action applied by the agent to the environment according to a previous policy in response to the state, and a set of rewards for the action, each reward relating to a corresponding objective of the plurality of objectives”
As drafted, are additional elements that amount to no more than insignificant extra-solution activity. See MPEP 2106.05(g).
Therefore, the additional elements do not integrate the abstract ideas into a practical application.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the integration of the abstract ideas into a practical application, all of the additional elements are “mere instructions to apply” and “insignificant extra-solution activity”. Specifically, the obtaining limitation recites the well-understood, routine, and conventional activity of storing and retrieving information in memory. MPEP 2106.05(d)(II); Versata Dev. Group, Inc. v. SAP Am., Inc., 793 F.3d 1306, 1334, 115 USPQ2d 1681, 1701 (Fed. Cir. 2015). Mere instructions to apply and insignificant extra-solution activity cannot provide an inventive concept. The claim is not patent eligible.
Regarding Claim 2,
Claim 2 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 2 is directed to a method for training a neural network system by reinforcement learning, which is directed to a process, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“determining an objective-specific policy for each objective in the plurality of objectives, each objective-specific policy being determined based on the corresponding action-value function for the corresponding objective”
“determining the updated policy by fitting a set of policy parameters of the updated policy to a combination of the objective-specific policies”
As drafted, under their broadest reasonable interpretations, cover mental processes, i.e., concepts performed in the human mind (including an observation, evaluation, judgement, opinion). The above limitations in the context of this claim correspond to mental processes, e.g., evaluation and judgement with assistance of pen and paper. Additionally or alternatively, the limitations under their broadest reasonable interpretations cover mathematical concepts, i.e., mathematical relationships, mathematical formulas or equations, and mathematical calculations. The above limitations in the context of this claim additionally or alternatively encompass mathematical calculations.
Step 2A Prong Two Analysis: See corresponding analysis of claim 1.
Step 2B Analysis: See corresponding analysis of claim 1.
Regarding Claim 3,
Claim 3 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 3 is directed to a method for training a neural network system by reinforcement learning, which is directed to a process, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“wherein fitting the set of policy parameters of the updated policy to the combination of the objective-specific policies comprises determining the set of policy parameters that minimizes a difference between the updated policy and the combination of the objective-specific policies”
As drafted, under their broadest reasonable interpretations, cover mental processes, i.e., concepts performed in the human mind (including an observation, evaluation, judgement, opinion). The above limitations in the context of this claim correspond to mental processes, e.g., evaluation and judgement with assistance of pen and paper. Additionally or alternatively, the limitations under their broadest reasonable interpretations cover mathematical concepts, i.e., mathematical relationships, mathematical formulas or equations, and mathematical calculations. The above limitations in the context of this claim additionally or alternatively encompass mathematical calculations.
Step 2A Prong Two Analysis: See corresponding analysis of claim 2.
Step 2B Analysis: See corresponding analysis of claim 2.
Regarding Claim 4,
Claim 4 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 4 is directed to a method for training a neural network system by reinforcement learning, which is directed to a process, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“wherein the set of policy parameters for the updated policy are constrained such that a difference between the updated policy and the previous policy cannot exceed a trust region threshold”
As drafted, under their broadest reasonable interpretations, cover mental processes, i.e., concepts performed in the human mind (including an observation, evaluation, judgement, opinion). The above limitations in the context of this claim correspond to mental processes, e.g., evaluation and judgement with assistance of pen and paper. Additionally or alternatively, the limitations under their broadest reasonable interpretations cover mathematical concepts, i.e., mathematical relationships, mathematical formulas or equations, and mathematical calculations. The above limitations in the context of this claim additionally or alternatively encompass mathematical calculations.
Step 2A Prong Two Analysis: See corresponding analysis of claim 2.
Step 2B Analysis: See corresponding analysis of claim 2.
Regarding Claim 5,
Claim 5 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 5 is directed to a method for training a neural network system by reinforcement learning, which is directed to a process, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“wherein determining an objective-specific policy for each objective comprises determining objective-specific policy parameters for the objective-specific policy that increase an expected return according to the action-value function for the corresponding objective relative to the previous policy”
As drafted, under their broadest reasonable interpretations, cover mental processes, i.e., concepts performed in the human mind (including an observation, evaluation, judgement, opinion). The above limitations in the context of this claim correspond to mental processes, e.g., evaluation and judgement with assistance of pen and paper. Additionally or alternatively, the limitations under their broadest reasonable interpretations cover mathematical concepts, i.e., mathematical relationships, mathematical formulas or equations, and mathematical calculations. The above limitations in the context of this claim additionally or alternatively encompass mathematical calculations.
Step 2A Prong Two Analysis: See corresponding analysis of claim 2.
Step 2B Analysis: See corresponding analysis of claim 2.
Regarding Claim 6,
Claim 6 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 6 is directed to a method for training a neural network system by reinforcement learning, which is directed to a process, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“wherein determining the objective-specific policy for each objective comprises determining objective-specific policy parameters for the objective-specific policy that maximize the expected return according to the separate action-value function for the corresponding objective relative to the previous policy, subject to a constraint that the objective-specific policy may not differ from the previous policy by more than a corresponding difference threshold”
As drafted, under their broadest reasonable interpretations, cover mental processes, i.e., concepts performed in the human mind (including an observation, evaluation, judgement, opinion). The above limitations in the context of this claim correspond to mental processes, e.g., evaluation and judgement with assistance of pen and paper. Additionally or alternatively, the limitations under their broadest reasonable interpretations cover mathematical concepts, i.e., mathematical relationships, mathematical formulas or equations, and mathematical calculations. The above limitations in the context of this claim additionally or alternatively encompass mathematical calculations.
Step 2A Prong Two Analysis: See corresponding analysis of claim 5.
Step 2B Analysis: See corresponding analysis of claim 5.
Regarding Claim 7,
Claim 7 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 7 is directed to a method for training a neural network system by reinforcement learning, which is directed to a process, one of the statutory categories.
Step 2A Prong One Analysis: See corresponding analysis of claim 6.
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that do not apply the exception in a meaningful way (See MPEP 2106.05(e)).
The limitations:
“wherein the corresponding difference threshold represents a relative contribution of the corresponding objective to the updated policy”
As drafted, this limitation is an additional element that does not apply the exception for the abstract ideas in a meaningful way. See MPEP 2106.05(e).
Therefore, the additional elements do not integrate the abstract ideas into a practical application.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the integration of the abstract ideas into a practical application, all of the additional elements do not apply the exception in a meaningful way. The claim is not patent eligible.
Regarding Claim 8,
Claim 8 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 8 is directed to a method for training a neural network system by reinforcement learning, which is directed to a process, one of the statutory categories.
Step 2A Prong One Analysis: See corresponding analysis of claim 2.
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that do not apply the exception in a meaningful way (See MPEP 2106.05(e)).
The limitations:
“wherein the objective-specific policies are non-parametric policies”
As drafted, this limitation is an additional element that does not apply the exception for the abstract ideas in a meaningful way. See MPEP 2106.05(e).
Therefore, the additional elements do not integrate the abstract ideas into a practical application.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the integration of the abstract ideas into a practical application, all of the additional elements do not apply the exception in a meaningful way. The claim is not patent eligible.
Regarding Claim 9,
Claim 9 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 9 is directed to a method for training a neural network system by reinforcement learning, which is directed to a process, one of the statutory categories.
Step 2A Prong One Analysis: See corresponding analysis of claim 2.
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that do not apply the exception in a meaningful way (See MPEP 2106.05(e)).
The limitations:
“wherein each objective-specific policy, qk(a|s), is determined from a scaled action-value function for the objective of the objective-specific policy, wherein the scaled action-value function is scaled by a value dependent upon a preference for the objective”
As drafted, this limitation is an additional element that does not apply the exception for the abstract ideas in a meaningful way. See MPEP 2106.05(e).
Therefore, the additional elements do not integrate the abstract ideas into a practical application.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the integration of the abstract ideas into a practical application, all of the additional elements do not apply the exception in a meaningful way. The claim is not patent eligible.
Regarding Claim 10,
Claim 10 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 10 is directed to a method for training a neural network system by reinforcement learning, which is directed to a process, one of the statutory categories.
Step 2A Prong One Analysis: See corresponding analysis of claim 9.
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that do not apply the exception in a meaningful way (See MPEP 2106.05(e)).
The limitations:
“wherein the value dependent upon a preference for the objective is dependent on a difference threshold for the objective”
As drafted, this limitation is an additional element that does not apply the exception for the abstract ideas in a meaningful way. See MPEP 2106.05(e).
Therefore, the additional elements do not integrate the abstract ideas into a practical application.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the integration of the abstract ideas into a practical application, all of the additional elements do not apply the exception in a meaningful way. The claim is not patent eligible.
Regarding Claim 11,
Claim 11 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 11 is directed to a method for training a neural network system by reinforcement learning, which is directed to a process, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“wherein each objective-specific policy, qk(a|s), is determined by calculating:
qk(a|s) = πold(a|s) exp(Qk(s,a)/ƞk) / N
where: N is a normalization constant; k is the objective; a is an action; s is a state; πold(a|s) is the previous policy; Qk(s,a) is the separate action-value function for the objective; and ƞk is a temperature parameter”
As drafted, under their broadest reasonable interpretations, cover mathematical
concepts, i.e., mathematical relationships, mathematical formulas or equations, and
mathematical calculations. The above limitations in the context of this claim encompass
mathematical calculations.
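For illustration, the calculation recited in claim 11 amounts to re-weighting the previous policy by an exponentiated, temperature-scaled action value and renormalizing. The numbers below are hypothetical and the variable names merely follow the claim's symbols:

```python
# Hypothetical values; symbols follow the claim. The previous policy is
# re-weighted by an exponentiated, temperature-scaled action value and
# renormalized to give the objective-specific policy qk(a|s).
import numpy as np

pi_old = np.array([0.5, 0.3, 0.2])   # previous policy, pi_old(a|s), over 3 actions
q_k    = np.array([1.0, 2.0, 0.5])   # Qk(s, a) for one objective k and one state s
eta_k  = 1.0                         # temperature parameter for objective k

unnormalized = pi_old * np.exp(q_k / eta_k)
N = unnormalized.sum()               # normalization constant
qk_policy = unnormalized / N         # qk(a|s); sums to 1
print(qk_policy)
```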
Step 2A Prong Two Analysis: See corresponding analysis of claim 8.
Step 2B Analysis: See corresponding analysis of claim 8.
Regarding Claim 12,
Claim 12 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 12 is directed to a method for training a neural network system by reinforcement learning, which is directed to a process, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“wherein, for each objective, k, the temperature parameter ƞk is determined by solving the following equation:
ƞk = argminƞ [ƞϵk + ƞ ∫ µ(s) log ∫ πold(a|s) exp(Qk(s,a)/ƞ) da ds]
where: ϵk is a difference threshold for the corresponding objective; and µ(s) is a visitation distribution”
As drafted, under their broadest reasonable interpretations, cover mathematical
concepts, i.e., mathematical relationships, mathematical formulas or equations, and
mathematical calculations. The above limitations in the context of this claim encompass
mathematical calculations.
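For illustration, solving for the temperature parameter is a one-dimensional minimization over ƞk. The following sketch assumes the equation takes the form quoted above and uses hypothetical values; a coarse search over candidate temperatures is used here, although claim 13's gradient descent would serve the same purpose:

```python
# Hypothetical values; assumes the quoted form of the equation, so that the
# temperature minimizes a one-dimensional function built from eps_k, the
# visitation distribution mu(s), the previous policy, and Qk.
import numpy as np

mu     = np.array([0.6, 0.4])                 # visitation distribution mu(s) over 2 states
pi_old = np.array([[0.5, 0.5], [0.7, 0.3]])   # previous policy pi_old(a|s)
q_k    = np.array([[1.0, 0.0], [0.2, 0.8]])   # Qk(s, a) for one objective k
eps_k  = 0.1                                  # difference threshold for objective k

def dual(eta):
    # eta*eps_k + eta * sum_s mu(s) * log sum_a pi_old(a|s) * exp(Qk(s,a)/eta)
    inner = np.log((pi_old * np.exp(q_k / eta)).sum(axis=1))
    return eta * eps_k + eta * (mu * inner).sum()

candidates = np.linspace(0.01, 10.0, 1000)
eta_k = candidates[np.argmin([dual(e) for e in candidates])]
print(eta_k)
```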
Step 2A Prong Two Analysis: See corresponding analysis of claim 11.
Step 2B Analysis: See corresponding analysis of claim 11.
Regarding Claim 13,
Claim 13 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 13 is directed to a method for training a neural network system by reinforcement learning, which is directed to a process, one of the statutory categories.
Step 2A Prong One Analysis: See corresponding analysis of claim 12.
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that do not apply the exception in a meaningful way (See MPEP 2106.05(e)).
The limitations:
“wherein each temperature parameter is determined via gradient descent”
As drafted, this limitation is an additional element that does not apply the exception for the abstract ideas in a meaningful way. See MPEP 2106.05(e).
Therefore, the additional elements do not integrate the abstract ideas into a practical application.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the integration of the abstract ideas into a practical application, all of the additional elements do not apply the exception in a meaningful way. The claim is not patent eligible.
Regarding Claim 14,
Claim 14 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 14 is directed to a method for training a neural network system by reinforcement learning, which is directed to a process, one of the statutory categories.
Step 2A Prong One Analysis: See corresponding analysis of claim 1.
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that do not apply the exception in a meaningful way (See MPEP 2106.05(e)).
The limitations:
“wherein each separate action-value function provides a distribution of action values for a corresponding objective of the plurality of objectives across a range of potential state-action pairs for the previous policy”
As drafted, this limitation is an additional element that does not apply the exception for the abstract ideas in a meaningful way. See MPEP 2106.05(e).
Therefore, the additional elements do not integrate the abstract ideas into a practical application.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the integration of the abstract ideas into a practical application, all of the additional elements do not apply the exception in a meaningful way. The claim is not patent eligible.
Regarding Claim 15,
Claim 15 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 15 is directed to a method for training a neural network system by reinforcement learning, which is directed to a process, one of the statutory categories.
Step 2A Prong One Analysis: See corresponding analysis of claim 1.
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that do not apply the exception in a meaningful way (See MPEP 2106.05(e)).
The limitations:
“wherein each separate action-value function outputs an action-value representing an expected cumulative discounted reward for the corresponding objective when choosing a given action in response to a given state”
As drafted, this limitation is an additional element that does not apply the exception for the abstract ideas in a meaningful way. See MPEP 2106.05(e).
Therefore, the additional elements do not integrate the abstract ideas into a practical application.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the integration of the abstract ideas into a practical application, all of the additional elements do not apply the exception in a meaningful way. The claim is not patent eligible.
Regarding Claim 16,
Claim 16 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 16 is directed to a method for training a neural network system by reinforcement learning, which is directed to a process, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“determining a separate probability distribution for each of the plurality of objectives based on the set of one or more trajectories, each separate probability distribution providing a distribution of action probabilities that would increase an expected return according to a corresponding objective relative to the policy”
“combining the separate probability distributions of the plurality of objectives”
“determining an updated multi-objective policy based on combination of the separate probability distributions for the plurality of objectives”
As drafted, under their broadest reasonable interpretations, cover mental processes, i.e., concepts performed in the human mind (including an observation, evaluation, judgement, opinion). The above limitations in the context of this claim correspond to mental processes, e.g., evaluation and judgement with assistance of pen and paper. Additionally or alternatively, the limitations under their broadest reasonable interpretations cover mathematical concepts, i.e., mathematical relationships, mathematical formulas or equations, and mathematical calculations. The above limitations in the context of this claim additionally or alternatively encompass mathematical calculations.
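By way of illustration only, determining a separate improved probability distribution per objective and combining the distributions can be expressed as a short calculation. The values below are hypothetical, and the combination rule (a product followed by renormalization) is an assumption for illustration, not taken from the claim:

```python
# Hypothetical sketch: one improved action distribution per objective, then a
# combination of the separate distributions (product, then renormalize).
import numpy as np

policy = np.array([0.4, 0.4, 0.2])             # current policy over 3 actions in one state
q = np.array([[1.0, 0.0, 0.5],                 # per-objective action values,
              [0.2, 0.9, 0.1]])                 # 2 objectives x 3 actions
temperatures = np.array([1.0, 0.5])

# Separate probability distribution per objective: shift probability toward
# actions with higher value under that objective.
per_objective = policy * np.exp(q / temperatures[:, None])
per_objective /= per_objective.sum(axis=1, keepdims=True)

# Combine the separate probability distributions for the plurality of objectives.
combined = per_objective.prod(axis=0)
combined /= combined.sum()
print(per_objective, combined)
```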
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that are mere instructions to apply an exception (See MPEP 2106.05(f)) and insignificant extra-solution activity (See MPEP 2106.05(g)).
The limitations:
“the neural network system being configured to receive an input observation characterizing a state of an environment interacted with by an agent and to select and output an action in accordance with a policy that aims to satisfy a plurality of objectives”
As drafted, are additional elements that amount to no more than mere instructions to apply an exception for the abstract ideas. See MPEP 2106.05(f).
The limitations:
“obtaining a set of one or more trajectories, each trajectory comprising a state of an environment, an action applied by the agent to the environment according to a previous policy in response to the state, and a set of rewards for the action, each reward relating to a corresponding objective of the plurality of objectives”
As drafted, are additional elements that amount to no more than insignificant extra-solution activity. See MPEP 2106.05(g).
Therefore, the additional elements do not integrate the abstract ideas into a practical application.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the integration of the abstract ideas into a practical application, all of the additional elements are “mere instructions to apply” and “insignificant extra-solution activity”. Specifically, the obtaining limitation recites the well-understood, routine, and conventional activity of storing and retrieving information in memory. MPEP 2106.05(d)(II); Versata Dev. Group, Inc. v. SAP Am., Inc., 793 F.3d 1306, 1334, 115 USPQ2d 1681, 1701 (Fed. Cir. 2015). Mere instructions to apply and insignificant extra-solution activity cannot provide an inventive concept. The claim is not patent eligible.
Regarding Claim 17,
Claim 17 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 17 is directed to a method for training a neural network system by reinforcement learning, which is directed to a process, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“determining a value function defining a value representing an expected return according to the corresponding objective that would result from the agent following the previous policy from a given state”
“determining the separate probability distribution for the objective based on the value function”
As drafted, under their broadest reasonable interpretations, cover mental processes, i.e., concepts performed in the human mind (including an observation, evaluation, judgement, opinion). The above limitations in the context of this claim correspond to mental processes, e.g., evaluation and judgement with assistance of pen and paper. Additionally or alternatively, the limitations under their broadest reasonable interpretations cover mathematical concepts, i.e., mathematical relationships, mathematical formulas or equations, and mathematical calculations. The above limitations in the context of this claim additionally or alternatively encompass mathematical calculations.
Step 2A Prong Two Analysis: See corresponding analysis of claim 16.
Step 2B Analysis: See corresponding analysis of claim 16.
Regarding Claim 18,
Claim 18 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 18 is directed to a method for training a neural network system by reinforcement learning, which is directed to a process, one of the statutory categories.
Step 2A Prong One Analysis: See corresponding analysis of claim 17.
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that do not apply the exception in a meaningful way (See MPEP 2106.05(e)).
The limitations:
“each separate probability distribution is a state-action distribution defining a distribution of probabilities of state-action pairs and the value function for each objective is a state-value function defining a value representing an expected return according to the corresponding objective that would result from the agent following the previous policy from a given state”
“each separate probability distribution is an objective-specific policy defining a distribution of probabilities of actions over states and the value function is an action-value function representing an expected return according to the corresponding objective that would result from the agent performing a given action in response to a given state according to the previous policy”
As drafted, these limitations are additional elements that do not apply the exception for the abstract ideas in a meaningful way. See MPEP 2106.05(e).
Therefore, the additional elements do not integrate the abstract ideas into a practical application.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the integration of the abstract ideas into a practical application, all of the additional elements do not apply the exception in a meaningful way. The claim is not patent eligible.
Regarding Claim 19,
Claim 19 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 19 is directed to a system comprising one or more computers and one or more storage devices storing instructions, which is directed to a machine, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“determining a separate action-value function for each of the plurality of objectives based on the set of one or more trajectories, each separate action-value function determining an action value representing an estimated return according to the corresponding objective that would result from the agent performing a given action in response to a given state according to the previous policy”
“combining the separate action-value functions of the plurality of objectives”
“determining an updated multi-objective policy based on the combination of the separate action-value functions for the plurality of objectives”
As drafted, under their broadest reasonable interpretations, cover mental processes, i.e., concepts performed in the human mind (including an observation, evaluation, judgement, opinion). The above limitations in the context of this claim correspond to mental processes, e.g., evaluation and judgement with assistance of pen and paper. Additionally or alternatively, the limitations under their broadest reasonable interpretations cover mathematical concepts, i.e., mathematical relationships, mathematical formulas or equations, and mathematical calculations. The above limitations in the context of this claim additionally or alternatively encompass mathematical calculations.
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that are mere instructions to apply an exception (See MPEP 2106.05(f)) and insignificant extra-solution activity (See MPEP 2106.05(g)).
The limitations:
“A system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform operations for training a neural network system by reinforcement learning”
“the neural network system being configured to receive an input observation characterizing a state of an environment interacted with by an agent and to select and output an action in accordance with a policy that aims to satisfy a plurality of objectives”
As drafted, are additional elements that amount to no more than mere instructions to apply an exception for the abstract ideas. See MPEP 2106.05(f).
The limitations:
“obtaining a set of one or more trajectories, each trajectory comprising a state of an environment, an action applied by the agent to the environment according to a previous policy in response to the state, and a set of rewards for the action, each reward relating to a corresponding objective of the plurality of objectives”
As drafted, are additional elements that amount to no more than insignificant extra-solution activity. See MPEP 2106.05(g).
Therefore, the additional elements do not integrate the abstract ideas into a practical application.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the integration of the abstract ideas into a practical application, all of the additional elements are “mere instructions to apply” and “insignificant extra-solution activity”. Specifically, the obtaining limitation recites the well-understood, routine, and conventional activity of storing and retrieving information in memory. MPEP 2106.05(d)(II); Versata Dev. Group, Inc. v. SAP Am., Inc., 793 F.3d 1306, 1334, 115 USPQ2d 1681, 1701 (Fed. Cir. 2015). Mere instructions to apply and insignificant extra-solution activity cannot provide an inventive concept. The claim is not patent eligible.
Regarding Claim 21,
Claim 21 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 21 is directed to a system comprising one or more computers and one or more storage devices storing instructions, which is directed to a machine, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“determining an objective-specific policy for each objective in the plurality of objectives, each objective-specific policy being determined based on the corresponding separate action-value function for the corresponding objective”
“determining the updated policy by fitting a set of policy parameters of the updated policy to a combination of the objective-specific policies”
As drafted, under their broadest reasonable interpretations, cover mental processes, i.e., concepts performed in the human mind (including an observation, evaluation, judgement, opinion). The above limitations in the context of this claim correspond to mental processes, e.g., evaluation and judgement with assistance of pen and paper. Additionally or alternatively, the limitations under their broadest reasonable interpretations cover mathematical concepts, i.e., mathematical relationships, mathematical formulas or equations, and mathematical calculations. The above limitations in the context of this claim additionally or alternatively encompass mathematical calculations.
Step 2A Prong Two Analysis: See corresponding analysis of claim 19.
Step 2B Analysis: See corresponding analysis of claim 19.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1-3, 5, 14-19, and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Schaul et al. (“Universal Value Function Approximators”) (“Schaul”) in view of Van Seijen et al. (U.S. Patent Publication No. 2018/0165603) (“Van Seijen”).
Regarding claim 1, Schaul teaches a method for training a neural network system by reinforcement learning (Schaul Section 3.1 Supervised Learning of UVFAs “We consider two approaches to learning UVFAs, first using a direct end-to-end training procedure, and second using a two-stage training procedure that exploits the factorized structure of a two-stream function approximator.”; Section 4.2 Interpolation “The data matrix M is constructed from all states, but only the goals in the training set (half of all possible states, randomly selected); a separate three-layer MLP is used for φ and ψ, and training follows our proposed two-stage approach (lines 17 to 24 in Algorithm 1 below; see also Section 3.1 and Appendix B)” Schaul provides a method for training a neural network system by reinforcement learning as summarized in Algorithm 1 and Section 3.1.), the neural network system being configured to receive an input observation characterizing a state of an environment interacted with by an agent and to select and output an action in accordance with a policy that aims to satisfy a plurality of objectives (Schaul Section 3 Universal Value Function Approximators “Figure 1 schematically depicts possible function approximators: the most direct approach, F : S × G 7→ R simply concatenates state and goal together as a joint input. The mapping from concatenated input to regression target can then be dealt with a non-linear function approximator such as a multi-layer perceptron (MLP).” Schaul, as shown in Figure 1 (see “s” displayed in Figure 1), provides an input observation characterizing a state of an environment interacted with by an agent and to select and output an action in accordance with a policy that aims to satisfy a plurality of objectives, as displayed by the outputs in Figure 1.), the method comprising: obtaining a set of one or more trajectories, each trajectory comprising a state of an environment, an action applied by the agent to the environment according to a previous policy in response to the state, and a set of rewards for the action, each reward relating to a corresponding objective of the plurality of objectives (Schaul Algorithm 1 UVFA learning from Horde targets, lines 2-4; Section 4 Supervised Learning Experiments “In this paper we will side-step the thorny issue of where goals come from, and how they are represented; instead, if not mentioned otherwise, we will explore the simple case where goals are states themselves, i.e. G ⊂ S and entering a goal is rewarded. The resulting pseudo-discount and pseudo-reward functions can then be defined as: …where γext is the external discount function.” Schaul, as shown in lines 3-4 of Algorithm 1, provides obtaining trajectories which comprise a state of an environment, an action applied by the agent to the environment according to a previous policy in response to the state, and a set of rewards for the action, each reward relating to a corresponding objective of the plurality of objectives.).
Schaul fails to teach determining a separate action-value function for each of the plurality of objectives based on the set of one or more trajectories, each separate action-value function determining an action value representing an estimated return according to the corresponding objective that would result from the agent performing a given action in response to a given state according to the previous policy; combining the separate action-value functions of the plurality of objectives; and determining an updated multi-objective policy based on the combination of the separate action-value functions for the plurality of objectives.
However, Van Seijen teaches determining a separate action-value function for each of the plurality of objectives based on the set of one or more trajectories, each separate action-value function determining an action value representing an estimated return according to the corresponding objective that would result from the agent performing a given action in response to a given state according to the previous policy (Van Seijen [0068] “Alternatively, each piece of fruit 102 may be assigned to a specific agent whose only learning objective is to estimate the optimal action-value function for reaching that piece of fruit 102… Moreover, these n agents can learn in parallel using off-policy learning. Therefore, the machine learning problem becomes much easier.”; [0073] “Actions a are taken at discrete time steps according to policy π, which maps states to actions. For example, actions a may be taken at discrete time steps t=0, 1, 2, . . . according to a policy π: X×A.fwdarw.[0,1], which defines for each action the selection probability conditioned on the state. Each policy π has a corresponding action-value function…”; [0170] “Given any fixed aggregator, stable (off-policy) learning can occur if the advisors are Markov because each advisor can be seen as an independent learner training from trajectories controlled by an arbitrary behavioral policy.”; Van Seijen provides a plurality of agents determining respective action-value functions in parallel based on their respective objective, including use of policy and trajectories.); combining the separate action-value functions of the plurality of objectives (Van Seijen [0068] “Alternatively, each piece of fruit 102 may be assigned to a specific agent whose only learning objective is to estimate the optimal action-value function for reaching that piece of fruit 102. This agent sees a reward of +1 only if its assigned fruit 102 is reached and otherwise sees no reward. The state-space for this agent can ignore all other fruit 102 because they are irrelevant for its value function. An aggregator can then make the final action selection from among the agents of each piece of fruit 102.”; [0238] “Because each agent has its own reward function, each agent i also has its own Q-value function associated with it: Q.sub.i(s, a; θ). To derive a policy from these multiple action-value functions, an aggregator receives the action-values (i.e., a single value for each action), using the same linear combination as used in the reward decomposition.” Van Seijen provides aggregating action-value functions corresponding to combining the separate action-value functions of the plurality of objectives.); and determining an updated multi-objective policy based on the combination of the separate action-value functions for the plurality of objectives (Van Seijen [0062] “Alternatively, each piece of fruit 102 may be assigned to a specific agent whose only learning objective is to estimate the optimal action-value function for reaching that piece of fruit 102. This agent sees a reward of +1 only if its assigned fruit 102 is reached and otherwise sees no reward. The state-space for this agent can ignore all other fruit 102 because they are irrelevant for its value function. 
An aggregator can then make the final action selection from among the agents of each piece of fruit 102.”; [0082] “Given a learning method that converges to the optimal policy on a single-agent MDP task, applying this method independently to each of the agents of the SoC model, the overall policy of the SoC model converges to a fixed point. Moreover, this fixed point can depend only on the SoC model and not on the particular learning algorithm that is being used.”; [0248] “In some examples, HRA can apply multi-objective learning to smooth a value function of a single reward function.” Van Seijen provides determining an optimal multi-objective policy based on action-value function aggregation.).
Schaul and Van Seijen are both considered to be analogous to the claimed invention because they are in the same field of artificial intelligence and more specifically reinforcement learning. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Schaul with the above teachings of Van Seijen. Doing so would allow for specialized agents for different parts of a task and a way of transferring trained agents (Van Seijen [0006] “This approach has at least the following advantages: 1) it allows for specialized agents for different parts of the task, and 2) it provides a new way to transfer knowledge, by transferring trained agents”).
Regarding claim 2, Schaul in view of Van Seijen teaches the method of claim 1 as discussed above in the rejection of claim 1, wherein determining an updated policy comprises: determining an objective-specific policy for each objective in the plurality of objectives, each objective-specific policy being determined based on the corresponding action-value function for the corresponding objective (Schaul Algorithm 1 UVFA learning from Horde targets, lines 6-25; Section 2 Background “Any goal admits an optimal policy…”; Section 3 Universal Value Function Approximators “Our main idea is to represent a large set of optimal value functions by a single, unified function approximator that generalizes over both states and goals. Specifically, we consider function approximators V (s, g; θ) ≈ V∗g(s) orQ(s, a, g; θ) ≈ Q∗g(s, a), parameterized by θ ∈ R d, that approximate the optimal value function both over a potentially large state space s ∈ S, and also a potentially large goal space g ∈ G. Figure 1 schematically depicts possible function approximators: the most direct approach, F : S × G 7→ R simply concatenates state and goal together as a joint input. The mapping from concatenated input to regression target can then be dealt with a non-linear function approximator such as a multi-layer perceptron (MLP).” Schaul provides an iterative process for determining policies based on objectives/goals and corresponding action-value functions, corresponding to determining an objective-specific policy for each objective in the plurality of objectives, each objective-specific policy being determined based on the corresponding action-value function for the corresponding objective); and determining the updated policy by fitting a set of policy parameters of the updated policy to a combination of the objective-specific policies (Schaul Algorithm 1 UVFA learning from Horde targets, lines 11-25; Section 2 Background “Any goal admits an optimal policy…”; Section 3 Universal Value Function Approximators “Specifically, we consider function approximators V (s, g; θ) ≈ V∗g(s) orQ(s, a, g; θ) ≈ Q∗g(s, a), parameterized by θ ∈ Rd, that approximate the optimal value function both over a potentially large state space s ∈ S, and also a potentially large goal space g ∈ G.” Schaul provides an iterative process for determining an optimal policy through Algorithm 1 including parameterization, corresponding to determining the updated policy by fitting a set of policy parameters of the updated policy to a combination of the objective-specific policies.).
It would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Schaul in view of Van Seijen for the same reasons disclosed above in the rejection of claim 1.
Regarding claim 3, Schaul in view of Van Seijen teaches the method of claim 2 as discussed above in the rejection of claim 2, wherein fitting the set of policy parameters of the updated policy to the combination of the objective-specific policies comprises determining the set of policy parameters that minimizes a difference between the updated policy and the combination of the objective-specific policies (Schaul Algorithm 1 UVFA learning from Horde targets, lines 11-25; Section 2 Background “Any goal admits an optimal policy…”; Section 3 Universal Value Function Approximators “Specifically, when using MLPs for φ and ψ, the parameters of their first layers may be shared, so that common features are learned for both states and goals. Second, the UVFA may be known to be symmetric (i.e. V∗g(s) = V∗s(g) ∀s, g), for example a UVFA that computes the distance between state s and goal g in a reversible environment. This symmetry can be exploited by using the same network φ = ψ, and a symmetric output function h (e.g. dot product). We will refer to these two cases as partially symmetric and symmetric architectures respectively.” Schaul provides parameter sharing such that common features are learned for both states and goals, as also portrayed by Algorithm 1, corresponding to determining the set of policy parameters that minimizes a difference between the updated policy and the combination of the objective-specific policies.).
It would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Schaul in view of Van Seijen for the same reasons disclosed above in the rejection of claim 2.
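By way of illustration only, fitting the updated policy's parameters to minimize a difference between the parametric policy and the combination of the objective-specific policies can be sketched as a short gradient-descent calculation. The values are hypothetical, and the choice of a KL divergence as the difference measure and of a tabular softmax parameterization are assumptions for illustration:

```python
# Hypothetical sketch: policy parameters are fit by gradient descent to minimize
# a KL divergence between the parametric updated policy and a given combination
# of the objective-specific policies.
import numpy as np

n_states, n_actions = 2, 3
rng = np.random.default_rng(1)

# Combination of the objective-specific policies, taken here as a given target.
target = rng.random((n_states, n_actions))
target /= target.sum(axis=1, keepdims=True)

theta = np.zeros((n_states, n_actions))        # policy parameters

def softmax(x):
    z = np.exp(x - x.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

for _ in range(500):
    pi = softmax(theta)
    theta -= 0.5 * (pi - target)               # gradient of KL(target || pi) w.r.t. logits

print(np.abs(softmax(theta) - target).max())   # near zero after fitting
```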
Regarding claim 5, Schaul in view of Van Seijen teaches the method of claim 2 as discussed above in the rejection of claim 2, wherein determining an objective-specific policy for each objective comprises determining objective-specific policy parameters for the objective-specific policy that increase an expected return according to the action-value function for the corresponding objective relative to the previous policy (Schaul Section 3 Universal Value Function Approximators “Specifically, we consider function approximators V (s, g; θ) ≈ V∗g(s) orQ(s, a, g; θ) ≈ Q∗g(s, a), parameterized by θ ∈ Rd, that approximate the optimal value function both over a potentially large state space s ∈ S, and also a potentially large goal space g ∈ G.”; Section 4 Supervised Learning Experiments “We ran several experiments to investigate the generalization capabilities of UVFAs. In each case, the scenario is one of supervised learning, where the ground truth values V∗g(s) or Q∗g(s, a) are only given for some training set of pairs (s, g). We trained a UVFA on that data, and evaluated its generalization capability in two ways. First, we measured the prediction error (MSE) on the value of a held-out set of unseen (s, g) pairs. Second, we measured the policy quality of a value function approximator Qˆ(s, a, g; θ) to be the true expected discounted reward according to its goal g, averaged over all start states, when following the soft-max policy of these values with temperature τ , as compared to doing the same with the optimal value function. A nonzero temperature makes this evaluation criterion change smoothly with respect to the parameters, which gives partial credit to near-perfect policies as in Figure 8. We normalize the policy quality such that optimal behavior has a score of 1, and the uniform random policy scores 0.” Schaul provides determining policy quality with respect to the action-value functions and associated parameters corresponding to determining objective-specific policy parameters for the objective-specific policy that increase the expected return according to the action-value function for the corresponding objective relative to the previous policy.).
It would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Schaul in view of Van Seijen for the same reasons disclosed above in the rejection of claim 2.
Regarding claim 14, Schaul in view of Van Seijen teaches the method of claim 1 as discussed above in the rejection of claim 1, wherein each separate action-value function provides a distribution of action values for a corresponding objective of the plurality of objectives across a range of potential state-action pairs for the previous policy (Van Seijen [0068] “Alternatively, each piece of fruit 102 may be assigned to a specific agent whose only learning objective is to estimate the optimal action-value function for reaching that piece of fruit 102… Moreover, these n agents can learn in parallel using off-policy learning. Therefore, the machine learning problem becomes much easier.”; [0072] “In a present embodiment, a single-agent task is defined by a Markov decision process (MDP), including the tuple custom-characterX, A, p, r, γcustom-character, where X is the set of states; A is the set of actions; p(x′|x, a) indicates the probability of a transition to state x′ϵX, when action aϵA is taken in state xϵX; r(x, a, x′) indicates the reward for a transition from state x to state x′ under action a; finally, the discount factor γ specifies how future rewards are weighted with respect to the immediate reward. An agent that tries to solve an MDP without decomposition can be referred to as a “flat agent”. A flat agent can be defined by an MDP including the tuple custom-characterX.sup.flat, A.sup.flat, p.sup.flat, r.sup.flat, γ.sup.flatcustom-character. A performance objective of a SoC model can be to maximize a flat return defined by r.sup.flat and γ.sup.flat.”; [0073] “Each policy π has a corresponding action-value function, q.sub.π(x, a), which gives the expected value of the return G.sub.t conditioned on the state xϵX and action” Van Seijen provides separate action-value functions and respective probability distributions, given by p(x′|x, a), based on action-value function implementation including states, actions and respective policies, corresponding to each separate action-value function provides a distribution of action values for a corresponding objective of the plurality of objectives across a range of potential state-action pairs for the previous policy.).
Schaul and Van Seijen are both considered to be analogous to the claimed invention because they are in the same field of artificial intelligence and more specifically reinforcement learning. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Schaul with the above teachings of Van Seijen. Doing so would allow for specialized agents for different parts of a task and a way of transferring trained agents (Van Seijen [0006] “This approach has at least the following advantages: 1) it allows for specialized agents for different parts of the task, and 2) it provides a new way to transfer knowledge, by transferring trained agents”).
Regarding claim 15, Schaul in view of Van Seijen teaches the method of claim 1 as discussed above in the rejection of claim 1, wherein each separate action-value function outputs an action-value representing an expected cumulative discounted reward for the corresponding objective when choosing a given action in response to a given state (Van Seijen [0154] “A trajectory custom-characterx(t), a(t), x(t+1), x(t+1), r(t)custom-character.sub.tϵ[0,T−1] is the projection into the MDP of the task episode. In some examples, a goal is to generate trajectories with high discounted cumulative reward, also called the return: Σ.sub.t=0.sup.T−1γ.sup.tr(t). To do so, one needs to find a policy π: X×A.fwdarw.[0,1] that yields optimal expected returns. Formally, this means finding a policy that maximizes:” Van Seijen provides an action-value function representing the expected cumulative discounted reward for the corresponding objective when choosing a given action in response to a given state.).
Schaul and Van Seijen are both considered to be analogous to the claimed invention because they are in the same field of artificial intelligence and more specifically reinforcement learning. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Schaul with the above teachings of Van Seijen. Doing so would allow for specialized agents for different parts of a task and a way of transferring trained agents (Van Seijen [0006] “This approach has at least the following advantages: 1) it allows for specialized agents for different parts of the task, and 2) it provides a new way to transfer knowledge, by transferring trained agents”).
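As a non-limiting illustration of the discounted cumulative reward (return) described in the Van Seijen passage quoted above, the following Python sketch computes the sum over t of γ^t·r(t) for per-objective reward sequences; the reward values and objective names are hypothetical.

    def discounted_return(rewards, gamma):
        # Return of one trajectory: sum over t of gamma**t * r(t).
        return sum((gamma ** t) * r for t, r in enumerate(rewards))

    # Hypothetical per-objective rewards observed along a single trajectory.
    rewards_per_objective = {"objective_a": [0.0, 0.0, 1.0], "objective_b": [0.5, 0.0, 0.0]}
    returns = {k: discounted_return(r, gamma=0.9) for k, r in rewards_per_objective.items()}
    print(returns)    # approximately {'objective_a': 0.81, 'objective_b': 0.5}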
Regarding claim 16, Schaul teaches a method for training a neural network system by reinforcement learning (Schaul Section 3.1 Supervised Learning of UVFAs “We consider two approaches to learning UVFAs, first using a direct end-to-end training procedure, and second using a two-stage training procedure that exploits the factorized structure of a two-stream function approximator.”; Section 4.2 Interpolation “The data matrix M is constructed from all states, but only the goals in the training set (half of all possible states, randomly selected); a separate three-layer MLP is used for φ and ψ, and training follows our proposed two-stage approach (lines 17 to 24 in Algorithm 1 below; see also Section 3.1 and Appendix B)” Schaul provides a method for training a neural network system by reinforcement learning as summarized in Algorithm 1 and Section 3.1.), the neural network system being configured to receive an input observation characterizing a state of an environment interacted with by an agent and to select and output an action in accordance with a policy that aims to satisfy a plurality of objectives (Schaul Section 3 Universal Value Function Approximators “Figure 1 schematically depicts possible function approximators: the most direct approach, F : S × G 7→ R simply concatenates state and goal together as a joint input. The mapping from concatenated input to regression target can then be dealt with a non-linear function approximator such as a multi-layer perceptron (MLP).” Schaul, as shown in Figure 1 (see “s” displayed in Figure), provides an input observation characterizing a state of an environment interacted with by an agent and to select and output an action in accordance with a policy that aims to satisfy a plurality of objectives, as displayed by the outputs in Figure 1.), the method comprising: obtaining a set of one or more trajectories, each trajectory comprising a state of an environment, an action applied by the agent to the environment according to a previous policy in response to the state, and a set of rewards for the action, each reward relating to a corresponding objective of the plurality of objectives (Schaul Algorithm 1 UVFA learning from Horde targets, lines 2-4; Section 4 Supervised Learning Experiments “In this paper we will side-step the thorny issue of where goals come from, and how they are represented; instead, if not mentioned otherwise, we will explore the simple case where goals are states themselves, i.e. G ⊂ S and entering a goal is rewarded. The resulting pseudo-discount and pseudo-reward functions can then be defined as: …where γext is the external discount function.” Schaul, as shown in lines 3-4 of Algorithm 1, provides obtaining trajectories which comprise a state of an environment, an action applied by the agent to the environment according to a previous policy in response to the state, and a set of rewards for the action, each reward relating to a corresponding objective of the plurality of objectives.).
Schaul fails to teach determining a separate probability distribution for each of the plurality of objectives based on the set of one or more trajectories, each separate probability distribution providing a distribution of action probabilities that would increase an expected return according to a corresponding objective relative to the policy; combining the separate probability distributions of the plurality of objectives; and determining an updated multi-objective policy based on combination of the separate probability distributions for the plurality of objectives.
However, Van Seijen teaches determining a separate probability distribution for each of the plurality of objectives based on the set of one or more trajectories, each separate probability distribution providing a distribution of action probabilities that would increase an expected return according to a corresponding objective relative to the policy (Van Seijen [0068] “Alternatively, each piece of fruit 102 may be assigned to a specific agent whose only learning objective is to estimate the optimal action-value function for reaching that piece of fruit 102… Moreover, these n agents can learn in parallel using off-policy learning. Therefore, the machine learning problem becomes much easier.”; [0072] “X is the set of states; A is the set of actions; p(x′|x, a) indicates the probability of a transition to state x′ϵX”; [0073] “Actions a are taken at discrete time steps according to policy π, which maps states to actions. For example, actions a may be taken at discrete time steps t=0, 1, 2, . . . according to a policy π: X×A.fwdarw.[0,1], which defines for each action the selection probability conditioned on the state. Each policy π has a corresponding action-value function…”; [0170] “Given any fixed aggregator, stable (off -policy) learning can occur if the advisors are Markov because each advisor can be seen as an independent learner training from trajectories controlled by an arbitrary behavioral policy.”; Van Seijen provides a plurality of agents determining respective action-value functions in parallel based on their respective objective, including use of policy and trajectories and probability distribution formula p(x′|x, a), corresponding to determining a separate probability distribution for each of the plurality of objectives based on the set of one or more trajectories, each separate probability distribution providing a distribution of action probabilities that would increase an expected return according to a corresponding objective relative to the policy.); combining the separate probability distributions of the plurality of objectives (Van Seijen [0068] “Alternatively, each piece of fruit 102 may be assigned to a specific agent whose only learning objective is to estimate the optimal action-value function for reaching that piece of fruit 102. This agent sees a reward of +1 only if its assigned fruit 102 is reached and otherwise sees no reward. The state-space for this agent can ignore all other fruit 102 because they are irrelevant for its value function. An aggregator can then make the final action selection from among the agents of each piece of fruit 102.”; [0238] “Because each agent has its own reward function, each agent i also has its own Q-value function associated with it: Q.sub.i(s, a; θ). To derive a policy from these multiple action-value functions, an aggregator receives the action-values (i.e., a single value for each action), using the same linear combination as used in the reward decomposition.” Van Seijen provides aggregating action-value functions which include probability distributions corresponding to combining the separate probability distributions of the plurality of objectives.); and determining an updated multi-objective policy based on combination of the separate probability distributions for the plurality of objectives (Van Seijen [0062] “Alternatively, each piece of fruit 102 may be assigned to a specific agent whose only learning objective is to estimate the optimal action-value function for reaching that piece of fruit 102. 
This agent sees a reward of +1 only if its assigned fruit 102 is reached and otherwise sees no reward. The state-space for this agent can ignore all other fruit 102 because they are irrelevant for its value function. An aggregator can then make the final action selection from among the agents of each piece of fruit 102.”; [0082] “Given a learning method that converges to the optimal policy on a single-agent MDP task, applying this method independently to each of the agents of the SoC model, the overall policy of the SoC model converges to a fixed point. Moreover, this fixed point can depend only on the SoC model and not on the particular learning algorithm that is being used.”; [0248] “In some examples, HRA can apply multi-objective learning to smooth a value function of a single reward function.” Van Seijen provides determining an optimal multi-objective policy based on action-value function aggregation, corresponding to determining an updated multi-objective policy based on combination of the separate probability distributions for the plurality of objectives.).
Schaul and Van Seijen are both considered to be analogous to the claimed invention because they are in the same field of artificial intelligence and more specifically reinforcement learning. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Schaul with the above teachings of Van Seijen. Doing so would allow for specialized agents for different parts of a task and a way of transferring trained agents (Van Seijen [0006] “This approach has at least the following advantages: 1) it allows for specialized agents for different parts of the task, and 2) it provides a new way to transfer knowledge, by transferring trained agents”).
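As a non-limiting illustration of the per-objective decomposition and aggregation described in the Van Seijen passages quoted above (e.g., [0238]), the following Python sketch combines separate per-objective action values with a linear aggregator and derives a single action-selection distribution. The weights and values are hypothetical assumptions for illustration, not the reference's implementation.

    import numpy as np

    # Hypothetical action values for one state: one row per objective, one column per action.
    q_per_objective = np.array([
        [0.2, 1.0, 0.1, 0.0],
        [0.0, 0.3, 0.9, 0.1],
        [0.5, 0.2, 0.2, 0.4],
    ])

    # Linear-combination weights mirroring a reward decomposition (assumed equal here).
    weights = np.full(q_per_objective.shape[0], 1.0 / q_per_objective.shape[0])

    q_combined = weights @ q_per_objective                     # aggregated action values
    policy = np.exp(q_combined) / np.exp(q_combined).sum()     # soft-max over the aggregate
    greedy_action = int(q_combined.argmax())
    print(q_combined, policy, greedy_action)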
Regarding claim 17, Schaul in view of Van Seijen teaches the method of claim 16 as discussed above in the rejection of claim 16, wherein: determining a separate probability distribution for each of the plurality of objectives comprises, for each objective: determining a value function defining a value representing an expected return according to the corresponding objective that would result from the agent following the previous policy from a given state (Van Seijen [0073] “Actions a are taken at discrete time steps according to policy π, which maps states to actions. For example, actions a may be taken at discrete time steps t=0, 1, 2, . . . according to a policy π: X×A.fwdarw.[0,1], which defines for each action the selection probability conditioned on the state. Each policy π has a corresponding action-value function, q.sub.π(x, a), which gives the expected value of the return G.sub.t conditioned on the state xϵX and action aϵA: q.sub.π(x, a)=custom-character{G.sub.t|X.sub.t=x, A.sub.t=a, π}. A goal is to maximize the discounted sum of rewards, also referred to as the return: G.sub.t:=Σ.sub.k−1.sup.∞γ.sup.k−1R.sub.t+k.” Van Seijen provides determining expected return from previous policies including determining a corresponding action-value function defining a value representing an expected return according to a previous policy π.); and determining the separate probability distribution for the objective based on the value function (Van Seijen [0072] “In a present embodiment, a single-agent task is defined by a Markov decision process (MDP), including the tuple custom-characterX, A, p, r, γcustom-character, where X is the set of states; A is the set of actions; p(x′|x, a) indicates the probability of a transition to state x′ϵX, when action aϵA is taken in state xϵX; r(x, a, x′) indicates the reward for a transition from state x to state x′ under action a; finally, the discount factor γ specifies how future rewards are weighted with respect to the immediate reward. An agent that tries to solve an MDP without decomposition can be referred to as a “flat agent”. A flat agent can be defined by an MDP including the tuple custom-characterX.sup.flat, A.sup.flat, p.sup.flat, r.sup.flat, γ.sup.flatcustom-character. A performance objective of a SoC model can be to maximize a flat return defined by r.sup.flat and γ.sup.flat.” Van Seijen provides respective probability distributions, given by p(x′|x, a), based on action-value function implementation including a value function for an objective, corresponding to determining the separate probability distribution for the objective based on the value function.).
Schaul and Van Seijen are both considered to be analogous to the claimed invention because they are in the same field of artificial intelligence and more specifically reinforcement learning. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Schaul with the above teachings of Van Seijen. Doing so would allow for specialized agents for different parts of a task and a way of transferring trained agents (Van Seijen [0006] “This approach has at least the following advantages: 1) it allows for specialized agents for different parts of the task, and 2) it provides a new way to transfer knowledge, by transferring trained agents”).
Regarding claim 18, Schaul in view of Van Seijen teaches the method of claim 17 as discussed above in the rejection of claim 17, wherein: each separate probability distribution is a state-action distribution defining a distribution of probabilities of state-action pairs and the value function for each objective is a state-value function defining a value representing an expected return according to the corresponding objective that would result from the agent following the previous policy from a given state; or each separate probability distribution is an objective-specific policy defining a distribution of probabilities of actions over states and the value function is an action-value function representing an expected return according to the corresponding objective that would result from the agent performing a given action in response to a given state according to the previous policy (Van Seijen [0072] “In a present embodiment, a single-agent task is defined by a Markov decision process (MDP), including the tuple custom-characterX, A, p, r, γcustom-character, where X is the set of states; A is the set of actions; p(x′|x, a) indicates the probability of a transition to state x′ϵX, when action aϵA is taken in state xϵX; r(x, a, x′) indicates the reward for a transition from state x to state x′ under action a; finally, the discount factor γ specifies how future rewards are weighted with respect to the immediate reward. An agent that tries to solve an MDP without decomposition can be referred to as a “flat agent”. A flat agent can be defined by an MDP including the tuple custom-characterX.sup.flat, A.sup.flat, p.sup.flat, r.sup.flat, γ.sup.flatcustom-character. A performance objective of a SoC model can be to maximize a flat return defined by r.sup.flat and γ.sup.flat.”; [0073] “Each policy π has a corresponding action-value function, q.sub.π(x, a), which gives the expected value of the return G.sub.t conditioned on the state xϵX and action” Van Seijen provides respective probability distributions, given by p(x′|x, a), based on action-value function implementation including states and actions, and an expected return corresponding to each separate probability distribution is a state-action distribution defining a distribution of probabilities of state-action pairs and the value function for each objective is a state-value function defining a value representing an expected return according to the corresponding objective that would result from the agent following the previous policy from a given state.).
Schaul and Van Seijen are both considered to be analogous to the claimed invention because they are in the same field of artificial intelligence and more specifically reinforcement learning. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Schaul with the above teachings of Van Seijen. Doing so would allow for specialized agents for different parts of a task and a way of transferring trained agents (Van Seijen [0006] “This approach has at least the following advantages: 1) it allows for specialized agents for different parts of the task, and 2) it provides a new way to transfer knowledge, by transferring trained agents”).
Regarding claim 19, it is the system embodiment of claim 1 with similar limitations to claim 1 and is rejected using the same reasoning found above in the rejection of claim 1. Further, Schaul teaches a system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform operations (Schaul Section 5.2 Ms Pacman “Scaling up, we applied our approach to learn a UVFA for the Atari 2600 game Ms. Pacman. We used a hand-crafted goal space G: for each pellet on the screen, we defined eating it as an individual goal g ∈ R 2 , which is represented by the pellet’s (x, y) coordinate on-screen. Following Algorithm 1, a Horde with 150 demons was trained.”; Figure 3 “Note that two-stage training is impossible for the concatenated network. Dotted lines indicate minimum and maximum observed values among 10 random seeds. The computational cost of the matrix factorization is orders of magnitudes smaller than the regression and omitted in these plots.” Schaul provides measuring computational cost and implementing training on a system corresponding to a system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform operations.).
It would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Schaul in view of Van Seijen for the same reasons disclosed above in the rejection of claim 1.
Regarding claim 21, Schaul in view of Van Seijen teaches the system of claim 19 as discussed above in the rejection of claim 19, wherein determining an updated policy comprises: determining an objective-specific policy for each objective in the plurality of objectives, each objective-specific policy being determined based on the corresponding separate action-value function for the corresponding objective (Van Seijen [0063] “For instance, SoC splits a single-agent problem into multiple parallel, communicating agents with simpler and more focused, but different objectives (e.g., skills). An introductory example is detailed below with reference to FIG. 1.”; [0073] “Each policy π has a corresponding action-value function, q.sub.π(x, a), which gives the expected value of the return G.sub.t conditioned on the state xϵX and action aϵA: q.sub.π(x, a)=custom-character{G.sub.t|X.sub.t=x, A.sub.t=a, π}. A goal is to maximize the discounted sum of rewards, also referred to as the return: G.sub.t:=Σ.sub.k−1.sup.∞γ.sup.k−1R.sub.t+k.” Van Seijen provides a plurality of policies with corresponding action-value functions and differing objectives, corresponding to determining an objective-specific policy for each objective in the plurality of objectives, each objective-specific policy being determined based on the corresponding separate action-value function for the corresponding objective.); and determining the updated policy by fitting a set of policy parameters of the updated policy to a combination of the objective-specific policies (Van Seijen [0073] “Actions a are taken at discrete time steps according to policy π, which maps states to actions. For example, actions a may be taken at discrete time steps t=0, 1, 2, according to a policy π: X×A.fwdarw.[0,1], which defines for each action the selection probability conditioned on the state. Each policy π has a corresponding action-value function, q.sub.π(x, a), which gives the expected value of the return G.sub.t conditioned on the state xϵX and action”; [0163] “Algorithm diversification where each algorithm intends to learn a global policy but with different optimizations, hypotheses, regularizers, and/or parameters.” Van Seijen provides updating policy parameters to determine optimal policies, corresponding to determining the updated policy by fitting a set of policy parameters of the updated policy to a combination of the objective-specific policies.).
Schaul and Van Seijen are both considered to be analogous to the claimed invention because they are in the same field of artificial intelligence and more specifically reinforcement learning. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Schaul with the above teachings of Van Seijen. Doing so would allow for specialized agents for different parts of a task and a way of transferring trained agents (Van Seijen [0006] “This approach has at least the following advantages: 1) it allows for specialized agents for different parts of the task, and 2) it provides a new way to transfer knowledge, by transferring trained agents”).
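As a non-limiting illustration of fitting a set of policy parameters to a combination of objective-specific policies, the following Python sketch fits the logits of a soft-max policy to a fixed mixture of two hypothetical objective-specific action distributions by gradient descent on the cross-entropy. The distributions, weights, learning rate, and iteration count are assumptions for illustration only.

    import numpy as np

    def softmax(x):
        z = x - x.max()
        e = np.exp(z)
        return e / e.sum()

    # Hypothetical objective-specific action distributions for a single state.
    q_objective_1 = np.array([0.7, 0.2, 0.1])
    q_objective_2 = np.array([0.1, 0.3, 0.6])
    target = 0.5 * q_objective_1 + 0.5 * q_objective_2   # combined target (equal weighting assumed)

    # Fit parametric policy (logits) to the combined target by minimizing cross-entropy.
    logits = np.zeros(3)
    learning_rate = 0.5
    for _ in range(200):
        p = softmax(logits)
        logits -= learning_rate * (p - target)           # gradient of cross-entropy w.r.t. logits

    print(softmax(logits))    # approaches the combined target distribution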
Claims 4, 6-7, and 9-10 are rejected under 35 U.S.C. 103 as being unpatentable over Schaul et al. (“Universal Value Function Approximators”) (“Schaul”) in view of Van Seijen et al. (U.S. Patent Publication No. 2018/0165603) (“Van Seijen”) in further view of Haarnoja et al. (“Composable Deep Reinforcement Learning for Robotic Manipulation”) (“Haarnoja”).
Regarding claim 4, Schaul in view of Van Seijen teaches the method of claim 2 as discussed above in the rejection of claim 2, but fails to teach wherein the set of policy parameters for the updated policy are constrained such that a difference between the updated policy and the previous policy cannot exceed a trust region threshold.
However, Haarnoja teaches wherein the set of policy parameters for the updated policy are constrained such that a difference between the updated policy and the previous policy cannot exceed a trust region threshold (Haarnoja Section IV.A Learning with Multiple Objectives “How should we combine the policies π∗i ? A simple approach is to approximate the optimal Q-function of the composed task Q∗C with the mean of the individual Q functions: (Eq.6) where QΣ represents an approximation to the true optimal Q-function of the composed task Q∗C. One can then extract a policy πΣ from this approximate Q-function using any policy-extraction algorithm. In conventional reinforcement learning without entropy regularization, we cannot make any guarantees about how close QΣ is to Q∗C. However, we show in the next section that, if the constituent policies represent optimal maximum entropy policies, then we can bound the difference between the value of the approximate policy QπΣ C and the optimal value Q∗C for the combined task.”; Section V.A Experimental setup “We parameterize the Q-function and the policies with a 2-layer neural network with 100 or 200 units in each layer and rectifier linear activations.” Haarnoja provides bounding the difference between approximate and optimal polices for a combined task including the use of parameterization corresponding to constraining policy parameters such that the difference between the updated policy and the previous policy cannot exceed a trust region threshold.).
Schaul, Van Seijen and Haarnoja are all considered to be analogous to the claimed invention because they are in the same field of artificial intelligence and more specifically reinforcement learning policies. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Schaul in view of Van Seijen with the above teachings of Haarnoja. Doing so would allow for new policies that can be optimally bounded in terms of divergence between composed policies resulting in a large gain in efficiency over training from scratch (Haarnoja Abstract “Second, we show that policies learned with soft Q-learning can be composed to create new policies, and that the optimality of the resulting policy can be bounded in terms of the divergence between the composed policies. This compositionality provides an especially valuable tool for real-world manipulation, where constructing new policies by composing existing skills can provide a large gain in efficiency over training from scratch.”).
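As a non-limiting illustration of constraining an updated policy so that its difference from the previous policy cannot exceed a trust-region threshold, the following Python sketch measures the KL divergence between a candidate updated policy and the previous policy and compares it to a threshold; the distributions and threshold value are hypothetical assumptions for illustration.

    import numpy as np

    def kl_divergence(p, q):
        # KL(p || q) for discrete probability distributions with full support.
        p = np.asarray(p, dtype=float)
        q = np.asarray(q, dtype=float)
        return float(np.sum(p * np.log(p / q)))

    previous_policy = np.array([0.5, 0.3, 0.2])   # hypothetical previous policy for one state
    updated_policy = np.array([0.4, 0.4, 0.2])    # hypothetical candidate updated policy

    trust_region_threshold = 0.05                 # assumed threshold for illustration
    print(kl_divergence(updated_policy, previous_policy) <= trust_region_threshold)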
Regarding claim 6, Schaul in view of Van Seijen teaches the method of claim 5 as discussed above in the rejection of claim 5, but fails to teach wherein determining the objective-specific policy for each objective comprises determining objective-specific policy parameters for the objective-specific policy that maximize the expected return according to the separate action-value function for the corresponding objective relative to the previous policy, subject to a constraint that the objective- specific policy may not differ from the previous policy by more than a corresponding difference threshold.
However, Haarnoja teaches wherein determining the objective-specific policy for each objective comprises determining objective-specific policy parameters for the objective-specific policy that maximize the expected return according to the separate action-value function for the corresponding objective relative to the previous policy (Haarnoja Section I Introduction “As a consequence, the policy will automatically direct exploration into regions of higher expected return.”; Section III.B Maximum Entropy Reinforcement Learning “The standard RL objective seeks a policy that maximizes the expected sum of rewards t E(st,at)∼ρπ [r(st, at)]. In this paper, we consider a more general maximum entropy objective [6], [32], [25], [31], which favors stochastic policies by augmenting the objective with an expected entropy over ρπ:”; Section IV Compositionality of Maximum Entropy Policies “In this section, we will discuss how soft Q-functions for different tasks can be combined additively to solve multiple tasks simultaneously. While simply adding Q-functions does not generally give the Q-function for the combined task, we show that the regret from using the policy obtained by adding the constituent Q-functions together is upper bounded by the difference between the two policies that are being composed. Intuitively, if two composed policies agree on an action, or if they are indifferent towards each other’s actions, then the composed policy will be closer to the optimal one.” Haarnoja provides reward maximizing corresponding to determining objective-specific policy parameters for the objective-specific policy that maximize the expected return according to the action-value function for the corresponding objective relative to the previous policy, including a plurality of separate q-functions which may be combined, corresponding to separate action-value functions.), subject to a constraint that the objective- specific policy may not differ from the previous policy by more than a corresponding difference threshold (Haarnoja Abstract “Second, we show that policies learned with soft Q-learning can be composed to create new policies, and that the optimality of the resulting policy can be bounded in terms of the divergence between the composed policies.”; Section IV.A Learning with Multiple Objectives “How should we combine the policies π∗i ? A simple approach is to approximate the optimal Q-function of the composed task Q∗C with the mean of the individual Q functions: (Eq.6) where QΣ represents an approximation to the true optimal Q-function of the composed task Q∗C. One can then extract a policy πΣ from this approximate Q-function using any policy-extraction algorithm. In conventional reinforcement learning without entropy regularization, we cannot make any guarantees about how close QΣ is to Q∗C. However, we show in the next section that, if the constituent policies represent optimal maximum entropy policies, then we can bound the difference between the value of the approximate policy QπΣ C and the optimal value Q∗C for the combined task.”; Section V.A Experimental setup “We parameterize the Q-function and the policies with a 2-layer neural network with 100 or 200 units in each layer and rectifier linear activations.” Haarnoja provides bounding divergence between previous policies corresponding to a constraint that the objective- specific policy may not differ from the previous policy by more than a corresponding difference threshold.).
Schaul, Van Seijen and Haarnoja are all considered to be analogous to the claimed invention because they are in the same field of artificial intelligence and more specifically reinforcement learning policies. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Schaul in view of Van Seijen with the above teachings of Haarnoja. Doing so would allow for new policies that can be optimally bounded in terms of divergence between composed policies resulting in a large gain in efficiency over training from scratch (Haarnoja Abstract “Second, we show that policies learned with soft Q-learning can be composed to create new policies, and that the optimality of the resulting policy can be bounded in terms of the divergence between the composed policies. This compositionality provides an especially valuable tool for real-world manipulation, where constructing new policies by composing existing skills can provide a large gain in efficiency over training from scratch.”).
Regarding claim 7, Schaul in view of Van Seijen in further view of Haarnoja teaches the method of claim 6 as discussed above in the rejection of claim 6, wherein the corresponding difference threshold represents a relative contribution of the corresponding objective to the updated policy (Haarnoja Section IV.A Learning with Multiple Objectives “In conventional reinforcement learning without entropy regularization, we cannot make any guarantees about how close QΣ is to Q∗C. However, we show in the next section that, if the constituent policies represent optimal maximum entropy policies, then we can bound the difference between the value of the approximate policy QπΣ C and the optimal value Q∗C for the combined task.” Haarnoja provides a policy bound for divergence corresponding to a difference threshold which represents a relative contribution of the corresponding objective to the updated policy.).
Schaul, Van Seijen and Haarnoja are all considered to be analogous to the claimed invention because they are in the same field of artificial intelligence and more specifically reinforcement learning policies. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Schaul in view of Van Seijen with the above teachings of Haarnoja. Doing so would allow for new policies that can be optimally bounded in terms of divergence between composed policies resulting in a large gain in efficiency over training from scratch (Haarnoja Abstract “Second, we show that policies learned with soft Q-learning can be composed to create new policies, and that the optimality of the resulting policy can be bounded in terms of the divergence between the composed policies. This compositionality provides an especially valuable tool for real-world manipulation, where constructing new policies by composing existing skills can provide a large gain in efficiency over training from scratch.”).
Regarding claim 9, Schaul in view of Van Seijen teaches the method of claim 2 as discussed above in the rejection of claim 2, but fails to teach wherein each objective-specific policy, qk(a|s), is determined from a scaled action-value function for the objective of the objective-specific policy, wherein the scaled action-value function is scaled by a value dependent upon a preference for the objective.
However, Haarnoja teaches wherein each objective-specific policy, qk(a|s), is determined from a scaled action-value function for the objective of the objective-specific policy, wherein the scaled action-value function is scaled by a value dependent upon a preference for the objective (Haarnoja Section IV.A Learning with Multiple Objectives “How should we combine the policies π∗i ? A simple approach is to approximate the optimal Q-function of the composed task Q∗C with the mean of the individual Q functions: (Eq.6) where QΣ represents an approximation to the true optimal Q-function of the composed task Q∗C. One can then extract a policy πΣ from this approximate Q-function using any policy-extraction algorithm. In conventional reinforcement learning without entropy regularization, we cannot make any guarantees about how close QΣ is to Q∗C. However, we show in the next section that, if the constituent policies represent optimal maximum entropy policies, then we can bound the difference between the value of the approximate policy QπΣ C and the optimal value Q∗C for the combined task.” Haarnoja, as shown in Equation 6, provides a scaled action-value function for the objective of the objective-specific policy, wherein the scaled action-value function is scaled by a value dependent upon a preference for the objective.).
Schaul, Van Seijen and Haarnoja are all considered to be analogous to the claimed invention because they are in the same field of artificial intelligence and more specifically reinforcement learning policies. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Schaul in view of Van Seijen with the above teachings of Haarnoja. Doing so would allow for new policies that can be optimally bounded in terms of divergence between composed policies resulting in a large gain in efficiency over training from scratch (Haarnoja Abstract “Second, we show that policies learned with soft Q-learning can be composed to create new policies, and that the optimality of the resulting policy can be bounded in terms of the divergence between the composed policies. This compositionality provides an especially valuable tool for real-world manipulation, where constructing new policies by composing existing skills can provide a large gain in efficiency over training from scratch.”).
Regarding claim 10, Schaul in view of Van Seijen in further view of Haarnoja teaches the method of claim 9 as discussed above in the rejection of claim 9, wherein the value dependent upon a preference for the objective is dependent on a difference threshold for the objective (Haarnoja Section IV.B Bounding the Sub-Optimality of Composed Policies “To understand what we can expect from the performance of the composed policy πΣ that is induced by QΣ, we analyze how the value of πΣ relates to the unknown optimal Q-function Q∗C corresponding to the composed reward rC. For simplicity, we consider the special case where α = 1 and we compose just two optimal policies, given by π∗i, with Q-functions Q∗i and reward functions ri. Extending the proof to more than two policies and other values of α is straightforward. We start by introducing a lower bound for the optimal combined Q-function in terms of Q∗i and π∗i .” Haarnoja provides a value dependent upon a preference for the objective is dependent on the difference threshold for the objective.).
Schaul, Van Seijen and Haarnoja are all considered to be analogous to the claimed invention because they are in the same field of artificial intelligence and more specifically reinforcement learning policies. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Schaul in view of Van Seijen with the above teachings of Haarnoja. Doing so would allow for new policies that can be optimally bounded in terms of divergence between composed policies resulting in a large gain in efficiency over training from scratch (Haarnoja Abstract “Second, we show that policies learned with soft Q-learning can be composed to create new policies, and that the optimality of the resulting policy can be bounded in terms of the divergence between the composed policies. This compositionality provides an especially valuable tool for real-world manipulation, where constructing new policies by composing existing skills can provide a large gain in efficiency over training from scratch.”).
Claims 8 and 11-13 are rejected under 35 U.S.C. 103 as being unpatentable over Schaul et al. (“Universal Value Function Approximators”) (“Schaul”) in view of Van Seijen et al. (U.S. Patent Publication No. 2018/0165603) (“Van Seijen”) in further view of Abdolmaleki et al. (“Relative Entropy Regularized Policy Iteration”) (“Abdolmaleki”).
Regarding claim 8, Schaul in view of Van Seijen teaches the method of claim 2 as discussed above in the rejection of claim 2, but fails to teach wherein the objective-specific policies are non-parametric policies.
However, Abdolmaleki teaches wherein the objective-specific policies are non-parametric policies (Abdolmaleki Abstract “The result is a simple procedure consisting of three steps: i) policy evaluation by estimating a parametric action-value function; ii) policy improvement via the estimation of a local non-parametric policy; and iii) generalization by fitting a parametric policy.” Abdolmaleki provides objective-specific policies are non-parametric policies.).
Schaul, Van Seijen and Abdolmaleki are all considered to be analogous to the claimed invention because they are in the same field of artificial intelligence and more specifically reinforcement learning policies. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Schaul in view of Van Seijen with the above teachings of Abdolmaleki. Doing so would allow for separating the neural network fitting from the reinforcement learning procedure and enabling regularization (Abdolmaleki Section 4 Policy Improvement “This split of the improvement step into sample based estimation followed by supervised learning allows us to separate the neural network fitting from the RL procedure, enabling regularization in the latter”).
Regarding claim 11, Schaul in view of Van Seijen in further view of Abdolmaleki teaches the method of claim 8 as discussed above in the rejection of claim 8, wherein each objective-specific policy, qk(a|s), is determined by calculating:
qk(a|s) = (1/N) πold(a|s) exp(Qk(s,a)/ƞk)
where: N is a normalization constant; k is the objective; a is an action; s is a state; πold(a|s) is the previous policy; Qk(s,a) is the separate action-value function for the objective; and ƞk is a temperature parameter (Abdolmaleki Section B.2.1 Dual Function Derivation “Setting it to zero and rearranging terms we get q(a|s) = … However the last exponential term is a normalization constant for q.” Abdolmaleki provides the policy equation on the top of page 17, which is the equation of claim 11, including the normalization constant.).
Schaul, Van Seijen and Abdolmaleki are all considered to be analogous to the claimed invention because they are in the same field of artificial intelligence and more specifically reinforcement learning policies. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Schaul in view of Van Seijen with the above teachings of Abdolmaleki. Doing so would allow for separating the neural network fitting from the reinforcement learning procedure and enabling regularization (Abdolmaleki Section 4 Policy Improvement “This split of the improvement step into sample based estimation followed by supervised learning allows us to separate the neural network fitting from the RL procedure, enabling regularization in the latter”).
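As a non-limiting illustration of the equation of claim 11 reproduced above, the following Python sketch evaluates the normalized, exponentially re-weighted previous policy for one objective over a discrete action set; the previous policy, action values, and temperature are hypothetical values for illustration only.

    import numpy as np

    def objective_specific_policy(pi_old, q_k, eta_k):
        # q_k(a|s) proportional to pi_old(a|s) * exp(Q_k(s, a) / eta_k), normalized.
        weights = pi_old * np.exp(q_k / eta_k)
        return weights / weights.sum()            # division by the sum implements 1/N

    pi_old = np.array([0.25, 0.5, 0.25])          # hypothetical previous policy for one state
    q_k = np.array([1.0, 0.2, -0.5])              # hypothetical action values for objective k
    print(objective_specific_policy(pi_old, q_k, eta_k=0.5))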
Regarding claim 12, Schaul in view of Van Seijen in further view of Abdolmaleki teaches the method of claim 11 as discussed above in the rejection of claim 11, wherein, for each objective, k, the temperature parameter ƞk is determined by solving the following equation:
ƞk = arg minƞ { ƞ·ϵk + ƞ·∫ µ(s) log( ∫ πold(a|s) exp(Qk(s,a)/ƞ) da ) ds }
where: ϵk is a difference threshold for the corresponding objective; and µ(s) is a visitation distribution (Abdolmaleki Section B.1 E-Step (Step 2 in main paper) “Note that In our case q(a|s) is a non-parametric and samples based distribution, and we can solve this constraint optimization in close form for each sample state s, … and easily optimize for the correct η using the convex dual function.” Abdolmaleki provides the equation on the bottom of page 15, which is the equation of claim 12.).
Schaul, Van Seijen and Abdolmaleki are all considered to be analogous to the claimed invention because they are in the same field of artificial intelligence and more specifically reinforcement learning policies. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Schaul in view of Van Seijen with the above teachings of Abdolmaleki. Doing so would allow for separating the neural network fitting from the reinforcement learning procedure and enabling regularization (Abdolmaleki Section 4 Policy Improvement “This split of the improvement step into sample based estimation followed by supervised learning allows us to separate the neural network fitting from the RL procedure, enabling regularization in the latter”).
Regarding claim 13, Schaul in view of Van Seijen in further view of Abdolmaleki teaches the method of claim 12 as discussed above in the rejection of claim 12, wherein each temperature parameter is determined via gradient descent (Abdolmaleki Section 4.1 Finding action weights (Step 2) “The temperature η corresponding to the constraint can be found automatically by solving the following convex dual function alongside our policy optimization:… We found that, in practice, this optimization can be performed via a few steps of gradient descent on η for each batch after the weight calculation” Abdolmaleki provides the temperature parameter is determined via gradient descent.).
Schaul, Van Seijen and Abdolmaleki are all considered to be analogous to the claimed invention because they are in the same field of artificial intelligence and more specifically reinforcement learning policies. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Schaul in view of Van Seijen with the above teachings of Abdolmaleki. Doing so would allow for separating the neural network fitting from the reinforcement learning procedure and enabling regularization (Abdolmaleki Section 4 Policy Improvement “This split of the improvement step into sample based estimation followed by supervised learning allows us to separate the neural network fitting from the RL procedure, enabling regularization in the latter”).
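As a non-limiting illustration of determining a temperature parameter by minimizing a convex dual function via gradient descent, as in the Abdolmaleki passages quoted above, the following Python sketch evaluates a sample-based dual of the form ƞ·ϵ + ƞ·(mean over sampled states of log-mean-exp(Q/ƞ)) and takes finite-difference gradient steps on ƞ. The sampled action values, threshold ϵ, step size, and iteration count are assumptions for illustration, not the reference's implementation.

    import numpy as np

    def dual(eta, q_samples, epsilon_k):
        # Sample-based dual: eta*epsilon_k + eta * mean_s log mean_a exp(Q_k(s, a) / eta).
        inner = np.log(np.mean(np.exp(q_samples / eta), axis=1))
        return eta * epsilon_k + eta * np.mean(inner)

    rng = np.random.default_rng(0)
    q_samples = rng.normal(size=(32, 5))          # hypothetical sampled action values
    epsilon_k = 0.1                               # assumed difference threshold for objective k

    eta, step_size, h = 1.0, 0.05, 1e-4
    for _ in range(100):
        grad = (dual(eta + h, q_samples, epsilon_k) - dual(eta - h, q_samples, epsilon_k)) / (2 * h)
        eta = max(eta - step_size * grad, 1e-3)   # keep the temperature positive
    print(eta)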
Response to Arguments
Regarding the rejection applied under 35 U.S.C. 101, Applicant firstly asserts that any alleged abstract ideas are integrated into a practical application that constitutes a specific technical improvement (“Remarks”, Page 1).
Applicant further asserts that the claims recite a method that requires obtaining sets of trajectories, determining multiple, complex action-value functions for a neural network system, and determining an updated policy by fitting parameters. Applicant further asserts that this is a computationally intensive process that cannot practically be performed in the human mind or using pen and paper. Applicant further asserts that the scale of data and complexity of calculations inherent in training a neural network are beyond human cognitive ability (“Remarks”, Page 1).
However, the claims do not recite any size or scale of data. Further, while complex, the limitations of determining action-value functions and determining an updated policy are mathematical calculations and are therefore analyzed as abstract ideas. Further, the “obtaining” limitation recites the well-understood, routine, and conventional activity of storing and retrieving information in memory. MPEP 2106.05(d)(II); Versata Dev. Group, Inc. v. SAP Am., Inc., 793 F.3d 1306, 1334, 115 USPQ2d 1681, 1701 (Fed. Cir. 2015).
Applicant further asserts that the claims are similar to that of Example 39, and therefore the 35 U.S.C. 101 rejection should be withdrawn (“Remarks”, Pages 1-2).
However, Example 39 merely provides an example claim which does not recite any abstract ideas. As discussed above, the amended independent claims recite mathematical calculations and therefore abstract ideas. Therefore, the claims are not similar to those of Example 39.
Applicant further asserts that the claims are directed to a specific technical solution to a problem rooted in the field of artificial intelligence, and specifically, how to effectively train a single reinforcement learning agent to balance multiple competing objectives. Applicant further asserts that the claimed architecture in which a final policy is determined is a specific improvement to the functioning of a computer itself, making the computer more efficient (“Remarks”, Pages 2-3).
However, as discussed in MPEP 2106.05(f), "claiming the improved speed or efficiency inherent with applying the abstract idea on a computer" does not integrate a judicial exception into a practical application or provide an inventive concept. Therefore, an improved efficiency with respect to applying the abstract ideas discussed above does not integrate a judicial exception into a practical application or provide an inventive concept.
Regarding the rejection applied under 35 U.S.C. 102/103, Applicant’s arguments with respect to claims have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to KURT NICHOLAS PRESSLY whose telephone number is (703)756-4639. The examiner can normally be reached M-F 8-4.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kamran Afshar can be reached at (571) 272-7796. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/KURT NICHOLAS PRESSLY/ Examiner, Art Unit 2125
/KAMRAN AFSHAR/ Supervisory Patent Examiner, Art Unit 2125