DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This action is in response to the application filed on 4/03/2023. Claims 1-13 and 15-21 are pending and have been examined.
Information Disclosure Statement
The information disclosure statements (IDS) submitted on 1/09/2024, 11/01/2024, and 3/24/2025 are in compliance with the provisions of 37 CFR 1.97, 1.98, and MPEP § 609. They have been placed in the application file, and the information referred to therein has been considered as to the merits.
Drawings
The drawings are objected to because in Figure 2 COST(S) is shown with reference number 108 when in the specification and other drawings it is designated with reference number 110. Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. The figure or figure number of an amended drawing should not be labeled as “amended.” If a drawing figure is to be canceled, the appropriate figure must be removed from the replacement sheet, and where necessary, the remaining figures must be renumbered and appropriate changes made to the brief description of the several views of the drawings for consistency. Additional replacement sheets may be necessary to show the renumbering of the remaining figures. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.
Claim Objections
Claims 4 and 19 are objected to because of the following informalities: Claims 4 and 19 recite “using the constraint fitting function for each constraint to determining the constraint objective,” which should read “using the constraint fitting function for each constraint to determine the constraint objective.” Appropriate correction is required.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 9 and 16 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.
Regarding Claims 9 and 16:
Claim 9 recites "the improved action policy" in the last line. There is insufficient antecedent basis for this limitation in the claim. No "improved action policy" appears earlier in Claim 9; the claim recites "an improved action selection policy," but not "an improved action policy."
Claim 16 recites “the method comprising” in the last line of the first paragraph. There is insufficient antecedent basis for this limitation in the claim.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-13 and 15-21 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Regarding Claim 1:
Step 1:
The claim recites a method, which is one of the four statutory categories of patentable subject matter (a process).
Step 2A prong 1:
The claim recites an abstract idea of controlling an agent to perform one or more tasks while interacting with an environment, each task having at least one respective reward associated with performance of the task, and wherein performing the one or more tasks is subject to one or more constraints on one or more respective costs, which amounts to a mental process because it can be performed in the human mind.
The claim recites an abstract idea of selecting, at each of a plurality of time steps, an action to be performed by the agent in the environment by receiving an observation from the environment and a vector of preference settings sampled from the preference distribution defined by the preference neural network… using an output of the action selection policy neural network to select the action to be performed by the agent, and receiving one or more rewards and one or more costs, which amounts to a mental process because it can be performed in the human mind.
Step 2A prong 2:
The additional element of maintaining an action selection policy neural network, having a plurality of action selection policy neural network parameters, for selecting an action in response to an observation from the environment, wherein the action selection policy neural network is conditioned on a preference vector input comprising a vector of preference settings each defining a relative priority of one of the rewards or costs with respect to the others, does not integrate the abstract idea into a practical application because maintaining a neural network model in storage is considered insignificant extra-solution activity. See MPEP 2106.05(g).
The additional element of maintaining a preference neural network, having a plurality of preference neural network parameters, defining a preference distribution for the vector of preference settings, does not integrate the abstract idea into a practical application because maintaining a neural network model in storage is considered insignificant extra-solution activity. See MPEP 2106.05(g).
The additional element of providing the vector of preference settings and data from the observation to the action selection policy neural network does not integrate the abstract idea into a practical application because providing data to a neural network is considered insignificant extra-solution activity of “mere data gathering.” See MPEP 2106.05(g).
The additional element of fixing the preference neural network parameters and adjusting the action selection policy neural network parameters to jointly optimize a set of target objectives, wherein each of the one or more rewards and each of the one or more costs has a respective one of the target objectives, is generally linked to the abstract idea and therefore does not integrate the abstract idea into a practical application. See MPEP 2106.05(h).
The additional element of fixing the action selection policy neural network parameters and adjusting the preference neural network parameters to optimize a constraint objective dependent on whether each of the one or more constraints is satisfied by the one or more respective costs is generally linked to the abstract idea and therefore does not integrate the abstract idea into a practical application. See MPEP 2106.05(h).
Step 2B:
The additional element of maintaining an action selection policy neural network, having a plurality of action selection policy neural network parameters, for selecting an action in response to an observation from the environment, wherein the action selection policy neural network is conditioned on a preference vector input comprising a vector of preference settings each defining a relative priority of one of the rewards or costs with respect to the others, does not amount to significantly more because the additional element is insignificant extra-solution activity and further is a well-understood, routine, and conventional activity. See MPEP 2106.05(d)(II)(iv) (storing and retrieving information in memory; Versata Dev. Group, Inc. v. SAP Am., Inc., 793 F.3d 1306, 1334, 115 USPQ2d 1681, 1701 (Fed. Cir. 2015)).
The additional element of maintaining a preference neural network, having a plurality of preference neural network parameters, defining a preference distribution for the vector of preference settings, does not amount to significantly more because the additional element is insignificant extra-solution activity and further is a well-understood, routine, and conventional activity. See MPEP 2106.05(d)(II)(iv) (storing and retrieving information in memory; Versata Dev. Group, Inc. v. SAP Am., Inc., 793 F.3d 1306, 1334, 115 USPQ2d 1681, 1701 (Fed. Cir. 2015)).
The additional element of providing the vector of preference settings and data from the observation to the action selection policy neural network does not amount to significantly more because the additional element is insignificant extra-solution activity and further is a well-understood, routine, and conventional activity. See MPEP 2106.05(d)(II)(i) (buySAFE, Inc. v. Google, Inc., 765 F.3d 1350, 1355, 112 USPQ2d 1093, 1096 (Fed. Cir. 2014) (computer receives and sends information over a network)).
The additional element of fixing the preference neural network parameters and adjusting the action selection policy neural network parameters to jointly optimize a set of target objectives, wherein each of the one or more rewards and each of the one or more costs has a respective one of the target objectives, is generally linked to the abstract idea and therefore does not amount to significantly more. See MPEP 2106.05(h).
The additional element of fixing the action selection policy neural network parameters and adjusting the preference neural network parameters to optimize a constraint objective dependent on whether each of the one or more constraints is satisfied by the one or more respective costs is generally linked to the abstract idea and therefore does not amount to significantly more. See MPEP 2106.05(h).
Therefore, the claim is ineligible.
Regarding Claim 2:
Claim 2 incorporates the rejection of Claim 1. The claim does not recite any additional elements that integrate the abstract idea into a practical application or amount to significantly more. Specifically, the claim recites a further additional element, that the action selection policy neural network parameters are adjusted to jointly optimize a set of target objectives towards the Pareto front, and wherein the set of objectives is optimized towards a region of the Pareto front in which the constraints are satisfied, which is generally linked to the abstract idea. See MPEP 2106.05(h). The claim is ineligible.
Regarding Claim 3:
Claim 3, which incorporates the rejection of Claim 1, recites a further abstract idea of determining the constraint objective by determining, for each of the one or more constraints, a value of a respective constraint fitting function, dependent on the vector of preference settings, that evaluates whether the respective constraint is satisfied by the respective cost, which is a mathematical concept. The claim does not recite any additional elements that integrate the abstract idea into a practical application or amount to significantly more. The claim is ineligible.
Regarding Claim 4:
Claim 4, which incorporates the rejection of Claim 1, recites a further abstract idea of determining a value of a constraint fitting function for a constraint from the Q-value output of the Q-value neural network for the target objective of the cost constrained by the constraint, averaged over observations and actions, and using the constraint fitting function for each constraint to determine the constraint objective, which are mathematical operations. The claim does not recite any additional elements that integrate the abstract idea into a practical application or amount to significantly more. Specifically, the claim recites a further additional element, maintaining, for each target objective, a respective Q-value neural network having a plurality of Q-value neural network parameters and a Q-value output for defining an action-value score dependent upon an action, an observation, and a vector of preference settings, which is insignificant extra-solution activity of storing and retrieving data from memory (MPEP 2106.05(g)) and further is a well-understood, routine, and conventional activity (MPEP 2106.05(d)(II)(iv), storing and retrieving information in memory; Versata Dev. Group, Inc. v. SAP Am., Inc., 793 F.3d 1306, 1334, 115 USPQ2d 1681, 1701 (Fed. Cir. 2015)). The claim is ineligible.
Regarding Claim 5:
Claim 5, which incorporates the rejection of Claim 4, recites a further abstract idea in that the value of the constraint fitting function is dependent upon a difference between the Q-value output of the Q-value neural network for the target objective of the cost constrained by the constraint, and a constraint value for the constraint, which is a mathematical operation. The claim does not recite any additional elements that integrate the abstract idea into a practical application or amount to significantly more. The claim is ineligible.
Regarding Claim 6:
Claim 6, which incorporates the rejection of Claim 1, recites a further abstract idea of sampling a vector of preference settings from the preference distribution defined by the preference neural network whilst the preference neural network parameters are fixed, which amounts to a mental process because it can be performed in the human mind. The claim does not recite any additional elements that integrate the abstract idea into a practical application or amount to significantly more. Specifically, the claim recites a further additional element, adjusting the action selection policy neural network parameters to jointly optimize the set of target objectives according to relative preferences amongst the target objectives defined by the relative priorities of the rewards or constraints associated with the target objectives as defined by the sampled vector of preference settings, which is generally linked to the abstract idea. See MPEP 2106.05(h). The claim is ineligible.
Regarding Claim 7:
Claim 7 incorporates the rejection of Claim 1. The claim does not recite any additional elements that integrate the abstract idea into a practical application or amount to significantly more. Specifically, the claim recites a further additional element, maintaining, for each target objective, a respective Q-value neural network having a plurality of Q-value neural network parameters and a Q-value output for defining an action-value score dependent upon an action, an observation, and a vector of preference settings, which is insignificant extra-solution activity of storing and retrieving data from memory (MPEP 2106.05(g)) and further is a well-understood, routine, and conventional activity (MPEP 2106.05(d)(II)(iv), storing and retrieving information in memory; Versata Dev. Group, Inc. v. SAP Am., Inc., 793 F.3d 1306, 1334, 115 USPQ2d 1681, 1701 (Fed. Cir. 2015)). Additionally, the claim recites a further additional element, using each Q-value neural network to optimize the respective target objective for each of the one or more rewards and of the one or more costs, which is generally linked to the abstract idea. See MPEP 2106.05(h). The claim is ineligible.
Regarding Claim 8:
Claim 8, which incorporates the rejection of Claim 7, recites a further abstract idea of sampling a vector of preference settings from the preference distribution defined by the preference neural network whilst the preference neural network parameters are fixed, which amounts to a mental process because it can be performed in the human mind. The claim does not recite any additional elements that integrate the abstract idea into a practical application or amount to significantly more. Specifically, the claim recites a further additional element, providing the sampled vector of preference settings to each Q-value neural network, which is insignificant extra-solution activity of “mere data gathering” (MPEP 2106.05(g)) and further is a well-understood, routine, and conventional activity (MPEP 2106.05(d)(II)(i); buySAFE, Inc. v. Google, Inc., 765 F.3d 1350, 1355, 112 USPQ2d 1093, 1096 (Fed. Cir. 2014) (computer receives and sends information over a network)). Additionally, the claim recites further additional elements, adjusting the Q-value neural network parameters of each Q-value neural network to optimize a Q-learning objective dependent upon a measure of a difference between the Q-value output and a respective Q-value target, wherein the respective Q-value target is determined in accordance with an action selection policy defined by a current value of the action selection policy neural network parameters, and adjusting the action selection policy neural network parameters using the Q-value output of each Q-value neural network after the Q-value neural network parameters have been adjusted, which are generally linked to the abstract idea. See MPEP 2106.05(h). The claim is ineligible.
Regarding Claim 9:
Claim 9, which incorporates the rejection of Claim 8, recites a further abstract idea of determining, for each target objective, an improved action selection policy, wherein the improved action selection policy is determined by scaling the output of the action selection policy neural network by a policy improvement factor for the target objective dependent upon the Q-value output from the Q-value neural network for the target objective, which amounts to a mental process because it can be performed in the human mind. The claim does not recite any additional elements that integrate the abstract idea into a practical application or amount to significantly more. Specifically, the claim recites a further additional element, adjusting the action selection policy neural network parameters to optimize a combined policy improvement objective dependent upon a measure of a difference between a current action selection policy defined by current parameters of the action selection policy neural network and the improved action policy for each target objective, which is generally linked to the abstract idea. See MPEP 2106.05(h). The claim is ineligible.
Regarding Claim 10:
Claim 10 incorporates the rejection of Claim 9. The claim further recites a description of the current action selection policy, the improved action policy, and the measure of a difference from the determining and adjusting steps of Claim 9, and is ineligible for the same reasons as set forth for Claim 9. The claim does not recite any additional elements that integrate the abstract idea into a practical application or amount to significantly more. The claim is ineligible.
Regarding Claim 11:
Claim 11, which incorporates the rejection of Claim 9, recites a further abstract idea of obtaining the temperature parameter by optimizing a temperature objective function dependent upon, for each target objective, the Q-value output from the Q-value neural network for the target objective averaged over preference settings sampled from the preference distribution, observations received from the environment, and actions for the observations sampled from the current action selection policy, which is a mathematical operation. The claim does not recite any additional elements that integrate the abstract idea into a practical application or amount to significantly more. The claim is ineligible.
Regarding Claim 12:
Claim 12, which incorporates the rejection of Claim 1, recites a further abstract idea of initializing the preference distribution such that a sampled vector of preference settings sampled from the preference distribution is likely to prioritize costs over rewards, which is a mental process because it can be performed in the human mind. The claim does not recite any additional elements that integrate the abstract idea into a practical application or amount to significantly more. The claim is ineligible.
Regarding Claim 13:
Claim 13 incorporates the rejection of Claim 1. The claim further recites a description of the environment and constraints from the controlling-an-agent step of Claim 1 and is ineligible for the same reasons as set forth for Claim 1. The claim does not recite any additional elements that integrate the abstract idea into a practical application or amount to significantly more. The claim is ineligible.
Regarding Claim 15:
Step 1:
The claim recites non-transitory computer-readable storage media, which fall within one of the four statutory categories of patentable subject matter (a manufacture).
Step 2A prong 1:
The claim recites an abstract idea of controlling an agent to perform one or more tasks while interacting with an environment, each task having at least one respective reward associated with performance of the task, and wherein performing the one or more tasks is subject to one or more constraints on one or more respective costs, which amounts to a mental process because it can be performed in the human mind.
The claim recites an abstract idea of selecting, at each of a plurality of time steps, an action to be performed by the agent in the environment by receiving an observation from the environment and a vector of preference settings sampled from the preference distribution defined by the preference neural network… using an output of the action selection policy neural network to select the action to be performed by the agent, and receiving one or more rewards and one or more costs, which amounts to a mental process because it can be performed in the human mind.
Step 2A prong 2:
The additional element of the non-transitory computer-readable storage media is a generic computer component used to implement the abstract idea and therefore does not integrate the abstract idea into a practical application. See MPEP 2106.05(f).
The additional element of the one or more computers is a generic computer component used to implement the abstract idea and therefore does not integrate the abstract idea into a practical application. See MPEP 2106.05(f).
The additional element of maintaining an action selection policy neural network, having a plurality of action selection policy neural network parameters, for selecting an action in response to an observation from the environment, wherein the action selection policy neural network is conditioned on a preference vector input comprising a vector of preference settings each defining a relative priority of one of the rewards or costs with respect to the others, does not integrate the abstract idea into a practical application because maintaining a neural network model in storage is considered insignificant extra-solution activity. See MPEP 2106.05(g).
The additional element of maintaining a preference neural network, having a plurality of preference neural network parameters, defining a preference distribution for the vector of preference settings, does not integrate the abstract idea into a practical application because maintaining a neural network model in storage is considered insignificant extra-solution activity. See MPEP 2106.05(g).
The additional element of providing the vector of preference settings and data from the observation to the action selection policy neural network does not integrate the abstract idea into a practical application because providing data to a neural network is considered insignificant extra-solution activity of “mere data gathering.” See MPEP 2106.05(g).
The additional element of fixing the preference neural network parameters and adjusting the action selection policy neural network parameters to jointly optimize a set of target objectives, wherein each of the one or more rewards and each of the one or more costs has a respective one of the target objectives, is generally linked to the abstract idea and therefore does not integrate the abstract idea into a practical application. See MPEP 2106.05(h).
The additional element of fixing the action selection policy neural network parameters and adjusting the preference neural network parameters to optimize a constraint objective dependent on whether each of the one or more constraints is satisfied by the one or more respective costs is generally linked to the abstract idea and therefore does not integrate the abstract idea into a practical application. See MPEP 2106.05(h).
Step 2B:
The additional element of the non-transitory computer-readable storage media is a generic computer component used to implement the abstract idea and therefore does not amount to significantly more. See MPEP 2106.05(f).
The additional element of the one or more computers is a generic computer component used to implement the abstract idea and therefore does not amount to significantly more. See MPEP 2106.05(f).
The additional element of maintaining an action selection policy neural network, having a plurality of action selection policy neural network parameters, for selecting an action in response to an observation from the environment, wherein the action selection policy neural network is conditioned on a preference vector input comprising a vector of preference settings each defining a relative priority of one of the rewards or costs with respect to the others, does not amount to significantly more because the additional element is insignificant extra-solution activity and further is a well-understood, routine, and conventional activity. See MPEP 2106.05(d)(II)(iv) (storing and retrieving information in memory; Versata Dev. Group, Inc. v. SAP Am., Inc., 793 F.3d 1306, 1334, 115 USPQ2d 1681, 1701 (Fed. Cir. 2015)).
The additional element of maintaining a preference neural network, having a plurality of preference neural network parameters, defining a preference distribution for the vector of preference settings, does not amount to significantly more because the additional element is insignificant extra-solution activity and further is a well-understood, routine, and conventional activity. See MPEP 2106.05(d)(II)(iv) (storing and retrieving information in memory; Versata Dev. Group, Inc. v. SAP Am., Inc., 793 F.3d 1306, 1334, 115 USPQ2d 1681, 1701 (Fed. Cir. 2015)).
The additional element of providing the vector of preference settings and data from the observation to the action selection policy neural network does not amount to significantly more because the additional element is insignificant extra-solution activity and further is a well-understood, routine, and conventional activity. See MPEP 2106.05(d)(II)(i) (buySAFE, Inc. v. Google, Inc., 765 F.3d 1350, 1355, 112 USPQ2d 1093, 1096 (Fed. Cir. 2014) (computer receives and sends information over a network)).
The additional element of fixing the preference neural network parameters and adjusting the action selection policy neural network parameters to jointly optimize a set of target objectives, wherein each of the one or more rewards and each of the one or more costs has a respective one of the target objectives, is generally linked to the abstract idea and therefore does not amount to significantly more. See MPEP 2106.05(h).
The additional element of fixing the action selection policy neural network parameters and adjusting the preference neural network parameters to optimize a constraint objective dependent on whether each of the one or more constraints is satisfied by the one or more respective costs is generally linked to the abstract idea and therefore does not amount to significantly more. See MPEP 2106.05(h).
Therefore, the claim is ineligible.
Regarding Claim 16:
Step 1:
The claim recites a system, which is one of the four statutory categories of patentable subject matter (a machine).
Step 2A prong 1:
The claim recites an abstract idea of controlling an agent to perform one or more tasks while interacting with an environment, each task having at least one respective reward associated with performance of the task, and wherein performing the one or more tasks is subject to one or more constraints on one or more respective costs, which amounts to a mental process because it can be performed in the human mind.
The claim recites an abstract idea of selecting, at each of a plurality of time steps, an action to be performed by the agent in the environment by receiving an observation from the environment and a vector of preference settings sampled from the preference distribution defined by the preference neural network… using an output of the action selection policy neural network to select the action to be performed by the agent, and receiving one or more rewards and one or more costs, which amounts to a mental process because it can be performed in the human mind.
Step 2A prong 2:
The additional element of the one or more computers is a generic computer component used to implement the abstract idea and therefore does not integrate the abstract idea into a practical application. See MPEP 2106.05(f).
The additional element of the one or more storage devices is a generic computer component used to implement the abstract idea and therefore does not integrate the abstract idea into a practical application. See MPEP 2106.05(f).
The additional element of maintaining an action selection policy neural network, having a plurality of action selection policy neural network parameters, for selecting an action in response to an observation from the environment, wherein the action selection policy neural network is conditioned on a preference vector input comprising a vector of preference settings each defining a relative priority of one of the rewards or costs with respect to the others, does not integrate the abstract idea into a practical application because maintaining a neural network model in storage is considered insignificant extra-solution activity. See MPEP 2106.05(g).
The additional element of maintaining a preference neural network, having a plurality of preference neural network parameters, defining a preference distribution for the vector of preference settings, does not integrate the abstract idea into a practical application because maintaining a neural network model in storage is considered insignificant extra-solution activity. See MPEP 2106.05(g).
The additional element of providing the vector of preference settings and data from the observation to the action selection policy neural network does not integrate the abstract idea into a practical application because providing data to a neural network is considered insignificant extra-solution activity of “mere data gathering.” See MPEP 2106.05(g).
The additional element of fixing the preference neural network parameters and adjusting the action selection policy neural network parameters to jointly optimize a set of target objectives, wherein each of the one or more rewards and each of the one or more costs has a respective one of the target objectives, is generally linked to the abstract idea and therefore does not integrate the abstract idea into a practical application. See MPEP 2106.05(h).
The additional element of fixing the action selection policy neural network parameters and adjusting the preference neural network parameters to optimize a constraint objective dependent on whether each of the one or more constraints is satisfied by the one or more respective costs is generally linked to the abstract idea and therefore does not integrate the abstract idea into a practical application. See MPEP 2106.05(h).
Step 2B:
The additional element of the one or more computers is a generic computer component used to implement the abstract idea and therefore does not amount to significantly more. See MPEP 2106.05(f).
The additional element of the one or more storage devices is a generic computer component used to implement the abstract idea and therefore does not amount to significantly more. See MPEP 2106.05(f).
The additional element of maintaining an action selection policy neural network, having a plurality of action selection policy neural network parameters, for selecting an action in response to an observation from the environment, wherein the action selection policy neural network is conditioned on a preference vector input comprising a vector of preference settings each defining a relative priority of one of the rewards or costs with respect to the others, does not amount to significantly more because the additional element is insignificant extra-solution activity and further is a well-understood, routine, and conventional activity. See MPEP 2106.05(d)(II)(iv) (storing and retrieving information in memory; Versata Dev. Group, Inc. v. SAP Am., Inc., 793 F.3d 1306, 1334, 115 USPQ2d 1681, 1701 (Fed. Cir. 2015)).
The additional element of maintaining a preference neural network, having a plurality of preference neural network parameters, defining a preference distribution for the vector of preference settings, does not amount to significantly more because the additional element is insignificant extra-solution activity and further is a well-understood, routine, and conventional activity. See MPEP 2106.05(d)(II)(iv) (storing and retrieving information in memory; Versata Dev. Group, Inc. v. SAP Am., Inc., 793 F.3d 1306, 1334, 115 USPQ2d 1681, 1701 (Fed. Cir. 2015)).
The additional element of providing the vector of preference settings and data from the observation to the action selection policy neural network does not amount to significantly more because the additional element is insignificant extra-solution activity and further is a well-understood, routine, and conventional activity. See MPEP 2106.05(d)(II)(i) (buySAFE, Inc. v. Google, Inc., 765 F.3d 1350, 1355, 112 USPQ2d 1093, 1096 (Fed. Cir. 2014) (computer receives and sends information over a network)).
The additional element of fixing the preference neural network parameters and adjusting the action selection policy neural network parameters to jointly optimize a set of target objectives, wherein each of the one or more rewards and each of the one or more costs has a respective one of the target objectives, is generally linked to the abstract idea and therefore does not amount to significantly more. See MPEP 2106.05(h).
The additional element of fixing the action selection policy neural network parameters and adjusting the preference neural network parameters to optimize a constraint objective dependent on whether each of the one or more constraints is satisfied by the one or more respective costs is generally linked to the abstract idea and therefore does not amount to significantly more. See MPEP 2106.05(h).
Therefore, the claim is ineligible.
Regarding Claim 17:
Claim 17 incorporates the rejection of Claim 16. The claim does not recite any additional elements that integrate the abstract idea into a practical application or amount to significantly more. Specifically, the claim recites a further additional element, that the action selection policy neural network parameters are adjusted to jointly optimize a set of target objectives towards the Pareto front, and wherein the set of objectives is optimized towards a region of the Pareto front in which the constraints are satisfied, which is generally linked to the abstract idea. See MPEP 2106.05(h). The claim is ineligible.
Regarding Claim 18:
Claim 18, which incorporates the rejection of Claim 16, recites a further abstract idea of determining the constraint objective by determining, for each of the one or more constraints, a value of a respective constraint fitting function, dependent on the vector of preference settings, that evaluates whether the respective constraint is satisfied by the respective cost, which is a mathematical concept. The claim does not recite any additional elements that integrate the abstract idea into a practical application or amount to significantly more. The claim is ineligible.
Regarding Claim 19:
Claim 19, which incorporates the rejection of Claim 16, recites a further abstract idea of determining a value of a constraint fitting function for a constraint from the Q-value output of the Q-value neural network for the target objective of the cost constrained by the constraint, averaged over observations and actions, and using the constraint fitting function for each constraint to determine the constraint objective, which are mathematical operations. The claim does not recite any additional elements that integrate the abstract idea into a practical application or amount to significantly more. Specifically, the claim recites a further additional element, maintaining, for each target objective, a respective Q-value neural network having a plurality of Q-value neural network parameters and a Q-value output for defining an action-value score dependent upon an action, an observation, and a vector of preference settings, which is insignificant extra-solution activity of storing and retrieving data from memory (MPEP 2106.05(g)) and further is a well-understood, routine, and conventional activity (MPEP 2106.05(d)(II)(iv), storing and retrieving information in memory; Versata Dev. Group, Inc. v. SAP Am., Inc., 793 F.3d 1306, 1334, 115 USPQ2d 1681, 1701 (Fed. Cir. 2015)). The claim is ineligible.
Regarding Claim 20:
Claim 20, which incorporates the rejection of Claim 19, recites a further abstract idea in that the value of the constraint fitting function is dependent upon a difference between the Q-value output of the Q-value neural network for the target objective of the cost constrained by the constraint, and a constraint value for the constraint, which is a mathematical operation. The claim does not recite any additional elements that integrate the abstract idea into a practical application or amount to significantly more. The claim is ineligible.
Regarding Claim 21:
Claim 21, which incorporates the rejection of Claim 16, recites a further abstract idea of sampling a vector of preference settings from the preference distribution defined by the preference neural network whilst the preference neural network parameters are fixed, which amounts to a mental process because it can be performed in the human mind. The claim does not recite any additional elements that integrate the abstract idea into a practical application or amount to significantly more. Specifically, the claim recites a further additional element, adjusting the action selection policy neural network parameters to jointly optimize the set of target objectives according to relative preferences amongst the target objectives defined by the relative priorities of the rewards or constraints associated with the target objectives as defined by the sampled vector of preference settings, which is generally linked to the abstract idea. See MPEP 2106.05(h). The claim is ineligible.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claims 1-13 and 15-21 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Hasan et al., “Dynamic multi-objective optimisation using deep reinforcement learning: benchmark, algorithm and an application to identify vulnerable zones based on water quality” (cited in the applicant's IDS), hereinafter “Hasan”.
Regarding Claim 1, Hasan teaches:
A computer implemented (Hasan uses Python with libraries, demonstrating that Hasan's method is performed on a computer; p. 123, col. 1, paragraph 1, “created based on MATLAB (2018a) with python version (3.5) with the libraries (i.e. tensorflow and keras)”) method of controlling an agent to perform one or more tasks while interacting with an environment (p. 107, Abstract, “a dynamic multiobjective testbed has been created which is a modified version of the conventional deep-sea treasure (DST) hunt testbed. This modified testbed fulfils the changing aspects of the dynamic environment in terms of the characteristics where the changes occur based on time”), each task having at least one respective reward associated with performance of the task (p. 115, Equation 13, showing a reward function with rewards for a task), and wherein performing the one or more tasks is subject to one or more constraints on one or more respective costs (the cost is the penalty values and the constraint is maximizing reward; p. 113, col. 2, paragraph 4, “This health parameter creates another objective that needs to be satisfied over time to survive” and “These two submarines can damage the health of the agent by −2”; p. 117, col. 1, paragraph 6, “the agent gets −0.04 point for every step until it gets the vulnerable zone where it gets +1 as a reward and −1 for the resilient zone”), the method comprising:
maintaining an action selection policy neural network (p. 123, Figure 17, bottom of figure: the Action and ORM blocks are the action selection policy neural network; the ORM is a neural network as shown in p. 122, Figure 15, consisting of four different layers), having a plurality of action selection policy neural network parameters (p. 120, paragraph 2, “Fig. 15 shows the relation mapping for different objectives based on selected policies where the convolutional layers direct to form a compromised solution. This process is predominantly executed by the neural network of the deep layer and by adjusting the weights and bias for each neuron”), for selecting an action (p. 123, bottom of Figure 17, argmax is used on values from the ORM to select the action) in response to an observation from the environment (Figure 17: the environment state, which is the observations, leads to the action selection policy neural network), wherein the action selection policy neural network is conditioned on a preference vector input comprising a vector of preference settings each defining a relative priority of one of the rewards or costs with respect to the others (Figure 17: the preference vector input is p1-pn, and each parity value defines a relative priority; p. 120, col. 2, paragraph 1, “balancing the objectives using objective relation mapping based on a dynamic weight (parity value)”);
maintaining a preference neural network, having a plurality of preference neural network parameters (Figure 17: the DQNs are the preference neural network; p. 124, Algorithm, line 20, “Train the Q network with pre-defined loss”; the DQNs are trained, showing that they have parameters), defining a preference distribution for the vector of preference settings (the vector of preference settings influences the outputs of the DQNs, which are combined into a Q value; the distribution is shown in the different selection of dynamic weights (parity values) per episode: “The obtained Q value by the agent is the average value from the DQNs that characterise all the objectives. Therefore, the selected objectives are the representation of the compound structure of all the objectives. Needless to mention, this value is made by the Q values in a finite horizon. In other words, this can be represented as the most compromising solutions that the agent could achieve in a particular episode.”);
selecting, at each of a plurality of time steps, an action to be performed by the agent in the environment by receiving an observation from the environment and a vector of preference settings sampled from the preference distribution defined by the preference neural network, providing the vector of preference settings and data from the observation to the action selection policy neural network (p. 119, col. 1, paragraph 2, “The agent starts from the start state S1…”; p. 120, col. 1, paragraph 8, “The set of Q(s, a) is updated with vectors from S′ at time step t for the first time after performing an action a in state S”; p. 123, Figure 17 shows the action following the pipeline in which the environment state observation and the created vector of preference values p are input into the action selection policy neural network (the ORM and Action blocks)), using an output of the action selection policy neural network to select the action to be performed by the agent (p. 123, Figure 17 shows the Action for the agent as output into the learning module), and receiving one or more rewards and one or more costs (p. 117, col. 1, paragraph 7, “the agent gets −0.04 point for every step until it gets the vulnerable zone where it gets +1 as a reward”; p. 113, col. 2, paragraph 4, “These two submarines can damage the health of the agent by −2”);
fixing the preference neural network parameters (the parameters are fixed after training; p. 124, Algorithm, line 20, “Train the Q network with pre-defined loss”) and adjusting the action selection policy neural network parameters to jointly optimize a set of target objectives, wherein each of the one or more rewards and each of the one or more costs has a respective one of the target objectives (p. 120, col. 2, paragraph 7, “Fig. 15 shows the relation mapping for different objectives based on selected policies… by adjusting the weights and bias for each neuron… determines the compromising solutions based on the balance of the several objectives. These Q values are forwarded by the DQNs which consist of the set of state and action values for a particular episode that is generated by the emulator. In the deep layer, the weights of the neural network are adjusted based on the backpropagation procedure”; for the Q value of each objective, p. 122, col. 2, paragraph 1, “Q values for the different objectives by averaging them or multiplying with the preference values”; p. 113, col. 2, paragraph 4, “damage the health of the agent by −2. This health parameter creates another objective that needs to be satisfied”); and
fixing the action selection policy neural network parameters (the parameters are fixed after adjustment; p. 120, col. 2, paragraph 7, “Fig. 15 shows the relation mapping for different objectives based on selected policies… by adjusting the weights and bias for each neuron”) and adjusting the preference neural network parameters to optimize a constraint objective dependent on whether each of the one or more constraints is satisfied by the one or more respective costs (the parameters are adjusted during training; p. 124, Algorithm, line 20, “Train the Q network with pre-defined loss”; line 16 shows optimization toward the constraint objective, where the target t encompasses the constraint satisfied by the cost r).
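Examiner's note (illustrative only): to make the record clear as to how the two-phase training loop recited in Claim 1 operates, the following minimal Python sketch alternates the two optimizations, first holding the preference parameters fixed while the policy parameters are adjusted, then the reverse. It is a hypothetical toy (surrogate objectives, finite-difference gradients, and all names, shapes, and hyperparameters are invented) and is not asserted to reproduce Hasan's algorithm or Applicant's implementation.

import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the two networks recited in the claim (shapes and values hypothetical).
policy_params = rng.normal(size=4)   # action selection policy neural network parameters
pref_params = rng.normal(size=2)     # preference neural network parameters

def preference_distribution(theta):
    # Softmax keeps the preference vector on the simplex (one weight per objective).
    e = np.exp(theta - theta.max())
    return e / e.sum()

def per_objective_returns(phi):
    # Toy quadratic surrogates: one "reward" objective and one "cost" objective.
    reward_return = -np.sum((phi - 1.0) ** 2)
    cost_return = -np.sum((phi + 0.5) ** 2)
    return np.array([reward_return, cost_return])

def weighted_objective(phi, w):
    # Joint target objective: preference-weighted combination of per-objective returns.
    return w @ per_objective_returns(phi)

def constraint_objective(theta, phi, cost_limit=-2.0):
    # Penalizes preference settings under which the cost constraint is violated.
    w = preference_distribution(theta)
    violation = max(0.0, cost_limit - per_objective_returns(phi)[1])
    return -violation * w[1]

def grad(f, x, eps=1e-5):
    # Finite-difference gradient, standing in for backpropagation.
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

for step in range(500):
    # Phase 1: preference network parameters fixed; adjust the policy parameters
    # to jointly optimize the target objectives under the sampled preferences.
    w = preference_distribution(pref_params)
    policy_params += 0.05 * grad(lambda p: weighted_objective(p, w), policy_params)
    # Phase 2: policy parameters fixed; adjust the preference network parameters
    # to optimize the constraint objective.
    pref_params += 0.05 * grad(lambda t: constraint_objective(t, policy_params), pref_params)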
Regarding Claim 2, Hasan teaches the method of Claim 1 as referenced above. Hasan further teaches:
wherein the action selection policy neural network parameters are adjusted to jointly optimize a set of target objectives towards the Pareto front, and wherein the set of objectives is optimized towards a region of the Pareto front in which the constraints are satisfied (the reward space encompasses both cost and reward; p. 117, col. 1, paragraph 6, “the agent gets −0.04 point for every step until it gets the vulnerable zone where it gets +1 as a reward and −1 for the resilient zone”; the objective is optimized with this constraint satisfied; p. 113, col. 2, paragraph 6, “An agent’s objective is to find out all the state vectors for a problem where these defines a non-convex Pareto frontier in reward space”; p. 120, col. 2, paragraph 7, “The agent selects an action a in the state s to form the vector of Q⃗(s, a) that is composed with the reward vectors r⃗t = [r⃗1, r⃗2, … r⃗n]”; p. 113, col. 1, paragraph 5, “The optimal Pareto front has 10 non-dominated solutions, one per each treasure in the grid”; p. 120, col. 2, paragraph 7, “Update the Pareto Front (PF) position by the following Eq. (23):”).
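Examiner's note (illustrative only): the notion of optimizing towards the constraint-satisfying region of the Pareto front can be pictured with a short, hypothetical Python sketch of non-dominated filtering restricted to feasible return vectors. The function names and numbers are invented and are not drawn from Hasan beyond the cited discussion of non-dominated solutions.

import numpy as np

def dominates(u, v):
    # u Pareto-dominates v when u is at least as good everywhere and strictly better somewhere.
    return bool(np.all(u >= v) and np.any(u > v))

def constrained_pareto_front(points, cost_limits):
    # points: (n, k) per-objective returns, with cost objectives in the last columns.
    # Keep only constraint-satisfying points, then drop dominated ones.
    m = len(cost_limits)
    feasible = [p for p in points if np.all(p[-m:] >= cost_limits)]
    return [p for p in feasible
            if not any(dominates(q, p) for q in feasible if q is not p)]

# Hypothetical (reward, cost) return vectors for four candidate policies; cost limit of -1.
candidates = np.array([[1.0, -0.5], [0.8, -0.2], [1.2, -1.5], [0.5, -0.1]])
front = constrained_pareto_front(candidates, cost_limits=np.array([-1.0]))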
Regarding Claim 3, Hasan teaches the method of Claim 1 as referenced above. Hasan further teaches:
comprising determining the constraint objective (p. 113, col. 2, paragraph 4, “two submarines can damage the health of the agent by −2. This health parameter creates another objective that needs to be satisfied over time to survive”) by determining, for each of the one or more constraints, a value of a respective constraint fitting function (t is the constraint fitting function computing a Q value that includes the reward or cost constraint; p. 124, Algorithm, line 17, t = r + γ·max_a′ Q(s′, a′)), dependent on the vector of preference settings (the preference values are the vector of preference settings, and t is conditioned on these preference values; p. 122, col. 2, paragraph 1, “This module balances the Q values to achieve the possible optimal Q values for the different objectives by averaging them or multiplying with the preference values”), that evaluates whether the respective constraint is satisfied by the respective cost (p. 122, col. 2, paragraph 2, “the agent ensures that the traversing is completed to visit all the nodes so that there is no unexplored state. Thus, the agent interacts within the environment and learns the optimum values which lead to select the policy”).
Regarding Claim 4, Hasan teaches the method of Claim 1 as referenced above. Hasan further teaches:
maintaining, for each target objective, a respective Q-value neural network having a plurality of Q-value neural network parameters (p. 120, col. 2, last paragraph, “The obtained Q value by the agent is the average value from the DQNs that characterize all the objectives. Therefore, the selected objectives are the representation of the compound structure of all the objectives”; p. 123, Figure 17, plurality of DQNs) and a Q-value output for defining an action-value score dependent upon an action, an observation, and a vector of preference settings (the action score is the argmax of the Q values; p. 122, col. 2, paragraph 3, “argmax and finally select the action”; p. 122, col. 1, paragraph 6, “The state is selected based on the DQN networks and this value is sent to a stack where the agent looks for the best Q value. After that, the highest value is mapped with the ORM module where it forwards the best compromising objectives after satisfying the constraints. This module balances the Q values to achieve the possible optimal Q values for the different objectives by averaging them or multiplying with the preference values (e.g., if any)”; the preference values are the vector of preference settings);
determining a value of a constraint fitting function for a constraint from the Q-value output of the Q-value neural network for the target objective of the cost constrained by the constraint (r is the reward, which reflects the constraint such as health; t is the value of the constraint fitting function; p. 124, Algorithm, line 6, “observe reward r and new state s’”; line 17, t = r + γ·max_a′ Q(s′, a′)), averaged over observations and actions (averaged over mini-batches; p. 123, col. 2, paragraph 3, “To stabilise the learning, mini batches of 32 is used where the number of training chunk for each Adam update is computed. The replay memory size is set to 10 K where the Adam are sampled from recent actions”); and
using the constraint fitting function for each constraint to determining the constraint objective (p. 124, Algorithm, line 17 computes the target Q value for an action with the constraint fitting function; line 25 creates stacks of Q values for each objective, which is determining the constraint objective).
Regarding Claim 5, Hasan teaches the method of Claim 4 as referenced above. Hasan further teaches:
wherein the value of the constraint fitting function is dependent upon a difference between the Q-value output of the Q-value neural network for the target objective of the cost constrained by the constraint, and a constraint value for the constraint (r is the constraint value for the constraint, such as −2 for health damage; the Q-value output is Q(s′, a′); p. 124, Algorithm, line 17, t = r + γ·max_a′ Q(s′, a′)).
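Examiner's note (illustrative only): the target computation relied on for Claims 3-5 (Hasan, p. 124, Algorithm, line 17, t = r + γ·max_a′ Q(s′, a′)) and the claimed reading of a constraint fitting function as a batch-averaged difference between a cost objective's Q-value output and a constraint value can be summarized in the following hypothetical Python sketch. All numeric values are invented for illustration.

import numpy as np

GAMMA = 0.99  # discount factor; hypothetical value standing in for Hasan's γ

def td_target(r, q_next):
    # Hasan, Algorithm line 17: t = r + γ * max over a' of Q(s', a').
    return r + GAMMA * np.max(q_next)

def constraint_fitting_value(q_cost, constraint_value):
    # Claim 5 reading: difference between the cost objective's Q-value output and the
    # constraint value, averaged over a mini-batch of (observation, action) pairs.
    return float(np.mean(q_cost - constraint_value))

t = td_target(r=-0.04, q_next=np.array([0.3, 0.7, 0.1]))          # one transition's target
fit = constraint_fitting_value(np.array([-1.6, -2.4, -1.9, -2.1]), -2.0)
# fit > 0 indicates the cost constraint is satisfied on average over the batch.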
Regarding Claim 6, Hasan teaches the method of Claim 1 as referenced above. Hasan further teaches:
sampling a vector of preference settings from the preference distribution defined by the preference neural network (p. 123, Figure 17, the preference vectors are sampled when mapped with the ORM module; p. 122, col. 1, paragraph 6, “After that, the highest value is mapped with the ORM module where it forwards the best compromising objectives after satisfying the constraints. This module balances the Q values to achieve the possible optimal Q values for the different objectives by averaging them or multiplying with the preference values”) whilst the preference neural network parameters are fixed (the parameters are fixed after training; p. 124, Algorithm, line 20, “Train the Q network with pre-defined loss”); and
adjusting the action selection policy neural network parameters to jointly optimize the set of target objectives according to relative preferences amongst the target objectives defined by the relative priorities of the rewards or constraints associated with the target objectives which are defined by the sampled vector of preference settings (p. 120, col. 2, paragraph 7, “Fig. 15 shows the relation mapping for different objectives based on selected policies… by adjusting the weights and bias for each neuron… determines the compromising solutions based on the balance of the several objectives. These Q values are forwarded by the DQNs which consist of the set of state and action values for a particular episode that is generated by the emulator. In the deep layer, the weights of the neural network are adjusted based on the backpropagation procedure”, p. 122, col. 2, paragraph 1, “Q values for the different objectives by averaging them or multiplying with the preference values”).
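Examiner's note (illustrative only): sampling a preference vector while the preference parameters are held fixed, and balancing per-objective Q-values by multiplying with the sampled preference values (as in Hasan's ORM), can be sketched in hypothetical Python as follows. The Dirichlet choice of preference distribution and all names and values are assumptions made for illustration only.

import numpy as np

rng = np.random.default_rng(1)

def sample_preferences(alpha):
    # Preference distribution over objectives; a Dirichlet is one hypothetical choice,
    # with alpha standing in for the (fixed) preference network parameters.
    return rng.dirichlet(alpha)

def combined_q(q_per_objective, w):
    # Balance the per-objective Q-values by multiplying with the preference values.
    return q_per_objective @ w

q_per_objective = np.array([[0.9, -0.3],   # rows: actions, columns: objectives
                            [0.4, -0.1],
                            [1.1, -0.8]])
w = sample_preferences(np.array([1.0, 1.0]))        # preference parameters held fixed
best_action = int(np.argmax(combined_q(q_per_objective, w)))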
Regarding Claim 7, Hasan teaches the method of Claim 1 as referenced above. Hasan further teaches:
maintaining, for each target objective, a respective Q-value neural network having a plurality of Q-value neural network parameters (p. 123, Figure 17, plurality of DQNs) and a Q-value output for defining an action-value score dependent upon an action, an observation (the Q-value outputs define a value computed from an action and an observation; p. 124, Algorithm, line 25, Q1(s1, a1)), and a vector of preference settings (p. 122, col. 1, paragraph 6, “ORM module where it forwards the best compromising objectives after satisfying the constraints. This module balances the Q values to achieve the possible optimal Q values for the different objectives… multiplying with the preference values”); and
using each Q-value neural network to optimize the respective target objective for each of the one or more rewards and of the one or more costs (p. 122, col. 2, paragraph 3, “The target of our proposed algorithm is to detect the changes and then tracking the changing optima… From the below algorithm 1, it is noticeable that we need to provide vector rewards for each action and prioritise the objectives (i.e. if needed as like as the DST-enemy attack environment to prioritise health more than treasure)… the agent needs to convey the state–action pair into a deep Q network and get the highest achieved Q value (multiply with the preference value—if any)”).
Regarding Claim 8, Hasan teaches the method of Claim 7 as referenced above. Hasan further teaches:
wherein adjusting the action selection policy neural network parameters to jointly optimize the set of target objectives comprises:
sampling a vector of preference settings from the preference distribution defined by the preference neural network (p. 123, Figure 17, preference vectors are sampled when mapped with ORM module, p. 122, col. 1, paragraph 6, “After that, the highest value is mapped with the ORM module where it forwards the best compromising objectives after satisfying the constraints. This module balances the Q values to achieve the possible optimal Q values for the different objectives by averaging them or multiplying with the preference values”) whilst the preference neural network parameters are fixed (parameters are fixed after training, p. 124, Algorithm, line 20, “Train the Q network with pre-defined loss”);
providing the sampled vector of preference settings to each Q-value neural network (preference vector is provided to Q-values from each DQN, p. 122, col. 2, paragraph 3, “a deep Q network and get the highest achieved Q value (multiply with the preference value—if any)”);
adjusting the Q-value neural network parameters of each Q-value neural network to optimize a Q-learning objective dependent upon a measure of a difference between the Q-value output and a respective Q-value target, wherein the respective Q-value target is determined in accordance with an action selection policy defined by a current value of the action selection policy neural network parameters (p. 124, Algorithm, line 17 shows the target Q value being calculated; line 20 shows training the Q network with a pre-defined loss; minimizing the loss function involves the difference between the calculated Q-value target and the Q-value output); and
adjusting the action selection policy neural network parameters using the Q-value output of each Q-value neural network after the Q-value neural network parameters have been adjusted (p. 124, Algorithm, lines 39-42 show updating the target network Q based on the output and selection).
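Examiner's note (illustrative only): the Q-learning objective addressed above, a measure of the difference between each Q network's output and its target, is commonly a squared error, as in the hypothetical Python sketch below. The batch values are invented, and Hasan's “pre-defined loss” is not specified in the cited passage, so the squared-error form is an assumption.

import numpy as np

def q_learning_loss(q_pred, q_target):
    # Squared measure of the difference between Q-value output and Q-value target.
    return float(np.mean((q_pred - q_target) ** 2))

# Hypothetical mini-batch of per-objective predictions and targets t = r + γ*max Q.
q_pred = np.array([[0.50, -1.90], [0.42, -2.10]])     # shape: (batch, objectives)
q_target = np.array([[0.55, -2.00], [0.40, -2.05]])
losses = [q_learning_loss(q_pred[:, k], q_target[:, k]) for k in range(q_pred.shape[1])]
# Each objective's Q network would be stepped to reduce its loss; only afterwards are
# the action selection policy parameters adjusted from the updated Q outputs.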
Regarding Claim 9, Hasan teaches the method of Claim 8 as referenced above. Hasan further teaches:
wherein adjusting the action selection policy neural network parameters using Q-value outputs from the Q-value neural networks (p. 122, col. 1, paragraph 6, “the ORM module where it forwards the best compromising objectives after satisfying the constraints. This module balances the Q values to achieve the possible optimal Q values for the different objectives”, p. 120, paragraph 2, “Fig. 15 shows the relation mapping for different objectives based on selected policies where the convolutional layers direct to form a compromised solution… adjusting the weights and bias for each neuron”) comprises:
determining, for each target objective, an improved action selection policy wherein the improved action selection policy is determined by scaling the output of the action selection policy neural network by a policy improvement factor for the target objective dependent upon the Q-value output from the Q-value neural network for the target objective (p. 122, col. 2, paragraph 1, “This module balances the Q values to achieve the possible optimal Q values for the different objectives by averaging them or multiplying with the preference values”, p. 120, col. 1, paragraph 7, “meta-policy… that defines which policy needs to be counted and prioritise the objectives… a parity value (i.e. dynamic weight) has been introduced”); and
adjusting the action selection policy neural network parameters to optimize a combined policy improvement objective dependent upon a measure of a difference between a current action selection policy defined by current parameters of the action selection policy neural network and the improved action policy for each target objective (the difference between the current action selection policy and the improved action policy is shown during backpropagation, p. 120, col. 2, paragraph 7, “Fig. 15 shows the relation mapping for different objectives based on selected policies… by adjusting the weights and bias for each neuron… determines the compromising solutions based on the balance of the several objectives. These Q values are forwarded by the DQNs which consist of the set of state and action values for a particular episode that is generated by the emulator. In the deep layer, the weights of the neural network are adjusted based on the backpropagation procedure”).
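The determining-and-adjusting steps recited in Claim 9 may be sketched as follows (hypothetical: an exponentiated, temperature-scaled Q-value is assumed as the policy improvement factor, and a KL divergence is assumed as the measure of difference; neither is quoted from Hasan):

import numpy as np

def improved_policy(policy_probs, q_values, temperature):
    # Scale the current policy output by a policy improvement factor
    # dependent upon the Q-value output for the target objective.
    weights = np.exp(q_values / temperature)
    scaled = policy_probs * weights
    return scaled / scaled.sum()

def kl(p, q, eps=1e-12):
    # Measure of a difference between two action distributions.
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

policy = np.array([0.25, 0.25, 0.5])     # current action distribution
q_reward = np.array([1.0, 0.2, 0.5])     # Q-values, reward objective
q_cost = np.array([-0.1, -0.8, -0.3])    # Q-values, cost objective

improved = [improved_policy(policy, q, temperature=0.5)
            for q in (q_reward, q_cost)]
# Combined policy improvement objective: sum of per-objective differences
# between the current policy and each improved policy, to be minimized.
combined_loss = sum(kl(pi, policy) for pi in improved)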
Regarding Claim 10, Hasan teaches the method of Claim 9 as referenced above. Hasan further teaches:
where each of the current action selection policy and the improved action policy for each target objective is defined by a distribution over actions dependent upon the observations and upon the preference settings, and wherein the measure of a difference comprises a measure of a difference between the distributions (p. 119, col. 2, paragraph 4, “The action a is selected by the policy in each state is given by the expression argmax_a Q(s, a)”, p. 120, col. 1, paragraph 7, “a parity value (i.e. dynamic weight) has been introduced before summing up the Q-values from DQNs that ensures equilibrium between objectives”, a distribution over actions is shown with the greedy approach, “To reach all the states including the unvisited nodes, a 𝜀-greedy exploration policy has been implemented with annealing from 1 to 0.05”, a difference between distributions is shown during backpropagation, p. 120, col. 2, paragraph 7, “From Fig. 15, it is observable that the agent determines the compromising solutions based on the balance of the several objectives… In the deep layer, the weights of the neural network are adjusted based on the backpropagation procedure”).
Regarding Claim 11, Hasan teaches the method of Claim 9 as referenced above. Hasan further teaches:
wherein the policy improvement factor for the target objective is dependent upon a temperature parameter, the method further comprising
obtaining the temperature parameter by optimizing a temperature objective function dependent upon (the temperature parameter is the epsilon, optimized by annealing; p. 123, col. 2, paragraph 3, “a 𝜀-greedy exploration policy has been implemented with annealing from 1 to 0.05”), for each target objective, the Q-value output from the Q-value neural network for the target objective averaged over preference settings sampled from the preference distribution, observations received from the environment, and actions for the observations sampled from the current action selection policy (the temperature parameter will be applied during action selection depending on the Q-value, p. 122, col. 2, paragraph 1, “balances the Q values to achieve the possible optimal Q values for the different objectives by averaging them or multiplying with the preference values (e.g., if any)”, p. 122, col. 1, paragraph 6, “the Q values are selected based on the states and actions. The state is selected based on the DQN networks and this value is sent to a stack where the agent looks for the best Q value”).
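The two readings applied above may be sketched as follows (hypothetical: Hasan's annealed epsilon is treated as the temperature-like parameter, and a dual-style temperature objective averaged over sampled Q-values is assumed; neither formula is quoted from Hasan):

import numpy as np

def annealed_epsilon(step, total_steps, start=1.0, end=0.05):
    # Annealing from 1 to 0.05, as in the cited exploration policy.
    frac = min(step / total_steps, 1.0)
    return start + frac * (end - start)

def temperature_objective(temperature, q_samples, kl_bound=0.1):
    # Temperature objective dependent upon Q-value outputs averaged over
    # sampled preference settings, observations, and actions (collapsed
    # here into the q_samples array).
    return temperature * kl_bound + temperature * np.log(
        np.mean(np.exp(q_samples / temperature)))

print(annealed_epsilon(500, 1000))                           # about 0.525
print(temperature_objective(0.5, np.array([0.2, 0.4, 0.1])))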
Regarding Claim 12, Hasan teaches the method of Claim 1 as referenced above. Hasan further teaches:
further comprising initializing the preference distribution such that a sampled vector of preference settings sampled from the preference distribution is likely to prioritize costs over rewards (p. 122, col. 1, paragraph 6, “From Fig. 17, it is observable that the Q values are selected based on the states and actions. The state is selected based on the DQN networks and this value is sent to a stack where the agent looks for the best Q value. After that, the highest value is mapped with the ORM module where it forwards the best compromising objectives after satisfying the constraints. This module balances the Q values to achieve the possible optimal Q values for the different objectives by averaging them or multiplying with the preference values (e.g., if any)”; the agent tries to survive and must prioritize costs to do so, “damage the health of the agent by −2. This health parameter creates another objective that needs to be satisfied over time to survive”).
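The initialization recited in Claim 12 may be sketched, for example, with a Dirichlet preference distribution whose concentration parameters are skewed toward the cost objective (a hypothetical construction; Hasan does not state this parameterization):

import numpy as np

rng = np.random.default_rng(2)
# Concentration parameters: the cost dimension is larger than the reward
# dimension, so a sampled preference vector is likely to prioritize costs.
alpha = np.array([1.0,    # reward objective
                  5.0])   # cost objective, prioritized at initialization
prefs = rng.dirichlet(alpha)   # most mass expected on the cost entry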
Regarding Claim 13, Hasan teaches the method of Claim 1 as referenced above. Hasan further teaches:
wherein the environment is a real-world environment and the constraints are constraints on costs incurred by the agent when acting in the real-world to perform the task (p. 107, Abstract, “the developed algorithm has been implemented to build an expert system for a real-world scenario using Markov decision process to identify the vulnerable zones based on water quality resilience in São Paulo, Brazil”, p. 117, col. 1, paragraph 7, “This environment is based on 22 zones… The objectives are similar to our DST environment where the agent gets −0.04 point for every step until it gets the vulnerable zone where it gets +1 as a reward and −1 for the resilient zone”).
Regarding Claim 15, Hasan teaches:
One or more non-transitory computer-readable storage media encoded with instructions that, when executed by one or more computers (Hasan uses python with libraries, demonstrating that Hasan performs their method on a computer, in which processor, memory, and storage devices are inherent; p. 123, col. 1, paragraph 1, “created based on MATLAB (2018a) with python version (3.5) with the libraries (i.e. tensorflow and keras)”), cause the one or more computers to perform operations for controlling an agent to perform one or more tasks while interacting with an environment (p. 107, Abstract, “a dynamic multiobjective testbed has been created which is a modified version of the conventional deep-sea treasure (DST) hunt testbed. This modified testbed fulfils the changing aspects of the dynamic environment in terms of the characteristics where the changes occur based on time”), each task having at least one respective reward associated with performance of the task (p. 115, Equation 13, showing a reward function with rewards for a task), and wherein performing the one or more tasks is subject to one or more constraints on one or more respective costs (the cost is the penalty values and the constraint is maximizing reward, p. 113, col. 2, paragraph 4, “This health parameter creates another objective that needs to be satisfied over time to survive” and “These two submarines can damage the health of the agent by −2”, p. 117, col. 1, paragraph 6, “the agent gets −0.04 point for every step until it gets the vulnerable zone where it gets +1 as a reward and −1 for the resilient zone”), the operations comprising:
maintaining an action selection policy neural network (Figure 17, bottom of figure: Action and ORM are the action selection policy neural network; ORM is a neural network as shown in p. 122, Figure 15, consisting of four different layers), having a plurality of action selection policy neural network parameters (p. 120, paragraph 2, “Fig. 15 shows the relation mapping for different objectives based on selected policies where the convolutional layers direct to form a compromised solution. This process is predominantly executed by the neural network of the deep layer and by adjusting the weights and bias for each neuron”), for selecting an action (p. 123, bottom of Figure 17, argmax is used on values from ORM to select an action) in response to an observation from the environment (Figure 17, the environment state, which is the observations, leads to the action selection policy neural network), wherein the action selection policy neural network is conditioned on a preference vector input comprising a vector of preference settings each defining a relative priority of one of the rewards or costs with respect to the others (Figure 17, the preference vector input is p1-pn, each parity value defines a relative priority, p. 120, col. 2, paragraph 1, “balancing the objectives using objective relation mapping based on a dynamic weight (parity value)”);
maintaining a preference neural network, having a plurality of preference neural network parameters (Figure 17, the DQNs are the preference neural network; p. 124, Algorithm, line 20, “Train the Q network with pre-defined loss”, the DQNs are trained, showing they have parameters), defining a preference distribution for the vector of preference settings (the vector of preference settings influences the outputs of the DQNs combined into a Q value; the distribution is shown in the different selection of dynamic weights (parity values) per episode, “The obtained Q value by the agent is the average value from the DQNs that characterise all the objectives. Therefore, the selected objectives are the representation of the compound structure of all the objectives. Needless to mention, this value is made by the Q values in a finite horizon. In other words, this can be represented as the most compromising solutions that the agent could achieve in a particular episode.”);
selecting, at each of a plurality of time steps, an action to be performed by the agent in the environment by receiving an observation from the environment and a vector of preference settings sampled from the preference distribution defined by the preference neural network, providing the vector of preference settings and data from the observation to the action selection policy neural network (p. 119, col. 1, paragraph 2, “The agent starts from the start state 𝑆1…”, p. 120, col. 1, paragraph 8, “The set of 𝑄(𝑠, 𝑎) is updated with vectors from 𝑆′ at time step t for the first time after performing an action 𝑎 in state 𝑆”, p. 123, Figure 17 shows the action after a pipeline of environment state observation and creating a vector of preference values p input into the action selection policy neural network (ORM and Action blocks)), using an output of the action selection policy neural network to select the action to be performed by the agent (p. 123, Figure 17 shows the Action for the agent as output into the learning module), and receiving one or more rewards and one or more costs (p. 117, col. 1, paragraph 7, “the agent gets −0.04 point for every step until it gets the vulnerable zone where it gets +1 as a reward”, p. 113, col. 2, paragraph 4, “These two submarines can damage the health of the agent by −2”);
fixing the preference neural network parameters (parameters are fixed after training, p. 124, Algorithm, line 20, “Train the Q network with pre-defined loss”) and adjusting the action selection policy neural network parameters to jointly optimize a set of target objectives, wherein each of the one or more rewards and each of the one or more costs has a respective one of the target objectives (p. 120, col. 2, paragraph 7, “Fig. 15 shows the relation mapping for different objectives based on selected policies… by adjusting the weights and bias for each neuron… determines the compromising solutions based on the balance of the several objectives. These Q values are forwarded by the DQNs which consist of the set of state and action values for a particular episode that is generated by the emulator. In the deep layer, the weights of the neural network are adjusted based on the backpropagation procedure”, Q value of each objective, p. 122, col. 2, paragraph 1, “Q values for the different objectives by averaging them or multiplying with the preference values”, p. 113, col. 2, paragraph 4, “damage the health of the agent by −2. This health parameter creates another objective that needs to be satisfied”); and
fixing the action selection policy neural network parameters (parameters are fixed after adjustment, p. 120, col. 2, paragraph 7, “Fig. 15 shows the relation mapping for different objectives based on selected policies… by adjusting the weights and bias for each neuron”) and adjusting the preference neural network parameters to optimize a constraint objective dependent on whether each of the one or more constraints is satisfied by the one or more respective costs (parameters are adjusted during training, p. 124, Algorithm, line 20, “Train the Q network with pre-defined loss”; line 16 shows optimization toward the constraint objective, where the target value t_t encompasses the constraint satisfied by the cost r_t).
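The alternating scheme in the final two limitations may be sketched as follows (a toy stand-in with hypothetical names, not Hasan's algorithm): the preference parameters are held fixed while the policy is adjusted, then the policy is held fixed while the preference parameters are adjusted according to whether the cost constraint is satisfied:

import numpy as np

rng = np.random.default_rng(3)
policy_params = rng.normal(size=4)
pref_params = np.array([1.0, 1.0])   # Dirichlet concentrations
cost_limit = 0.5

def rollout_cost(params):
    # Placeholder for environment interaction returning an average cost.
    return float(np.clip(np.abs(params).mean(), 0.0, 1.0))

for step in range(100):
    # Phase 1: preference parameters fixed; adjust the policy toward the
    # jointly weighted target objectives (toy gradient step).
    prefs = rng.dirichlet(pref_params)
    policy_params -= 0.01 * prefs[0] * policy_params

    # Phase 2: policy parameters fixed; adjust the preference parameters
    # via a constraint objective that depends on whether the cost
    # constraint is satisfied by the incurred cost.
    violation = rollout_cost(policy_params) - cost_limit
    pref_params[1] = max(0.1, pref_params[1] + 0.1 * violation)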
Regarding Claim 16, Hasan teaches:
A system comprising one or more computers and one or more storage devices storing instructions that when executed by one or more computers (Hasan uses python with libraries, demonstrating that Hasan performs their method on a computer, in which processor, memory, and storage devices are inherent; p. 123, col. 1, paragraph 1, “created based on MATLAB (2018a) with python version (3.5) with the libraries (i.e. tensorflow and keras)”) cause the one or more computers to perform operations comprising controlling an agent to perform one or more tasks while interacting with an environment (p. 107, Abstract, “a dynamic multiobjective testbed has been created which is a modified version of the conventional deep-sea treasure (DST) hunt testbed. This modified testbed fulfils the changing aspects of the dynamic environment in terms of the characteristics where the changes occur based on time”), each task having at least one respective reward associated with performance of the task (p. 115, Equation 13, showing a reward function with rewards for a task), and wherein performing the one or more tasks is subject to one or more constraints on one or more respective costs (the cost is the penalty values and the constraint is maximizing reward, p. 113, col. 2, paragraph 4, “This health parameter creates another objective that needs to be satisfied over time to survive” and “These two submarines can damage the health of the agent by −2”, p. 117, col. 1, paragraph 6, “the agent gets −0.04 point for every step until it gets the vulnerable zone where it gets +1 as a reward and −1 for the resilient zone”), the method comprising:
maintaining an action selection policy neural network (Figure 17, bottom of figure: Action and ORM are the action selection policy neural network; ORM is a neural network as shown in p. 122, Figure 15, consisting of four different layers), having a plurality of action selection policy neural network parameters (p. 120, paragraph 2, “Fig. 15 shows the relation mapping for different objectives based on selected policies where the convolutional layers direct to form a compromised solution. This process is predominantly executed by the neural network of the deep layer and by adjusting the weights and bias for each neuron”), for selecting an action (p. 123, bottom of Figure 17, argmax is used on values from ORM to select an action) in response to an observation from the environment (Figure 17, the environment state, which is the observations, leads to the action selection policy neural network), wherein the action selection policy neural network is conditioned on a preference vector input comprising a vector of preference settings each defining a relative priority of one of the rewards or costs with respect to the others (Figure 17, the preference vector input is p1-pn, each parity value defines a relative priority, p. 120, col. 2, paragraph 1, “balancing the objectives using objective relation mapping based on a dynamic weight (parity value)”);
maintaining a preference neural network, having a plurality of preference neural network parameters (Figure 17, the DQNs are the preference neural network; p. 124, Algorithm, line 20, “Train the Q network with pre-defined loss”, the DQNs are trained, showing they have parameters), defining a preference distribution for the vector of preference settings (the vector of preference settings influences the outputs of the DQNs combined into a Q value; the distribution is shown in the different selection of dynamic weights (parity values) per episode, “The obtained Q value by the agent is the average value from the DQNs that characterise all the objectives. Therefore, the selected objectives are the representation of the compound structure of all the objectives. Needless to mention, this value is made by the Q values in a finite horizon. In other words, this can be represented as the most compromising solutions that the agent could achieve in a particular episode.”);
selecting, at each of a plurality of time steps, an action to be performed by the agent in the environment by receiving an observation from the environment and a vector of preference settings sampled from the preference distribution defined by the preference neural network, providing the vector of preference settings and data from the observation to the action selection policy neural network (p. 119, col. 1, paragraph 2, “The agent starts from the start state 𝑆1…”, p. 120, col. 1, paragraph 8, “The set of 𝑄(𝑠, 𝑎) is updated with vectors from 𝑆′ at time step t for the first time after performing an action 𝑎 in state 𝑆”, p. 123, Figure 17 shows the action after a pipeline of environment state observation and creating a vector of preference values p input into the action selection policy neural network (ORM and Action blocks)), using an output of the action selection policy neural network to select the action to be performed by the agent (p. 123, Figure 17 shows the Action for the agent as output into the learning module), and receiving one or more rewards and one or more costs (p. 117, col. 1, paragraph 7, “the agent gets −0.04 point for every step until it gets the vulnerable zone where it gets +1 as a reward”, p. 113, col. 2, paragraph 4, “These two submarines can damage the health of the agent by −2”);
fixing the preference neural network parameters (parameters are fixed after training, p. 124, Algorithm, line 20, “Train the Q network with pre-defined loss”) and adjusting the action selection policy neural network parameters to jointly optimize a set of target objectives, wherein each of the one or more rewards and each of the one or more costs has a respective one of the target objectives (p. 120, col. 2, paragraph 7, “Fig. 15 shows the relation mapping for different objectives based on selected policies… by adjusting the weights and bias for each neuron… determines the compromising solutions based on the balance of the several objectives. These Q values are forwarded by the DQNs which consist of the set of state and action values for a particular episode that is generated by the emulator. In the deep layer, the weights of the neural network are adjusted based on the backpropagation procedure”, Q value of each objective, p. 122, col. 2, paragraph 1, “Q values for the different objectives by averaging them or multiplying with the preference values”, p. 113, col. 2, paragraph 4, “damage the health of the agent by −2. This health parameter creates another objective that needs to be satisfied”); and
fixing the action selection policy neural network parameters (parameters are fixed after adjustment, p. 120, col. 2, paragraph 7, “Fig. 15 shows the relation mapping for different objectives based on selected policies… by adjusting the weights and bias for each neuron”) and adjusting the preference neural network parameters to optimize a constraint objective dependent on whether each of the one or more constraints is satisfied by the one or more respective costs (parameters are adjusted during training, p. 124, Algorithm, line 20, “Train the Q network with pre-defined loss”; line 16 shows optimization toward the constraint objective, where the target value t_t encompasses the constraint satisfied by the cost r_t).
Regarding Claim 17, the rejection of Claim 16 is incorporated and further, the claim is rejected for the same reasons as set forth in Claim 2.
Regarding Claim 18, the rejection of Claim 16 is incorporated and further, the claim is rejected for the same reasons as set forth in Claim 3.
Regarding Claim 19, the rejection of Claim 16 is incorporated and further, the claim is rejected for the same reasons as set forth in Claim 4.
Regarding Claim 20, the rejection of Claim 19 is incorporated and further, the claim is rejected for the same reasons as set forth in Claim 5.
Regarding Claim 21, the rejection of Claim 16 is incorporated and further, the claim is rejected for the same reasons as set forth in Claim 6.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JESSE CHEN COULSON whose telephone number is (571)272-4716. The examiner can normally be reached Monday-Friday 8:30-5:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki can be reached at (571) 272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/JESSE C COULSON/
Examiner, Art Unit 2122
/KAKALI CHAKI/Supervisory Patent Examiner, Art Unit 2122