Office Action Analysis: 18113350 — METHODS AND APPARATUS TO AUTOMATICALLY TUNE REINFORCEMENT LEARNING HYPERPARAMETERS ASSOCIATED WITH AGENT BEHAVIOR

Office Action

§101 §102 §103 §112
DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. This action is made non-final.
Claims 1-27 are pending in the case.

Claim Objections
Claim 10 is objected to under 37 CFR 1.75 as being a substantial duplicate of claim 5. When two claims in an application are duplicates or else are so close in content that they both cover the same thing, despite a slight difference in wording, it is proper after allowing one claim to object to the other as being a substantial duplicate of the allowed claim. See MPEP § 608.01(m).

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(d):
(d) REFERENCE IN DEPENDENT FORMS.—Subject to subsection (e), a claim in dependent form shall contain a reference to a claim previously set forth and then specify a further limitation of the subject matter claimed. A claim in dependent form shall be construed to incorporate by reference all the limitations of the claim to which it refers.

The following is a quotation of pre-AIA 35 U.S.C. 112, fourth paragraph:
Subject to the following paragraph [i.e., the fifth paragraph of pre-AIA 35 U.S.C. 112], a claim in dependent form shall contain a reference to a claim previously set forth and then specify a further limitation of the subject matter claimed. A claim in dependent form shall be construed to incorporate by reference all the limitations of the claim to which it refers.

Claim 10 is rejected under 35 U.S.C. 112(d) or pre-AIA 35 U.S.C. 112, 4th paragraph, as being of improper dependent form for failing to further limit the subject matter of the claim upon which it depends, or for failing to include all the limitations of the claim upon which it depends. Two claims are numbered 10; the first is an exact copy of claim 5 and does not further limit the invention. Applicant may cancel the claim(s), amend the claim(s) to place the claim(s) in proper dependent form, rewrite the claim(s) in independent form, or present a sufficient showing that the dependent claim(s) complies with the statutory requirements.

Claim Interpretation
	There are two claims numbered 10, and the first is a copy of claim 5. For the purpose of this non-final rejection, the second claim 10, which recites “The method of claim 1, wherein the information associated with interactions of the agent with the environment includes a context associated with the environment, the context indicating a non-stationary nature of the environment,” is the claim that will be examined and referred to in the remainder of this paper.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


To determine if a claim is directed to patent ineligible subject matter, the Court has guided the Office to apply the Alice/Mayo test, which requires:
Step 1: Determining if the claim falls within a statutory category.
Step 2A: Determining if the claim is directed to a patent ineligible judicial exception consisting of a law of nature, a natural phenomenon, or abstract idea. Step 2A is a two-prong inquiry. MPEP 2106.04(II)(A). Under the first prong, examiners evaluate whether a law of nature, natural phenomenon, or abstract idea is set forth or described in the claim. Abstract ideas include mathematical concepts, certain methods of organizing human activity, and mental processes. MPEP 2106.04(a)(2). The second prong is an inquiry into whether the claim integrates a judicial exception into a practical application. MPEP 2106.04(d).
Step 2B: If the claim is directed to a judicial exception, determining if the claim recites limitations or elements that amount to significantly more than the judicial exception. (See MPEP 2106).
Claims 1-27 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
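The order of the inquiry described above can be sketched as a short decision function. This is only an illustrative reading of the steps as quoted in this action, not an official statement of the test; the `ClaimFindings` type and its field names are hypothetical and exist only for this sketch.

```python
from dataclasses import dataclass


@dataclass
class ClaimFindings:
    """Hypothetical record of the findings made for one claim."""
    statutory_category: bool      # Step 1: process, machine, manufacture, or composition
    recites_exception: bool       # Step 2A, prong one: judicial exception recited
    practical_application: bool   # Step 2A, prong two: exception integrated
    significantly_more: bool      # Step 2B: additional elements add significantly more


def is_eligible(f: ClaimFindings) -> bool:
    """Walk the steps in order and stop at the first dispositive finding."""
    if not f.statutory_category:
        return False              # fails Step 1
    if not f.recites_exception:
        return True               # no judicial exception recited
    if f.practical_application:
        return True               # integrated into a practical application
    return f.significantly_more   # Step 2B decides the remaining cases


# Findings as stated in this action for claim 1.
claim_1 = ClaimFindings(True, True, False, False)
claim_1_eligible = is_eligible(claim_1)
```

Under the findings stated for claim 1 in this action, the sketch returns `False`, which matches the rejection.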

Step 1: Claims 1-12 are directed to a method (a process), Claims 13-20 are directed to a computing device comprising one or more processors (a machine), and Claims 21-27 are directed to a non-transitory computer-readable storage medium (a manufacture). Therefore, Claims 1-27 are directed to a process, machine, manufacture, or composition of matter.

Regarding claim 1
Step 2A Prong 1
	Claim 1 recites the following mental processes that, in each case under the broadest reasonable interpretation, cover performance of the limitation in the mind (including observation, evaluation, judgement, opinion) or with the aid of pencil and paper but for recitation of generic computer components (e.g., “agent” and “environment”) [see MPEP 2106.04(a)(2)(III)].
	“determining an indication of a set of current values, each current value from the set of current values being associated with a different hyperparameter from the plurality of hyperparameters, the plurality of hyperparameters being configured to impact the agent's interactions with the environment” (e.g., data identification/parameter retrieval)
	“modifying the policy by automatically changing at least one current value from the set of current values based on the information associated with the interactions of the agent with the environment and the indication of the target state to increase a likelihood of the agent achieving the target state or maximizing gain of rewards over time” (e.g., evaluating information to make a judgment to change the value to improve success, evaluation/decision logic)
	Accordingly, at Step 2A, prong one, the claim recites an abstract idea.
Step 2A Prong 2
	The judicial exception is not integrated into a practical application. In particular, the claim recites the additional elements of “agent” and “environment”, which are recited at a high-level of generality such that they amount to no more than mere instructions to apply the exception using generic computer components (See MPEP 2106.05(f)). The Examiner notes that these elements are used throughout the claim limitations, and each claim which recites the same language is rejected on the same basis.
Regarding the “receiving information associated with interactions of an agent with an environment, the interactions including a plurality of states associated with the environment and a plurality of actions associated with each state from the plurality of states, the interactions being according to a policy defined based on a plurality of hyperparameters” this additional element is recited at a high level of generality and amounts to extra-solution activity of receiving data, i.e. pre-solution activity of inputting data for use in the claimed process (see MPEP 2106.05(g)). The examiner notes that “, the interactions including a plurality of states associated with the environment and a plurality of actions associated with each state from the plurality of states, the interactions being according to a policy defined based on a plurality of hyperparameters” is merely defining the data that is being received for the process.
Regarding the “receiving an indication of a target state to be achieved by the agent in the environment” this additional element is recited at a high level of generality and amounts to extra-solution activity of receiving the desired target of the training process, i.e. pre-solution activity of inputting data for use in the claimed process (see MPEP 2106.05(g)).
Accordingly, at Step 2A, prong two, the additional elements individually or in combination do not integrate the judicial exception into a practical application. 
Step 2B
	In accordance with Step 2B, the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above, the additional elements of “agent” and “environment” are recited at a high-level of generality such that they amount to no more than mere instructions to apply the exception using a generic computer component (See MPEP 2106.05(f)).
	Regarding the “receiving information associated with interactions of an agent with an environment, the interactions including a plurality of states associated with the environment and a plurality of actions associated with each state from the plurality of states, the interactions being according to a policy defined based on a plurality of hyperparameters”, and “receiving an indication of a target state to be achieved by the agent in the environment” limitations, these additional elements are recited at a high-level of generality and amount to extra-solution activity of obtaining data to input for a model, i.e., pre-solution activity of data gathering. The courts have found limitations directed to obtaining information electronically, recited at a high-level of generality, to be well-understood, routine, and conventional (see MPEP 2106.05(d)(II), “receiving or transmitting data over a network”, "electronic record keeping," and "storing and retrieving information in memory").
	Accordingly, at Step 2B, the additional element individually or in combination does not amount to significantly more than the judicial exception.

Regarding claim 2	
Step 2A Prong 1
	Claim 2 does not recite an additional abstract idea, but is directed to the abstract idea identified in its parent claim(s).
	Accordingly, at Step 2A, prong one, the claim recites an abstract idea.
Step 2A Prong 2
	The judicial exception is not integrated into a practical application. In particular, the claim recites the additional element of “wherein the modifying the policy by automatically changing the at least one current value from the set of current values is done by the agent without an involvement of a user”, which is recited at a high-level of generality such that it amounts to no more than mere instructions to apply the exception using generic computer components (See MPEP 2106.05(f)). The Examiner notes that this limitation emphasizes that the computer executes the modifying; however, under MPEP 2106.05(f), this does not meaningfully limit the claim or render the abstract idea eligible. The courts do not distinguish between mental processes performed by humans and those performed on a computer (MPEP 2106.04(a)(2)(III)).
	Accordingly, at Step 2A, prong two, the additional elements individually or in combination do not integrate the judicial exception into a practical application.
Step 2B
	In accordance with Step 2B, the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above, the additional element of “wherein the modifying the policy by automatically changing the at least one current value from the set of current values is done by the agent without an involvement of a user” is recited at a high-level of generality such that it amounts to no more than mere instructions to apply the exception using a generic computer component (See MPEP 2106.05(f)).
	Accordingly, at Step 2B, the additional element individually or in combination does not amount to significantly more than the judicial exception.

Regarding claim 3
Step 2A Prong 1
	Claim 3 does not recite an additional abstract idea, but is directed to the abstract idea identified in its parent claim(s).
	Accordingly, at Step 2A, prong one, the claim recites an abstract idea.
Step 2A Prong 2
	The judicial exception is not integrated into a practical application. In particular, the claim recites the additional element of “wherein the plurality of hyperparameters includes at least one of lambda indicating a learning rate, gamma indicating a measure of discount of future rewards, or epsilon indicating a coefficient of greediness in the agent's interactions with the environment”, which is recited at a high-level of generality such that it amounts to no more than generally linking the use of the abstract idea to a particular technological environment or field of use using a generic computer component (See MPEP 2106.05(h)). In particular, it merely describes how the data is labeled for use in the claimed process.
	Accordingly, at Step 2A, prong two, the additional elements individually or in combination do not integrate the judicial exception into a practical application.
Step 2B
	In accordance with Step 2B, the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above, the additional element of “wherein the plurality of hyperparameters includes at least one of lambda indicating a learning rate, gamma indicating a measure of discount of future rewards, or epsilon indicating a coefficient of greediness in the agent's interactions with the environment” is recited at a high-level of generality such that it amounts to no more than generally linking the use of the abstract idea to a particular technological environment or field of use using a generic computer component (See MPEP 2106.05(h)). In particular, it merely describes how the data is labeled for use in the claimed process.
	Accordingly, at Step 2B, the additional element individually or in combination does not amount to significantly more than the judicial exception.
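For readers unfamiliar with the three hyperparameters named in claim 3, the following minimal sketch shows where each would typically appear. It assumes tabular Q-learning with epsilon-greedy action selection, which this action does not say the application uses; the function names are hypothetical, and epsilon is treated here as an exploration probability, which may differ from the "coefficient of greediness" as defined in the specification.

```python
import random


def choose_action(q_row, epsilon, rng):
    """Epsilon-greedy selection: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))                    # random exploratory action
    return max(range(len(q_row)), key=lambda a: q_row[a])   # best-known action


def q_update(q, state, action, reward, next_state, lam, gamma):
    """One tabular Q-learning step.

    lam   -- learning rate (the claim's "lambda")
    gamma -- discount applied to future rewards
    """
    target = reward + gamma * max(q[next_state])
    q[state][action] += lam * (target - q[state][action])
    return q[state][action]


# Two states, two actions; the values are illustrative only.
q_table = [[0.0, 0.0], [0.0, 1.0]]
new_value = q_update(q_table, state=0, action=1, reward=1.0,
                     next_state=1, lam=0.5, gamma=0.9)
greedy_action = choose_action(q_table[1], epsilon=0.0, rng=random.Random(0))
```

In this sketch a larger `lam` moves the stored value further toward the new target on each step, a larger `gamma` weights later rewards more heavily, and a larger `epsilon` makes random exploratory actions more frequent.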

Regarding claim 4	
Step 2A Prong 1
	Claim 4 does not recite an additional abstract idea, but is directed to the abstract idea identified in its parent claim(s).
	Accordingly, at Step 2A, prong one, the claim recites an abstract idea.
Step 2A Prong 2
	The judicial exception is not integrated into a practical application. In particular, the claim recites the additional elements of “determining a first measure of performance associated with interactions of the agent with the environment, the first measure of performance being based on the agent's interactions with the environment according to the first policy”, and “determining a second measure of performance associated with interactions of the agent with the environment, the second measure of performance being based on the agent's interactions with the environment according to a second policy different than the first policy”, which are recited at a high-level of generality such that they amount to extra-solution activity of data analysis, i.e. post-solution activity of selecting a particular data source or type of data to be manipulated for use in the claimed process (see MPEP 2106.05(g)).
	The “calculating a difference between the first measure of performance and the second measure of performance, the changing at least one current value from the set of current values being based on the difference” limitation is recited at a high-level of generality such that it amounts to no more than mere instructions to apply the exception using a generic computer component (See MPEP 2106.05(f)). This could also be classified as a mathematical concept under MPEP 2106.04(a)(2)(I). This limitation takes the outcomes of the process, compares them mathematically, and then modifies the value depending on that comparison, thus applying the result to the abstract idea of claim 1.
	Accordingly, at Step 2A, prong two, the additional elements individually or in combination do not integrate the judicial exception into a practical application.
Step 2B
	In accordance with Step 2B, the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above, regarding the “determining a first measure of performance associated with interactions of the agent with the environment, the first measure of performance being based on the agent's interactions with the environment according to the first policy”, and “determining a second measure of performance associated with interactions of the agent with the environment, the second measure of performance being based on the agent's interactions with the environment according to a second policy different than the first policy” limitations, these additional elements are recited at a high-level of generality and amount to extra-solution activity, namely post-solution activity of selecting a particular data source or type of data to be manipulated. The courts have found limitations directed to obtaining information electronically, recited at a high-level of generality, to be well-understood, routine, and conventional (see MPEP 2106.05(d)(II), “receiving or transmitting data over a network”, "electronic record keeping," and "storing and retrieving information in memory").
	The “calculating a difference between the first measure of performance and the second measure of performance, the changing at least one current value from the set of current values being based on the difference” limitation is recited at a high-level of generality such that it amounts to no more than mere instructions to apply the exception using a generic computer component (See MPEP 2106.05(f)).
	Accordingly, at Step 2B, the additional element individually or in combination does not amount to significantly more than the judicial exception.
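The arithmetic that claim 4 describes, as characterized in this action, can be illustrated with a short sketch. Everything specific in it is an assumption made for illustration: mean reward is used as the measure of performance, epsilon is chosen as the value being changed, and the `adjust_epsilon` rule and its `step` size are hypothetical rather than taken from the application.

```python
from statistics import mean


def performance_difference(first_rewards, second_rewards):
    """Difference between the first and second measures (mean reward assumed)."""
    return mean(first_rewards) - mean(second_rewards)


def adjust_epsilon(epsilon, difference, step=0.05):
    """Hypothetical rule: explore less when the first policy trails the second."""
    if difference < 0:
        epsilon -= step
    elif difference > 0:
        epsilon += step
    return min(1.0, max(0.0, epsilon))   # keep epsilon a valid probability


first_policy_rewards = [1.0, 2.0, 3.0]    # illustrative data only
second_policy_rewards = [2.0, 3.0, 4.0]
diff = performance_difference(first_policy_rewards, second_policy_rewards)
new_epsilon = adjust_epsilon(0.2, diff)
```

The sketch shows only that the change in the value depends on the computed difference; the direction and size of the change in the actual application may be entirely different.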

Regarding claim 5
Step 2A Prong 1
	Claim 5 does not recite an additional abstract idea, but is directed to the abstract idea identified in its parent claim(s).
	Accordingly, at Step 2A, prong one, the claim recites an abstract idea.
Step 2A Prong 2
	The judicial exception is not integrated into a practical application. In particular, the claim recites the additional element of “wherein the first measure of performance and the second measure of performance are at least one of a measure of rewards received in the course of the interactions of the agent with the environment or a measure of quality perceived by the agent in the course of the interactions of the agent with the environment”, which is recited at a high-level of generality such that it amounts to extra-solution activity of identifying observed performance data, i.e. post-solution activity of selecting a particular data source or type of data to be manipulated for use in the claimed process (see MPEP 2106.05(g)).
	Accordingly, at Step 2A, prong two, the additional elements individually or in combination do not integrate the judicial exception into a practical application.
Step 2B
	In accordance with Step 2B, the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above, regarding the “wherein the first measure of performance and the second measure of performance are at least one of a measure of rewards received in the course of the interactions of the agent with the environment or a measure of quality perceived by the agent in the course of the interactions of the agent with the environment” limitation, the additional element is recited at a high-level of generality and amounts to extra-solution activity, namely post-solution activity of selecting a particular data source or type of data to be manipulated. The courts have found limitations directed to obtaining information electronically, recited at a high-level of generality, to be well-understood, routine, and conventional (see MPEP 2106.05(d)(II), “receiving or transmitting data over a network”, "electronic record keeping," and "storing and retrieving information in memory").
	Accordingly, at Step 2B, the additional element individually or in combination does not amount to significantly more than the judicial exception.

Regarding claim 6	
Step 2A Prong 1
	Claim 6 does not recite an additional abstract idea, but is directed to the abstract idea identified in its parent claim(s).
	Accordingly, at Step 2A, prong one, the claim recites an abstract idea.
Step 2A Prong 2
	The judicial exception is not integrated into a practical application. In particular, the claim recites the additional element of “wherein the second policy is associated with greedy interactions of the agent with the environment”, which is recited at a high-level of generality such that it amounts to no more than generally linking the use of the abstract idea to a particular technological environment or field of use using a generic computer component (See MPEP 2106.05(h)). In particular, it merely describes how the data is labeled for use in the claimed process.
	Accordingly, at Step 2A, prong two, the additional elements individually or in combination do not integrate the judicial exception into a practical application.
Step 2B
	In accordance with Step 2B, the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above, the additional element of “wherein the second policy is associated with greedy interactions of the agent with the environment” is recited at a high-level of generality such that it amounts to no more than generally linking the use of the abstract idea to a particular technological environment or field of use using a generic computer component (See MPEP 2106.05(h)). In particular, it merely describes how the data is labeled for use in the claimed process.
	Accordingly, at Step 2B, the additional element individually or in combination does not amount to significantly more than the judicial exception.

Regarding claim 7	
Step 2A Prong 1
	Claim 7 recites the following mathematical concepts that, in each case under the broadest reasonable interpretation, cover performance of mathematical relationships, mathematical formulas or equations, and mathematical calculations but for recitation of generic computer components (e.g., “processor”, “machine learning model”) [see MPEP 2106.04(a)(2)(I)].
	“computing a measure of variance associated with the first measure of performance associated with the interactions of the agent with the environment” (e.g., measuring the output of a model to obtain a statistical description of a result)
	Accordingly, at Step 2A, prong one, the claim recites an abstract idea.
Step 2A Prong 2
	The judicial exception is not integrated into a practical application.  In particular, the claim recites the additional elements of “adjusting the difference between the first measure of performance and the second measure of performance based on the measure of variance, the changing at least one current value from the set of current values being based on the difference after the adjusting” which is recited at a high-level of generality such that it amounts to no more than mere instructions to apply the exception using generic computer components (See MPEP 2106.05(f)).	
	Accordingly, at Step 2A, prong two, the additional elements individually or in combination do not integrate the judicial exception into a practical application.
Step 2B
	In accordance with Step 2B, the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above, the additional element of “adjusting the difference between the first measure of performance and the second measure of performance based on the measure of variance, the changing at least one current value from the set of current values being based on the difference after the adjusting” is recited at a high-level of generality such that it amounts to no more than mere instructions to apply the exception using a generic computer component (See MPEP 2106.05(f)).
	Accordingly, at Step 2B, the additional element individually or in combination does not amount to significantly more than the judicial exception.

Regarding claim 8
Step 2A Prong 1
	Claim 8 recites the following mathematical concepts that, in each case under the broadest reasonable interpretation, cover performance of mathematical relationships, mathematical formulas or equations, and mathematical calculations but for recitation of generic computer components (e.g., “processor”, “machine learning model”) [see MPEP 2106.04(a)(2)(I)].
	 “wherein the adjusting the difference between the first measure of performance and the second measure of performance based on the measure of variance includes computing a ratio of the difference between the first measure of performance and the second measure of performance and the measure of variance” (e.g., mathematical calculation/relationship comparing two values to generate a ratio)
	Accordingly, at Step 2A, prong one, the claim recites an abstract idea.
Step 2A Prong 2
	In accordance with Step 2A, Prong 2, the claim does not include any additional elements and the judicial exception is not integrated into a practical application.
Step 2B
	In accordance with Step 2B, the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.  

Regarding claim 9
Step 2A Prong 1
	Claim 9 recites the following mathematical concepts that, in each case under the broadest reasonable interpretation, cover performance of mathematical relationships, mathematical formulas or equations, and mathematical calculations but for recitation of generic computer components (e.g., “processor”, “machine learning model”) [see MPEP 2106.04(a)(2)(I)].
	 “computing a standard deviation associated with the rewards associated with the agent's interactions with the environment according to the first policy” (e.g., mathematical calculation quantifying variability of reward values)
	“computing a Sharpe ratio of a difference the first measure of performance and the second measure of performance and the standard deviation associated with the rewards associated with the agent's interactions with the environment according to the first policy, the modifying the policy by automatically changing at least one current value from the set of current values being based on the Sharpe ratio” (e.g., statistical ratio relating expected reward differences to reward variability)
	Accordingly, at Step 2A, prong one, the claim recites an abstract idea.
Step 2A Prong 2
	The judicial exception is not integrated into a practical application. In particular, the claim recites the additional element of “wherein the first measure of performance is an expected value of rewards associated with the agents interactions with the environment according to the first policy and the second measure of performance is an expected value of rewards associated with the agent's interactions with the environment according to the second policy”, which is recited at a high-level of generality such that it amounts to extra-solution activity of identifying observed performance data, i.e. post-solution activity of selecting a particular data source or type of data to be manipulated for use in the claimed process (see MPEP 2106.05(g)).
	Accordingly, at Step 2A, prong two, the additional elements individually or in combination do not integrate the judicial exception into a practical application.
Step 2B
	In accordance with Step 2B, the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above, regarding the “wherein the first measure of performance is an expected value of rewards associated with the agents interactions with the environment according to the first policy and the second measure of performance is an expected value of rewards associated with the agent's interactions with the environment according to the second policy” limitation, the additional element is recited at a high-level of generality and amounts to extra-solution activity, namely post-solution activity of selecting a particular data source or type of data to be manipulated. The courts have found limitations directed to obtaining information electronically, recited at a high-level of generality, to be well-understood, routine, and conventional (see MPEP 2106.05(d)(II), “receiving or transmitting data over a network”, "electronic record keeping," and "storing and retrieving information in memory").
	Accordingly, at Step 2B, the additional element individually or in combination does not amount to significantly more than the judicial exception.
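The calculation identified as a mathematical concept in claim 9 can be written out in a few lines. The sketch below assumes that expected values are estimated by sample means and that the standard deviation is the sample standard deviation of the first policy's rewards; the application may define these quantities differently, and the function name is hypothetical.

```python
from statistics import mean, stdev


def sharpe_style_ratio(first_rewards, second_rewards):
    """(mean of first-policy rewards - mean of second-policy rewards) / std of first-policy rewards."""
    spread = stdev(first_rewards)   # sample standard deviation (an assumption)
    if spread == 0:
        raise ValueError("standard deviation is zero; the ratio is undefined")
    return (mean(first_rewards) - mean(second_rewards)) / spread


first_policy_rewards = [1.0, 2.0, 3.0]    # illustrative data only
second_policy_rewards = [1.0, 1.0, 1.0]
ratio = sharpe_style_ratio(first_policy_rewards, second_policy_rewards)
```

The ratio recited in claim 8 has the same shape, with a measure of variance in the denominator in place of the standard deviation.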

Regarding claim 10	
Step 2A Prong 1
	Claim 10 does not recite an additional abstract idea, but is directed to the abstract idea identified in its parent claim(s).
	Accordingly, at Step 2A, prong one, the claim recites an abstract idea.
Step 2A Prong 2
	The judicial exception is not integrated into a practical application. In particular, the claim recites the additional element of “wherein the information associated with interactions of the agent with the environment includes a context associated with the environment, the context indicating a non-stationary nature of the environment”, which is recited at a high-level of generality such that it amounts to no more than generally linking the use of the abstract idea to a particular technological environment or field of use using a generic computer component (See MPEP 2106.05(h)). In particular, it merely describes how the data is labeled for use in the claimed process.
	Accordingly, at Step 2A, prong two, the additional elements individually or in combination do not integrate the judicial exception into a practical application.
Step 2B
	In accordance with Step 2B, the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above, the additional element of “wherein the information associated with interactions of the agent with the environment includes a context associated with the environment, the context indicating a non-stationary nature of the environment” is recited at a high-level of generality such that it amounts to no more than generally linking the use of the abstract idea to a particular technological environment or field of use using a generic computer component (See MPEP 2106.05(h)). In particular, it merely describes how the data is labeled for use in the claimed process.
	Accordingly, at Step 2B, the additional element individually or in combination does not amount to significantly more than the judicial exception.

Regarding claim 11
Step 2A Prong 1
	Claim 11 recites the following mental processes that, in each case under the broadest reasonable interpretation, cover performance of the limitation in the mind (including observation, evaluation, judgement, opinion) or with the aid of pencil and paper but for recitation of generic computer components (e.g., “agent” and “environment”) [see MPEP 2106.04(a)(2)(III)].
	“comparing the context associated with the first environment with the context associated with the second environment, the modifying the policy by changing at least one current value from the set of current values being based on the comparing” (e.g., evaluating information from two outputs to modify a policy)
	Accordingly, at Step 2A, prong one, the claim recites an abstract idea.
Step 2A Prong 2
	The judicial exception is not integrated into a practical application. In particular, the claim recites the additional element of “retrieving context associated with a second environment different from the first environment, the context associated with the second environment indicating a non-stationary nature of the second environment”, which is recited at a high-level of generality such that it amounts to extra-solution activity of receiving additional data for further determination, i.e. post-solution activity of data gathering for use in the claimed process (see MPEP 2106.05(g)).
	Accordingly, at Step 2A, prong two, the additional elements individually or in combination do not integrate the judicial exception into a practical application.
Step 2B
	In accordance with Step 2B, the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above, regarding the “retrieving context associated with a second environment different from the first environment, the context associated with the second environment indicating a non-stationary nature of the second environment” limitation, the additional element is recited at a high-level of generality and amounts to extra-solution activity of post-solution data gathering. The courts have found limitations directed to obtaining information electronically, recited at a high-level of generality, to be well-understood, routine, and conventional (see MPEP 2106.05(d)(II), “receiving or transmitting data over a network”, "electronic record keeping," and "storing and retrieving information in memory").
	Accordingly, at Step 2B, the additional element individually or in combination does not amount to significantly more than the judicial exception.

Regarding claim 12
Step 2A Prong 1
	Claim 12 does not recite an abstract idea, but is directed to the abstract idea identified in its parent claim(s).
	Accordingly, at Step 2A, prong one, the claim recites an abstract idea.
Step 2A Prong 2
	The judicial exception is not integrated into a practical application.  In particular, the claim recites the additional elements of “wherein the first measure of performance and the second measure of performance are based on the agent's interactions with the environment in a predetermined first time period, the difference between the first measure of performance and the second measure of performance being a first difference, the method further comprising”, “determining a third measure of performance associated with interactions of the agent with the environment, the third measure of performance being based on the agent's interactions with the environment according to a third policy different than the second policy, the agent's interactions with the environment being in a predetermined second time period different than the first time period”, and “determining a fourth measure of performance associated with interactions of the agent with the environment, the fourth measure of performance being based on the agent's interactions with the environment according to the second policy and in the predetermined second time period”, which are recited at a high level of generality such that they amount to extra-solution activity of identifying observed performance data, i.e. post-solution activity of selecting a particular data source or type of data to be manipulated for use in the claimed process (see MPEP 2106.05(g)).
	The “calculating a second difference between the third measure of performance and the fourth measure of performance; and comparing the second difference with the first difference, the changing at least one current value from the set of current values being based on the comparing” limitation is recited at a high level of generality such that it amounts to no more than mere instructions to apply the exception using a generic computer component (See MPEP 2106.05(f)). This limitation could also be classified as a mathematical concept under MPEP 2106.04(a)(2)(I). The limitation takes the outcomes of the process, compares them mathematically, and then modifies the value based on the result of that comparison.
	Accordingly, at Step 2A, prong two, the additional elements individually or in combination do not integrate the judicial exception into a practical application.
Step 2B
	In accordance with Step 2B, the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.  As discussed above, the additional elements of “wherein the first measure of performance and the second measure of performance are based on the agent's interactions with the environment in a predetermined first time period, the difference between the first measure of performance and the second measure of performance being a first difference, the method further comprising”, “determining a third measure of performance associated with interactions of the agent with the environment, the third measure of performance being based on the agent's interactions with the environment according to a third policy different than the second policy, the agent's interactions with the environment being in a predetermined second time period different than the first time period”, and “determining a fourth measure of performance associated with interactions of the agent with the environment, the fourth measure of performance being based on the agent's interactions with the environment according to the second policy and in the predetermined second time period” are recited at a high level of generality and amount to extra-solution activity, namely post-solution activity of selecting a particular data source or type of data to be manipulated. The courts have found limitations directed to obtaining information electronically, recited at a high level of generality, to be well-understood, routine, and conventional (see MPEP 2106.05(d)(II), “receiving or transmitting data over a network”, "electronic record keeping," and "storing and retrieving information in memory").
	The “calculating a second difference between the third measure of performance and the fourth measure of performance; and comparing the second difference with the first difference, the changing at least one current value from the set of current values being based on the comparing” limitation is recited at a high level of generality such that it amounts to no more than mere instructions to apply the exception using a generic computer component (See MPEP 2106.05(f)).
	Accordingly, at Step 2B, the additional element individually or in combination does not amount to significantly more than the judicial exception.
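	For illustration only, the arithmetic recited in claim 12 can be sketched as operations on four scalar performance measures taken in two time periods. The function names, the tolerance parameter, and all numeric values below are hypothetical and are not drawn from the application or from any cited reference.

```python
def performance_difference(measure_a: float, measure_b: float) -> float:
    """Difference between two measures of performance (b minus a)."""
    return measure_b - measure_a


def compare_differences(first_diff: float, second_diff: float,
                        tolerance: float = 0.0) -> str:
    """Compare the second-period difference against the first-period difference."""
    if second_diff > first_diff + tolerance:
        return "widening"
    if second_diff < first_diff - tolerance:
        return "narrowing"
    return "stable"


# Hypothetical measures. First time period: first policy vs. second policy.
first_diff = performance_difference(measure_a=0.60, measure_b=0.75)
# Second time period: third policy vs. second policy.
second_diff = performance_difference(measure_a=0.58, measure_b=0.90)

# The comparison result is what a hyperparameter change would be based on.
trend = compare_differences(first_diff, second_diff, tolerance=0.01)
```

	Under these assumed values the gap between the compared policies grows from the first period to the second, so the sketch reports a widening difference.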

Regarding claim 13
	Claim 13 corresponds to claims 1 and 4, with the added recitation of a generic hardware system and processors executing instructions to perform the same abstract method steps as those claims. The addition of generic components executing instructions is insufficient to render the claims subject matter eligible, for the same reasons described above.

Regarding claim 14	
Step 2A Prong 1
	Claim 14 does not recite an abstract idea, but is directed to the abstract idea identified in its parent claim(s).
	Accordingly, at Step 2A, prong one, the claim recites an abstract idea.
Step 2A Prong 2
	The judicial exception is not integrated into a practical application.  In particular, the claim recites the additional elements of “wherein the plurality of hyperparameters includes epsilon which indicates a coefficient of greediness associated with interactions of the agent with the environment, and the second measure of performance is based on interactions of the agent with the environment according to the second policy in which epsilon is indicated to be below a predefined threshold value”, which are recited at a high level of generality such that they amount to no more than generally linking the use of the abstract idea to a particular technological environment or field of use using a generic computer component (See MPEP 2106.05(h)). In particular, the limitation merely describes how the data is labeled for use in the claimed process.
	Accordingly, at Step 2A, prong two, the additional elements individually or in combination do not integrate the judicial exception into a practical application.
Step 2B
	In accordance with Step 2B, the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.  As discussed above, the additional elements of “wherein the plurality of hyperparameters includes epsilon which indicates a coefficient of greediness associated with interactions of the agent with the environment, and the second measure of performance is based on interactions of the agent with the environment according to the second policy in which epsilon is indicated to be below a predefined threshold value” are recited at a high level of generality such that they amount to no more than generally linking the use of the abstract idea to a particular technological environment or field of use using a generic computer component (See MPEP 2106.05(h)). In particular, the limitation merely describes how the data is labeled for use in the claimed process.
	Accordingly, at Step 2B, the additional element individually or in combination does not amount to significantly more than the judicial exception.
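	For illustration only, the role of epsilon recited in claim 14 can be sketched with a conventional epsilon-greedy action selector, in which a policy whose epsilon falls below a threshold behaves almost entirely greedily. The selector, the helper `is_near_greedy`, and the threshold of 0.05 are assumptions chosen for the sketch and are not taken from the application.

```python
import random
from typing import Sequence


def select_action(q_values: Sequence[float], epsilon: float,
                  rng: random.Random) -> int:
    """Epsilon-greedy selection: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        # Exploratory step: pick any action uniformly at random.
        return rng.randrange(len(q_values))
    # Greedy step: pick the action with the highest estimated value.
    return max(range(len(q_values)), key=q_values.__getitem__)


def is_near_greedy(epsilon: float, threshold: float = 0.05) -> bool:
    """Hypothetical check that epsilon is below a predefined threshold value."""
    return epsilon < threshold


# Hypothetical action-value estimates for a single state.
estimates = [0.1, 0.9, 0.3]
greedy_choice = select_action(estimates, epsilon=0.0, rng=random.Random(0))
```

	With epsilon set to zero the selector never explores, so the chosen action is always the one with the largest estimate.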

Regarding claim 15
	Claim 15 corresponds to claim 5, with the added recitation of a generic hardware system and processors executing instructions to perform the same abstract method steps as that claim. The addition of generic components executing instructions is insufficient to render the claims subject matter eligible, for the same reasons described above.

Regarding claim 16
Step 2A Prong 1
	Claim 16 does not recite an abstract idea, but is directed to the abstract idea identified in its parent claim(s).
	Accordingly, at Step 2A, prong one, the claim recites an abstract idea.
Step 2A Prong 2
	The judicial exception is not integrated into a practical application.  In particular, the claim recites the additional elements of “wherein the first measure of performance is a rate of rewards over a predetermined time period in a recent history of the agent's interactions with the environment according to the first policy, and the second measure of performance is a rate of rewards over the predetermined time period in the recent history of the agent's interactions with the environment according to the second policy”, which are recited at a high-level of generality such that they amount to extra-solution activity of identifying observed performance data, i.e. post-solution activity of selecting a particular data source or type of data to be manipulated for use in the claimed process (see MPEP 2106.05(g)).
	Accordingly, at Step 2A, prong two, the additional elements individually or in combination do not integrate the judicial exception into a practical application.
Step 2B
	In accordance with Step 2B, the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.  As discussed above, the additional elements of “wherein the first measure of performance is a rate of rewards over a predetermined time period in a recent history of the agent's interactions with the environment according to the first policy, and the second measure of performance is a rate of rewards over the predetermined time period in the recent history of the agent's interactions with the environment according to the second policy” are recited at a high level of generality and amount to extra-solution activity, namely post-solution activity of selecting a particular data source or type of data to be manipulated. The courts have found limitations directed to obtaining information electronically, recited at a high level of generality, to be well-understood, routine, and conventional (see MPEP 2106.05(d)(II), “receiving or transmitting data over a network”, "electronic record keeping," and "storing and retrieving information in memory").
	Accordingly, at Step 2B, the additional element individually or in combination does not amount to significantly more than the judicial exception.
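	For illustration only, a "rate of rewards over a predetermined time period in a recent history" as recited in claim 16 can be sketched as a sliding-window sum divided by the window length. The class name `RewardRateTracker`, the time units, and the sample data are hypothetical and are not taken from the application.

```python
from collections import deque
from typing import Deque, Tuple


class RewardRateTracker:
    """Tracks rewards and reports reward per unit time over a recent window."""

    def __init__(self, window: float) -> None:
        self.window = window
        self.events: Deque[Tuple[float, float]] = deque()  # (timestamp, reward)

    def record(self, timestamp: float, reward: float) -> None:
        self.events.append((timestamp, reward))

    def rate(self, now: float) -> float:
        # Discard rewards that fall outside the most recent window.
        while self.events and self.events[0][0] <= now - self.window:
            self.events.popleft()
        return sum(reward for _, reward in self.events) / self.window


# Hypothetical reward history for one policy, using a 10-unit window.
tracker = RewardRateTracker(window=10.0)
tracker.record(1.0, 1.0)
tracker.record(5.0, 2.0)
tracker.record(12.0, 3.0)
recent_rate = tracker.rate(now=12.0)
```

	One tracker per policy, queried over the same window, would yield the two rates that the claim treats as the first and second measures of performance.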

Regarding claim 17
	Claim 17 corresponds to claims 7 and 8, with the added recitation of a generic hardware system and processors executing instructions to perform the same abstract method steps as those claims. The addition of generic components executing instructions is insufficient to render the claims subject matter eligible, for the same reasons described above.

Regarding claim 18
Step 2A Prong 1
	Claim 18 does not recite an abstract idea, but is directed to the abstract idea identified in its parent claim(s).
	Accordingly, at Step 2A, prong one, the claim recites an abstract idea.
Step 2A Prong 2
	The judicial exception is not integrated into a practical application.  In particular, the claim recites the additional element of “wherein the hardware processor is configured to change the at least one current value from the set of current values such that the ratio is increased towards a target value”, which is recited at a high level of generality such that it amounts to extra-solution activity of optimizing a value toward a target, i.e. post-solution activity of selecting a particular data source or type of data to be manipulated for use in the claimed process (see MPEP 2106.05(g)).
	Accordingly, at Step 2A, prong two, the additional elements individually or in combination do not integrate the judicial exception into a practical application.
Step 2B
	In accordance with Step 2B, the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.  As discussed above, the additional element of “wherein the hardware processor is configured to change the at least one current value from the set of current values such that the ratio is increased towards a target value” is recited at a high level of generality and amounts to extra-solution activity, namely post-solution activity of selecting a particular data source or type of data to be manipulated. The courts have found limitations directed to obtaining information electronically, recited at a high level of generality, to be well-understood, routine, and conventional (see MPEP 2106.05(d)(II), “receiving or transmitting data over a network”, "electronic record keeping," and "storing and retrieving information in memory").
	Accordingly, at Step 2B, the additional element individually or in combination does not amount to significantly more than the judicial exception.

Regarding claim 19	
Step 2A Prong 1
	Claim 19 does not recite an abstract idea, but is directed to the abstract idea identified in its parent claim(s).
	Accordingly, at Step 2A, prong one, the claim recites an abstract idea.
Step 2A Prong 2
	The judicial exception is not integrated into a practical application.  In particular, the claim recites the additional element of “implement a second agent different than the first agent, the second agent configured to automatically perform the determining the first measure of performance, the determining the second measure of performance, and the changing of at least one current value from a set of current values”, which is recited at a high level of generality such that it amounts to extra-solution activity of selecting an additional source of interaction data, i.e. post-solution activity of selecting a particular data source or type of data to be manipulated for use in the claimed process (see MPEP 2106.05(g)).
	Accordingly, at Step 2A, prong two, the additional elements individually or in combination do not integrate the judicial exception into a practical application.
Step 2B
	In accordance with Step 2B, the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.  As discussed above, the additional element of “implement a second agent different than the first agent, the second agent configured to automatically perform the determining the first measure of performance, the determining the second measure of performance, and the changing of at least one current value from a set of current values” is recited at a high level of generality and amounts to extra-solution activity, namely post-solution activity of selecting a particular data source or type of data to be manipulated. The courts have found limitations directed to obtaining information electronically, recited at a high level of generality, to be well-understood, routine, and conventional (see MPEP 2106.05(d)(II), “receiving or transmitting data over a network”, "electronic record keeping," and "storing and retrieving information in memory").
	Accordingly, at Step 2B, the additional element individually or in combination does not amount to significantly more than the judicial exception.

Regarding claim 20
	Claim 20 corresponds to claim 3, with the added recitation of a generic hardware system and processors executing instructions to perform the same abstract method steps as that claim. The addition of generic components executing instructions is insufficient to render the claims subject matter eligible, for the same reasons described above.

Regarding claims 21-26
	Claims 21-26 recite a non-transitory computer readable medium. Each of these claims corresponds directly to the method and system steps recited in claims 1-20, and the addition of generic components executing instructions is insufficient to render the claims subject matter eligible, for the same reasons described above.
	Specifically:
	Claim 21 corresponds to claims 1, 10, 11 and 19, with the added recitation of a generic hardware system and processors executing instructions to perform the same abstract method steps as those claims.
	Claim 22 corresponds to claim 3, with the added recitation of a generic hardware system and processors executing instructions to perform the same abstract method steps as that claim.
	Claim 23 corresponds to claims 10 and 11, with the added recitation of a generic hardware system and processors executing instructions to perform the same abstract method steps as those claims.
	Claim 24 corresponds to claims 4 and 14, with the added recitation of a generic hardware system and processors executing instructions to perform the same abstract method steps as those claims.
	Claim 25 corresponds to claims 12 and 16, with the added recitation of a generic hardware system and processors executing instructions to perform the same abstract method steps as those claims.
	Claim 26 corresponds to claim 14, with the added recitation of a generic hardware system and processors executing instructions to perform the same abstract method steps as that claim.

Regarding claim 27	
Step 2A Prong 1
	Claim 27 does not recite an abstract idea, but is directed to the abstract idea identified in its parent claim(s).
	Accordingly, at Step 2A, prong one, the claim recites an abstract idea.
Step 2A Prong 2
	The judicial exception is not integrated into a practical application.  In particular, the claim recites the additional elements of “wherein the instructions comprising code to cause the processor to implement the second agent to perform the action include code to cause the processor to store a configuration including data associated with the first set of interactions based on the first policy, the second set of interactions based on the second policy, and the set of current values associated with the plurality of hyperparameters” which is recited at a high-level of generality such that it amounts to no more than mere instructions to apply the exception using generic computer components (See MPEP 2106.05(f)).	
	Accordingly, at Step 2A, prong two, the additional elements individually or in combination do not integrate the judicial exception into a practical application.
Step 2B
	In accordance with Step 2B, the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.  As discussed above, the additional element of “wherein the instructions comprising code to cause the processor to implement the second agent to perform the action include code to cause the processor to store a configuration including data associated with the first set of interactions based on the first policy, the second set of interactions based on the second policy, and the set of current values associated with the plurality of hyperparameters” is recited at a high level of generality such that it amounts to no more than mere instructions to apply the exception using a generic computer component (See MPEP 2106.05(f)).
	Accordingly, at Step 2B, the additional element individually or in combination does not amount to significantly more than the judicial exception.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA  to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

(g)(1) during the course of an interference conducted under section 135 or section 291, another inventor involved therein establishes, to the extent permitted in section 104, that before such person’s invention thereof the invention was made by such other inventor and not abandoned, suppressed, or concealed, or (2) before such person’s invention thereof, the invention was made in this country by another inventor who had not abandoned, suppressed, or concealed it. In determining priority of invention under this subsection, there shall be considered not only the respective dates of conception and reduction to practice of the invention, but also the reasonable diligence of one who was first to conceive and last to reduce to practice, from a time prior to conception by the other.


Claim(s) 1-6, 10-13, 15, 16, and 19-23 is/are rejected under 35 U.S.C. 102(a)(2) as being anticipated by McClement et al. (US 20220291642 A1, referred to as McClement).

	Regarding claim 1, McClement teaches, a method, comprising ([0029-0031]: Describes methods used to execute instructions for a deep reinforcement learning algorithm.): 
	receiving information associated with interactions of an agent with an environment, the interactions including a plurality of states associated with the environment and a plurality of actions associated with each state from the plurality of states, the interactions being according to a policy defined based on a plurality of hyperparameters (FIG. 2, and [0060-0065]: Describes a DRL agent interacting with an industrial process environment, observing process states and executing control actions according to a learned policy. It stores trajectories of state, action, and reward information for training and adaptation. These disclose receiving information associated with agent-environment interactions comprising states and actions according to a policy.); 
	receiving an indication of a target state to be achieved by the agent in the environment ([0006-0009]: Describes that the system receives an indication of a target state or objective to be achieved by an agent in an environment. The reinforcement learning system operates with respect to a control objective associated with an industrial process, wherein context data and historical process data are summarized to represent both process dynamics and a control objective.); 
	determining an indication of a set of current values, each current value from the set of current values being associated with a different hyperparameter from the plurality of hyperparameters, the plurality of hyperparameters being configured to impact the agent's interactions with the environment ([0044]: Describes that interactions with the environment generate experience tuples of states, actions, rewards, and next states, and that the actor selects actions using the state s and latent context variable z. These parameter values directly impact the agent’s interactions with the environment.; [0076]: Describes a hidden internal state z evolving with observed RL states, and says the agent conditions actions on s_t and z_t. It also says the “meta parameters quickly adapt a control policy for an MDP by solving for a suitable set of MDP-specific parameters”.; [0079]: Describes that the embeddings/hidden state “includes information to tune the PI parameters” and that the embeddings represent process-specific RL parameters that are updated as the agent’s knowledge of process dynamics changes, and that layers recommend adjustments to PI parameters. These correspond to a set of parameters with current values, used online, impacting interactions/actions, and updated over time.); and 
	modifying the policy by automatically changing at least one current value from the set of current values based on the information associated with the interactions of the agent with the environment and the indication of the target state to increase a likelihood of the agent achieving the target state or maximizing gain of rewards over time ([0072]: Describes that the meta-RL agent automatically adjusts control parameters used by the policy based on observed performance.; [0078]: Describes that these parameters govern policy behavior in controlling the process.; [0085]: Describes that the adjustment is performed autonomously to optimize reward or control objectives. These teach automatically modifying a policy by changing parameter values to improve reward outcomes.).  

	Regarding claim 2, McClement teaches, wherein the modifying the policy by automatically changing the at least one current value from the set of current values is done by the agent without an involvement of a user ([0076-0079]: Describes that the meta-RL agent autonomously adapts control parameters during operation based on observed interaction data and inferred process dynamics, without requiring operator input.; [0090]: Describes an algorithmic flow in which controller parameters are automatically updated in response to changing process conditions.). 
 
	Regarding claim 3, McClement teaches, wherein the plurality of hyperparameters includes at least one of lambda indicating a learning rate, gamma indicating a measure of discount of future rewards, or epsilon indicating a coefficient of greediness in the agent's interactions with the environment ([0044]: Describes that the plurality of hyperparameters includes at least one hyperparameter that governs learning or reward evaluation in reinforcement learning, namely an actor-critic reinforcement learning framework in which the agent learns from cumulative reward signals obtained through interactions with the environment.; [0076-0083]: Describes that the agent’s learning and policy adaptation are based on reward outcomes over episodes and trajectories, which inherently involve evaluating future rewards relative to current rewards, consistent with the use of a discount factor (γ) in reinforcement learning. The policy and controller parameters are updated through learning processes that depend on reward feedback and iterative parameter adjustment, which necessarily involve a learning rate-type hyperparameter controlling the magnitude of parameter updates.).
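
	For illustration only, the roles of the learning rate (called lambda in claim 3) and the discount factor gamma can be sketched with a textbook tabular Q-learning update. This sketch is a generic example and does not represent the algorithm of McClement or of the application; the table, states, and numeric values are hypothetical, and epsilon is omitted because it governs action selection rather than the update itself.

```python
from typing import Dict, List


def q_update(q: Dict[str, List[float]], state: str, action: int,
             reward: float, next_state: str,
             lam: float, gamma: float) -> float:
    """One tabular Q-learning step.

    lam   : learning rate, scales the size of each update.
    gamma : discount factor, weights future rewards against immediate ones.
    """
    best_next = max(q[next_state])
    td_target = reward + gamma * best_next
    q[state][action] += lam * (td_target - q[state][action])
    return q[state][action]


# Hypothetical two-state, two-action table initialised to zero.
q_table = {"s0": [0.0, 0.0], "s1": [0.0, 0.0]}
updated = q_update(q_table, "s0", 1, reward=1.0, next_state="s1",
                   lam=0.5, gamma=0.9)
```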

	Regarding claim 4, McClement teaches, wherein the policy is a first policy, the method further comprising: 
	determining a first measure of performance associated with interactions of the agent with the environment, the first measure of performance being based on the agent's interactions with the environment according to the first policy ([0044]: Describes determining a measure of performance associated with interactions of an agent with an environment according to a policy.  The interactions between the agent and the environment generate experience tuples including states, actions and rewards, and these rewards are used to evaluate the agent’s performance under a given policy.; [0076-0083]: Describes that the agent operates according to a policy during an episode and that cumulative reward or performance metrics derived from those interactions are used to assess control effectiveness. The cumulative reward or performance evaluation constitutes a first measure of performance based on interactions according to a first policy.); 
	determining a second measure of performance associated with interactions of the agent with the environment, the second measure of performance being based on the agent's interactions with the environment according to a second policy different than the first policy ([0076-0079]: Describes determining a second measure of performance associated with interactions of the agent with the environment according to a second policy different from the first policy. The meta-RL agent adapts and modifies controller parameters, resulting in different policy behaviors across episodes or operating conditions.; [0090]: Describes evaluating agent performance under these different parameterized policies as part of the learning and adaptation process. Each distinct parameter configuration defines a different policy governing agent interaction, and performance metrics are determined based on interactions under each such policy.); and 
	calculating a difference between the first measure of performance and the second measure of performance, the changing at least one current value from the set of current values being based on the difference ([0083-0086]: Describes adjusting controller parameters based on differences in performance observed under different policy behaviors. This evaluates performance metrics associated with interactions under different parameterized policies.; [0090]: Describes that controller parameters are updated based on such performance evaluations in order to improve control outcomes. Updating parameters in response to comparative performance results necessarily involves calculating a difference between performance measures obtained under different policies and using that difference as a basis for parameter modification.).  
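
	For illustration only, one way a difference between two performance measures could drive a change to a hyperparameter value is sketched below. The update rule, the function name `adjust_epsilon`, and the step and bound values are assumptions chosen for the sketch; neither claim 4 nor the cited passages of McClement are represented as using this particular rule.

```python
def adjust_epsilon(epsilon: float, first_measure: float, second_measure: float,
                   step: float = 0.5, eps_min: float = 0.01,
                   eps_max: float = 1.0) -> float:
    """Change epsilon based on the difference between two performance measures.

    first_measure  : performance under the first (current) policy.
    second_measure : performance under a second, more greedy policy.
    """
    difference = second_measure - first_measure
    if difference > 0:
        # The more greedy policy performed better: reduce exploration.
        proposed = epsilon * step
    else:
        # Otherwise increase exploration.
        proposed = epsilon / step
    # Keep the value inside hypothetical bounds.
    return min(eps_max, max(eps_min, proposed))


# Hypothetical measures of performance for the two policies.
new_epsilon = adjust_epsilon(0.2, first_measure=0.6, second_measure=0.9)
```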

	Regarding claim 5, McClement teaches, wherein the first measure of performance and the second measure of performance are at least one of a measure of rewards received in the course of the interactions of the agent with the environment or a measure of quality perceived by the agent in the course of the interactions of the agent with the environment ([0074]: Describes penalties associated with overshoot, oscillation, or excessive control action magnitude, which represent perceived quality of control during interaction with the environment.; [0080-0086]: Describes that the first and second measures of performance are based on rewards or measures of quality perceived during interactions between the agent and the environment. Performance is evaluated using cost or reward based metrics, wherein a critic estimates the agent’s discounted future cost based on interaction trajectories and such cost values are used to guide learning and adaptation of the policy. The performance evaluation also uses quality-based measures such as deviation from a target trajectory or setpoint. Performance is measured using squared error from a target trajectory, and system output trajectories are compared against desired target trajectories to assess control quality.).  

	Regarding claim 6, McClement teaches, wherein the second policy is associated with greedy interactions of the agent with the environment ([0074]: Describes that the second policy is associated with greedy interactions of the agent with the environment. It describes the different policy behaviors resulting from different parameterizations of the controller, including deterministic and probabilistic policy variants, and compares their performance across environments.; [0083-0086]: Describes that certain controllers avoid oscillatory or exploratory behavior and instead select actions that prioritize immediate control performance under the current environment conditions. It further discloses the use of discounted future cost evaluation, which emphasizes near-term reward or cost outcomes and is characteristic of greedy or exploitation-oriented policy behavior. These correspond to deterministic or exploitation-focused controllers that select actions intended to directly optimize performance metrics such as tracking accuracy or penalty minimization, rather than exploring alternative actions.).  

	Regarding claim 10, McClement teaches, wherein the information associated with interactions of the agent with the environment includes a context associated with the environment, the context indicating a non-stationary nature of the environment ([0074-0079]: Describes context-based meta-reinforcement learning in which a latent representation is learned for each task or environment, enabling the agent to capture process dynamics and underlying structure of different plants. During an episode the agent maintains a hidden internal state z_t that evolves over time based on observed states of the environment, and recurrent layers create embeddings containing information about system dynamics and uncertainty that are updated as the agent’s knowledge of the process dynamics changes.; [0086]: Describes environments with changing process dynamics, including drifting process lag time and step changes in process gain, and explains that the meta-RL agent adapts controller parameters in response to these changes. The changing dynamics are indicative of a non-stationary environment. This corresponds to the information associated with interactions of the agent with the environment including context associated with the environment, with such context indicating a non-stationary nature of the environment.).  

	Regarding claim 11, McClement teaches, wherein the environment is a first environment, the method further comprising: 
	retrieving context associated with a second environment different from the first environment, the context associated with the second environment indicating a non-stationary nature of the second environment ([0074-0076]: Describes retrieving context associated with a second environment that is different from a first environment. The meta-reinforcement learning is performed over a distribution of Markov decision processes (MDPs), such as MDP 1 and MDP 2, where each MDP corresponds to a different environment or task. The context-based meta-RL methods learn a latent representation for each task or environment, enabling the meta-RL agent to capture the dynamics and structure of different plants. For each environment, the agent maintains a hidden internal state zt that evolves over time based on observed states, providing environment-specific context information.; [0086]: Describes that environments may exhibit changing process dynamics, such as drifting lag time or step changes in process gain, which are indicative of the non-stationary nature of the environment.); and 
	comparing the context associated with the first environment with the context associated with the second environment, the modifying the policy by changing at least one current value from the set of current values being based on the comparing ([0074-0079]: Describes learning latent representations that encode environment-specific dynamics for different MDPs within a task distribution. These context representations are used by the meta-RL agent to adapt and tune controller parameters, such that different parameter values and policy behaviors are selected depending on the learned context of the environment. By learning and utilizing different context embeddings for different environments, the meta-RL agent inherently compares context information across environments to determine how policy parameters should be adapted.; [0090]: Describes that controller parameters are adaptively updated as process dynamics change over time, demonstrating that policy modification is based on the learned and compared context associated with different environments.).

	Regarding claim 12, McClement teaches, wherein the first measure of performance and the second measure of performance are based on the agent's interactions with the environment in a predetermined first time period, the difference between the first measure of performance and the second measure of performance being a first difference, the method further comprising ([0076]: Describes that the measures of performance are based on agent interactions occurring over predetermined time periods. For each MDP the agent interacts with the environment over an episode having a defined number of time steps T, during which performance is evaluated.; [0080-0085]: Describes evaluating performance using cumulative or discounted cost over such episodes and measuring tracking error or control quality over defined time horizons following setpoint changes. Performance measures associated with different policy parameterizations are compared, thereby yielding a difference between performance measures within a given time period.): 
	determining a third measure of performance associated with interactions of the agent with the environment, the third measure of performance being based on the agent's interactions with the environment according to a third policy different than the second policy, the agent's interactions with the environment being in a predetermined second time period different than the first time period ([0076-0090]: Describes determining additional measures of performance under different policies and at different times. The meta-RL agent adapts controller parameters over time, resulting in different policy behaviors across successive episodes or operating periods. The performance is evaluated under these different parameterized policies, including after changes in process dynamics or controller adaptation. These evaluations occur over different episodes or time horizons, each of which corresponds to a predetermined time period.); 
	determining a fourth measure of performance associated with interactions of the agent with the environment, the fourth measure of performance being based on the agent's interactions with the environment according to the second policy and in the predetermined second time period ([0076-0090]: Describes evaluating performance of the agent under a given policy during a later time period. As discussed above, performance metrics are repeatedly determined for a controller policy as the agent interacts with the environment over successive episodes or time horizons. Because performance is evaluated for different parameterized policies during different episodes, including after parameter adaptation or environmental changes, this teaches determining a fourth measure of performance associated with interactions according to a previously evaluated policy during the predetermined second time period.); 
	calculating a second difference between the third measure of performance and the fourth measure of performance ([0074-0085]: Describes comparing outcomes obtained under different policy behaviors during a later time period. Performance metrics are evaluated for different parameter configurations, and such evaluations are used to assess control effectiveness over time. Because determining comparative performance between two policies within the same time period necessarily involves calculating a difference between performance measures, this yields a second difference.); and 
	comparing the second difference with the first difference, the changing at least one current value from the set of current values being based on the comparing ([0076-0090]: Describes repeatedly evaluating agent performance under different policies and updating controller parameters in response to observed improvements or degradations in performance. Controller parameters are modified based on comparative performance evaluations across time. Updating parameters based on whether performance differences improve or worsen over successive time periods necessarily involves comparing performance differences obtained at different times and adjusting parameter values accordingly.).

	Regarding claim 13, which recites substantially the same limitations as claims 1 and 4. Claim 13 further recites a memory and a hardware processor (McClement FIG. 1 and [0038]: Describes that the methods are configured to run on a system with at least one processor connected to memory.) to perform the method steps of claims 1 and 4, respectively, and is therefore rejected on the same premise.

	Regarding claim 15, which recites substantially the same limitations as claim 5. Claim 15 further recites the apparatus of claim 13 (McClement FIG. 1 and [0038]: Describes that the methods are configured to run on a system with at least one processor connected to memory.) to perform the method steps of claim 5, and is therefore rejected on the same premise.

	Regarding claim 16. The apparatus of claim 13, wherein the first measure of performance is a rate of rewards over a predetermined time period in a recent history of the agent's interactions with the environment according to the first policy, and the second measure of performance is a rate of rewards over the predetermined time period in the recent history of the agent's interactions with the environment according to the second policy ([0076-0085]: Describes evaluating cumulative or discounted reward over defined episodes or finite horizons and measuring performance over recent interaction periods following setpoint changes. Such evaluations inherently represent reward rates over predetermined time periods.).  
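
	For illustration only, the "rate of rewards over a predetermined time period in a recent history" recited in claim 16 can be read as total reward in a trailing window divided by the number of steps in that window. The sketch below is a hypothetical reading, not taken from the application or from McClement; the function name `reward_rate` and the `window` parameter are assumptions introduced here.

```python
def reward_rate(rewards, window):
    """Average reward per step over the most recent `window` steps.

    `rewards` is a per-step reward history for one policy (hypothetical
    representation); `window` is the predetermined time period in steps.
    """
    if window <= 0:
        raise ValueError("window must be a positive number of steps")
    recent = rewards[-window:]  # trailing slice; shorter if history is short
    if not recent:
        raise ValueError("no rewards recorded yet")
    return sum(recent) / len(recent)


# Same window applied to the histories collected under two policies.
first_policy_rate = reward_rate([0.0, 0.0, 1.0, 1.0, 1.0, 0.0], window=4)
second_policy_rate = reward_rate([1.0, 1.0, 1.0, 1.0, 1.0, 1.0], window=4)
```

	Using the same window for both policies keeps the two rates comparable, which is what the claim language about "the predetermined time period" appears to require.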

	Regarding claim 19, McClement teaches, wherein the agent is a first agent and the hardware processor is further configured to:
	implement a second agent different than the first agent, the second agent configured to automatically perform the determining the first measure of performance, the determining the second measure of performance, and the changing of at least one current value from a set of current values ([0076-0079]: Describes a meta-reinforcement learning architecture comprising an RL agent that performs control actions in the environment and a meta-RL agent that learns and applies parameter updates to tune the RL agent. The meta-RL agent operates online to determine performance information based on the RL agent's interactions and to recommend or apply adjustments to controller parameters. The meta-RL agent performs the determining of performance measures and the changing of current values associated with policy parameters, while being distinct from the RL agent that interacts with the environment.).

	Regarding claim 20, which recites substantially the same limitations as claim 3. Claim 20 further recites the apparatus of claim 13 (McClement FIG. 1 and [0038]: Describes that the methods are configured to run on a system with at least one processor connected to memory.) to perform the method steps of claim 3, and is therefore rejected on the same premise.

	Regarding claim 21, which recites substantially the same limitations as claims 1, 10, 11 and 19. Claim 21 further recites a non-transitory processor-readable medium storing code representing instructions to be executed by a processor (McClement [0006], [0038], and [0048]: Describes a system using memory to store instructions to execute the programmed instructions.) to perform the method and system steps of claims 1, 10, 11 and 19, respectively, and is therefore rejected on the same premise.

	Regarding claim 22, which recites substantially the same limitations as claim 3. Claim 22 further recites the medium of claim 21 (McClement [0006], [0038], and [0048]: Describes a system using memory to store instructions to execute the programmed instructions.) to perform the method steps of claim 3, and is therefore rejected on the same premise.

	Regarding claim 23, which recites substantially the same limitations as claims 10 and 11. Claim 23 further recites the medium of claim 21 (McClement [0006], [0038], and [0048]: Describes a system using memory to store instructions to execute the programmed instructions.) to perform the method steps of claims 10 and 11, respectively, and is therefore rejected on the same premise.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA  to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 7, 8, 9, 17, and 18 is/are rejected under 35 U.S.C. 103 as being unpatentable over McClement et al. (US 20220291642 A1, referred to as McClement), in view of Tamar et al. (“Policy Gradients with Variance Related Risk Criteria”, referred to as Tamar).

	Regarding claim 7. McClement teaches, the method of claim 4.
	Although McClement teaches the method of claim 4, it does not teach computing a measure of variance associated with the first measure of performance associated with the interactions of the agent with the environment; and adjusting the difference between the first measure of performance and the second measure of performance based on the measure of variance, the changing at least one current value from the set of current values being based on the difference after the adjusting.
	Tamar teaches, computing a measure of variance associated with the first measure of performance associated with the interactions of the agent with the environment (Page 2-3, Section 2.1: Describes a variance V of cumulative reward for a policy and computes it from agent interaction trajectories. Corresponding to computing a measure of variance associated with performance of agent interactions.); and 
	adjusting the difference between the first measure of performance and the second measure of performance based on the measure of variance, the changing at least one current value from the set of current values being based on the difference after the adjusting (Page 2-3, Section 2.1: Describes adjusting performance based optimization using a measure of variance. It incorporates variance of the return, including normalizing expected performance by the standard deviation of the return or subtracting a variance related penalty term. The variance aware objectives adjust differences in performance by accounting for variability in rewards thereby producing a variance adjusted performance signal. It further updates policy parameters based on this variance adjusted objective through policy gradient methods.).  
	It would have been obvious to one of ordinary skill in the art at the time of the claimed invention to have combined the teachings of McClement with the teachings of Tamar. Doing so would enable the system to provide a basis for adjusting the difference between the first and second performance measures based on variance, and for changing at least one current value from the set of current values based on the adjusted difference.

	Regarding claim 8, Tamar teaches, wherein the adjusting the difference between the first measure of performance and the second measure of performance based on the measure of variance includes computing a ratio of the difference between the first measure of performance and the second measure of performance and the measure of variance (Tamar Page 2-3, Section 2.1 and Equation 4: Describes optimizing the Sharpe Ratio, defined as expected return divided by the standard deviation of return (Eq. 4), which is a ratio of performance difference normalized by variance. This teaches adjusting a performance difference using a variance-based ratio.).

	Regarding claim 9, Tamar teaches, wherein the first measure of performance is an expected value of rewards associated with the agents interactions with the environment according to the first policy and the second measure of performance is an expected value of rewards associated with the agent's interactions with the environment according to the second policy, the method further comprising (As taught by McClement in claims 4-5, see above.): 	
	computing a standard deviation associated with the rewards associated with the agent's interactions with the environment according to the first policy (Page 2-3, Section 2.1: Describes computing a standard deviation of rewards associated with agent interactions under a policy. V is defined as the variance of the cumulative reward (return) for a policy and further utilize the standard deviation $\sqrt{V}$ in their risk-sensitive optimization objectives. Since the standard deviation is directly derived from the computed variance of rewards obtained through agent-environment interaction, this teaches computing a standard deviation associated with the rewards for a policy.); and 
	computing a Sharpe ratio of a difference the first measure of performance and the second measure of performance and the standard deviation associated with the rewards associated with the agent's interactions with the environment according to the first policy, the modifying the policy by automatically changing at least one current value from the set of current values being based on the Sharpe ratio (Page 2-3, Section 2.1: Describes Sharpe Ratio as the ratio of expected reward to reward standard deviation, and further describe policy updates based on this ratio.).
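
	To make the quantity discussed for claims 8 and 9 concrete, the following is a minimal sketch of a Sharpe-style ratio: the difference between the expected rewards under two policies, divided by the standard deviation of the rewards under the first policy. It is an illustrative reading of the claim language only, not the applicant's implementation and not the formulation in Tamar; the function name `sharpe_style_ratio`, the use of sample means as expected values, and the choice of population standard deviation are all assumptions made here.

```python
from statistics import mean, pstdev


def sharpe_style_ratio(rewards_first, rewards_second):
    """(mean reward under policy 1 - mean reward under policy 2)
    divided by the standard deviation of policy 1's rewards.

    Sample means stand in for expected values (an assumption), and the
    population standard deviation is used for simplicity.
    """
    difference = mean(rewards_first) - mean(rewards_second)
    sigma = pstdev(rewards_first)
    if sigma == 0:
        raise ValueError("standard deviation is zero; ratio is undefined")
    return difference / sigma


# Hypothetical reward samples: first policy has mean 2.0 and std 1.0,
# second policy has mean 1.0, so the ratio is (2.0 - 1.0) / 1.0.
first = [1.0, 3.0, 1.0, 3.0]
second = [1.0, 1.0, 1.0, 1.0]
ratio = sharpe_style_ratio(first, second)
```

	Dividing by the standard deviation scales the performance gap by how noisy the first policy's rewards are, so the same raw difference counts for less when rewards vary widely.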

	Regarding claim 17, which recites substantially the same limitations as claims 7 and 8. Claim 17 further recites the apparatus of claim 13 (McClement FIG. 1 and [0038]: Describes that the methods are configured to run on a system with at least one processor connected to memory.) to perform the method steps of claims 7 and 8, respectively, and is therefore rejected on the same premise.

	Regarding claim 18, Tamar teaches, wherein the hardware processor is configured to change the at least one current value from the set of current values such that the ratio is increased towards a target value (Pages 2-4, Section 2 and Equation 12: Describes changing policy parameters so as to increase a variance-normalized performance ratio toward an optimal target value. It identifies as a performance criterion the objective to maximize the Sharpe Ratio. Further taught are gradient-based parameter update rules that adjust policy parameters in the direction that increases the Sharpe Ratio objective (Equation 12), thereby increasing the ratio toward its maximizing value.).

Claim(s) 14 and 24-27 is/are rejected under 35 U.S.C. 103 as being unpatentable over McClement et al. (US 20220291642 A1, referred to as McClement), in view of Eleftheriadis et al. (US 20200218999 A1, referred to as Eleftheriadis).

	Regarding claim 14,
	McClement teaches evaluating performance under different policies and automatically updating policy-defining parameters based on performance differences ([0074-0090]), teaching an exploitation-focused (greedy) policy variant for performance evaluation. But it does not teach, wherein the plurality of hyperparameters includes epsilon which indicates a coefficient of greediness associated with interactions of the agent with the environment, and the second measure of performance is based on interactions of the agent with the environment according to the second policy in which epsilon is indicated to be below a predefined threshold value.
	Eleftheriadis teaches, wherein the plurality of hyperparameters includes epsilon which indicates a coefficient of greediness associated with interactions of the agent with the environment, and the second measure of performance is based on interactions of the agent with the environment according to the second policy in which epsilon is indicated to be below a predefined threshold value ([0034]: Describes epsilon-greedy reinforcement learning in which epsilon is a hyperparameter indicating a coefficient of greediness or exploration, and wherein reducing epsilon below a predefined threshold causes the agent to select greedy actions with higher probability.).
	It would have been obvious to one of ordinary skill in the art at the time of the claimed invention to have combined the teachings of McClement with Eleftheriadis. Doing so would have enabled the system to control policy greediness when evaluating performance under a second policy.
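
	As background for the epsilon limitation of claim 14, the sketch below shows textbook epsilon-greedy action selection, where epsilon is the probability of taking a random (exploratory) action and a small epsilon makes the policy nearly greedy. It is a generic illustration and is not drawn from the application, McClement, or Eleftheriadis; the helper `is_near_greedy` and its default threshold of 0.05 are hypothetical names and values chosen here.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the best one.

    `q_values` is a list of estimated action values (hypothetical input);
    `rng` is a random.Random instance so results can be reproduced.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    # exploit: index of the largest estimated value
    return max(range(len(q_values)), key=q_values.__getitem__)


def is_near_greedy(epsilon, threshold=0.05):
    """True when epsilon is below the (assumed) predefined threshold."""
    return epsilon < threshold


rng = random.Random(0)
q_values = [0.1, 0.9, 0.3]
# With epsilon = 0 the random branch is never taken, so the action is
# always the argmax (index 1 here).
greedy_action = epsilon_greedy(q_values, epsilon=0.0, rng=rng)
```

	Under this reading, a "second policy" whose epsilon falls below the threshold behaves almost entirely by exploitation, which is the sense in which the office action treats it as greedy.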

	Regarding claim 24, which recites substantially the same limitations as claim 14. Claim 24 further recites (Eleftheriadis [0034]: Describes epsilon-greedy reinforcement learning in which epsilon controls exploration versus greedy action selection.) to perform the system steps of claim 14, and is therefore rejected on the same premise.

	Regarding claim 25, which recites substantially the same limitations as claims 12 and 16. Claim 25 further recites the medium of claim 21 (McClement [0006], [0038], and [0048]: Describes a system using memory to store instructions to execute the programmed instructions.) to perform the method steps of claims 12 and 16, respectively, and is therefore rejected on the same premise.

	Regarding claim 26, which recites substantially the same limitations as claim 14. Claim 26 further recites the medium of claim 21 (McClement [0006], [0038], and [0048]: Describes a system using memory to store instructions to execute the programmed instructions.) and an epsilon hyperparameter (Eleftheriadis [0034]: Describes epsilon-greedy reinforcement learning in which epsilon controls exploration versus greedy action selection.) to perform the system steps of claim 14, and is therefore rejected on the same premise.

	Regarding claim 27, McClement in view of Eleftheriadis teaches, the non-transitory processor-readable medium of claim 24.
	McClement in view of Eleftheriadis further teaches, wherein the instructions comprising code to cause the processor to implement the second agent to perform the action include code to cause the processor to store a configuration including data associated with the first set of interactions based on the first policy, the second set of interactions based on the second policy, and the set of current values associated with the plurality of hyperparameters.  
	McClement describes storing interaction data from interactions (first set of interactions and second set of interactions: [0074-0090]: Describes that interactions generate experience data that are stored in memory for use in performance evaluation and controller tuning. It further teaches evaluating and adapting controller behavior under different policy parameterizations, generating and storing interaction data corresponding to different policies. McClement teaches storing data associated with a first set of interactions based on a first policy and a second set of interactions based on a second policy.). However, it does not teach that the system stores policy data/policy structure.
	Eleftheriadis describes storing policy data ([0054-0055]: Describes storing interaction/experience information in an experience database which stores experience data associated with agent-environment interactions, and which may be provided for use in an experience buffer for additional learning or evaluation.; [0064-0066]: Describes storing configuration in memory in the buffer and policy data in local device memory, maintaining stored data associated with interactions and stored policy-related information used to control the agent's behavior.).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. See attached PTO-892 for additional references including:
US 20200320435 A1: reinforcement learning agents
US 20210158162 A1: hyperparameter tuning in reinforcement learning
US 20220398460 A1: non-stationary environments

Any inquiry concerning this communication or earlier communications from the examiner should be directed to DONALD T RODEN whose telephone number is (571)272-6441. The examiner can normally be reached Mon-Thur 8:00-5:00 EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Omar Fernandez Rivas can be reached at (571) 272-2589. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.


/D.T.R./Examiner, Art Unit 2128                                                                                                                                                                                                        
/OMAR F FERNANDEZ RIVAS/Supervisory Patent Examiner, Art Unit 2128