Prosecution Insights
Last updated: April 19, 2026
Application No. 18/365,818

APPARATUS AND METHOD FOR AUTOMATED REWARD SHAPING

Non-Final OA: §101, §103, §112
Filed: Aug 04, 2023
Examiner: HALES, BRIAN J
Art Unit: 2125
Tech Center: 2100 (Computer Architecture & Software)
Assignee: Huawei Technologies Co., Ltd.
OA Round: 1 (Non-Final)
Grant Probability: 77% (Favorable)
OA Rounds: 1-2
Time to Grant: 4y 0m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 77% (above average; 65 granted of 84 resolved; +22.4% vs TC avg)
Interview Lift: +32.0% (strong); compares allowance for resolved cases with vs. without an interview
Typical Timeline: 4y 0m average prosecution; 22 applications currently pending
Career History: 106 total applications across all art units
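As a quick check on the headline figure (assuming the career allow rate is simply granted cases divided by resolved cases, which the page does not state explicitly), the counts above give:

\[
\text{Career allow rate} \approx \frac{65}{84} \approx 0.774 \approx 77\%
\]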

Statute-Specific Performance

§101: 36.2% (-3.8% vs TC avg)
§103: 30.6% (-9.4% vs TC avg)
§102: 5.1% (-34.9% vs TC avg)
§112: 26.0% (-14.0% vs TC avg)
Comparisons are against a Tech Center average estimate; based on career data from 84 resolved cases.
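If each delta is read as the examiner's statute-specific rate minus the Tech Center average (an assumed interpretation, not stated on the page), all four rows imply the same baseline:

\[
36.2\% + 3.8\% \;=\; 30.6\% + 9.4\% \;=\; 5.1\% + 34.9\% \;=\; 26.0\% + 14.0\% \;=\; 40.0\%
\]

so the comparison appears to be made against a single TC-wide estimate of roughly 40%.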

Office Action

Rejections under §101, §103, and §112
DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. This action is in response to preliminary amendments and remarks filed on 08/04/2023. In the current amendments, the specification is amended, and the abstract is amended. Claims 1-15 are pending and have been examined.

Information Disclosure Statement
The information disclosure statements (IDS) submitted on 07/16/2024, 07/24/2024, and 10/22/2024 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 4-7 and 10-12 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.
Claim 4 recites the limitation “the respective iteration” in line 3. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the respective iteration” has been interpreted as “a respective iteration”.
Claim 5 recites the limitation “the outcome” in line 2. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the outcome” has been interpreted as “an outcome”.
Claim 5 recites the limitation “the sum” in line 3. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the sum” has been interpreted as “a sum”.
Claim 6 recites the limitation “the reward function” in line 1. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the reward function” has been interpreted as “the predetermined reward function” in reference to “a predetermined reward function” in lines 13-14 of claim 1.
Claim 6 recites the limitation “the objective” in line 2. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the objective” has been interpreted as “the predetermined objective” in reference to “a predetermined objective” in line 3 of claim 1.
Claim 7 recites the limitation “the outcome” in line 3. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the outcome” has been interpreted as “an outcome”.
Claim 10 recites the limitation “the subsequent environmental state” in lines 1-2. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the subsequent environmental state” has been interpreted as “a subsequent environmental state”.
Claim 10 recites the limitation “the first agent function” in line 2. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the first agent function” has been interpreted as “a first agent function”.
Claim 10 recites the limitation “the current environmental state” in lines 2-3. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the current environmental state” has been interpreted as “a current environmental state”.
Claim 11 recites the limitation “the performance” in line 1. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the performance” has been interpreted as “a performance”.
Claim 11 recites the limitation “the first agent function” in line 2. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the first agent function” has been interpreted as “a first agent function”.
Claim 11 recites the limitation “the subsequent environmental state” in lines 2-3. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the subsequent environmental state” has been interpreted as “a subsequent environmental state”.
Claim 11 recites the limitation “the current environmental state” in line 3. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the current environmental state” has been interpreted as “a current environmental state”.
Claim 12 recites the limitation “the predetermined objective” in line 16. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the predetermined objective” has been interpreted as “a predetermined objective”.
Dependent claim 6 is rejected based on being directly or indirectly dependent on rejected claim 5.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-15 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.

Regarding Claim 1, Claim 1 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1 Analysis: Claim 1 is directed to an apparatus, which is directed to a machine, one of the statutory categories.
Step 2A Prong One Analysis: The limitations: “form an output value function for achieving a predetermined objective” “(i) implementing a current state of the first agent function in dependence on a current environmental state to form a subsequent environmental state and a first reward” “(ii) a first determining step comprising determining, using the second agent function, whether to use a second reward” “(iii) in a condition where that first determining step has a negative outcome, refining the first agent function in dependence on the first reward” “otherwise in a condition where the first determining step has a positive outcome, computing the second reward according to a predetermined reward function and refining the first agent function in dependence on the first reward and the second reward” “(iv) refining the second agent function in dependence on a performance of the first agent function in meeting the predetermined objective” “(v) adopting the subsequent environmental state as the current environmental state” As drafted, under their broadest reasonable interpretations, cover mental processes (concepts performed in the human mind (including an observation, evaluation, judgement, opinion)) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)) and insignificant extra-solution activity (See MPEP 2106.05(g)). The above limitations in the context of this claim encompass forming an output value function for achieving a predetermined objective (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can form an output value function); implementing a current state of the first agent function depending on a current environmental state to form a subsequent environmental state and a first reward (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can form a subsequent environmental state and a first reward by implementing a current state of the first agent function in dependence on a current environmental state); determining whether to use a second reward based on using the second agent function (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can use the second agent function to determine whether to use a second reward); refining the first agent function in dependence on the first reward when the first determining step has a negative outcome (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can, when the first determining step has a negative outcome, refine the first agent function in dependence on the first reward); computing the second reward according to a predetermined reward function and refining the first agent function in dependence on the first and second rewards when the first determining step has a positive outcome (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can, when the first determining step has a positive outcome, use a predetermined reward function to compute the second reward and refine the first agent function in dependence on the first and second rewards); refining the second agent function in dependence on the performance of the first agent function meeting the predetermined objective (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can, based on the performance of the first agent function in meeting the predetermined 
objective, refine the second agent function); and adopting the subsequent environmental state as the current environmental state (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can adopt the subsequent state as the current state). Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that are mere instructions to apply (See MPEP 2106.05(f)) or insignificant extra-solution activity (See MPEP 2106.05(g)). The limitations: “one or more processors” As drafted, are additional elements that amount to no more than mere instructions to apply the exception for the abstract ideas. See MPEP 2106.05(f). The limitations: “receiving an initial environment state, an initial state of a first agent function, and an initial state of a second agent function” “outputting the current state of the first agent function as the output value function” As drafted, are additional elements that correspond to insignificant extra-solution activity. In particular, the additional elements are merely directed towards mere data gathering. See MPEP 2106.05(g). Therefore, the additional elements do not integrate the abstract ideas into a practical application. Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (I.e. the additional elements describe generic processors for applying the abstract ideas) or insignificant extra-solution activity (i.e. receiving and outputting/transmitting data). Furthermore, the “receiving …” and “outputting …” limitations are insignificant extra-solution activity that is well-understood, routine, and conventional according to MPEP 2106.05(d) (“The courts have recognized the following computer functions as well‐understood, routine, and conventional functions when they are claimed in a merely generic manner (e.g., at a high level of generality) or as insignificant extra-solution activity… i. Receiving or transmitting data over a network). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible. Regarding Claim 2, Claim 2 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1 Analysis: Claim 2 is directed to an apparatus, which is directed to a machine, one of the statutory categories. Step 2A Prong One Analysis: The limitations: “wherein the first determining step comprises computing a binary value representing whether or not to use the second reward” As drafted, under their broadest reasonable interpretations, cover mental processes (concepts performed in the human mind (including an observation, evaluation, judgement, opinion)) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)) and insignificant extra-solution activity (See MPEP 2106.05(g)). The above limitations in the context of this claim encompass computing a binary value representing whether or not to use the second reward (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can compute a binary value representing whether to use the second reward or not). 
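For readers less familiar with the claimed reinforcement-learning loop, the following minimal Python sketch reconstructs steps (i) through (v) of claim 1 together with the binary gate recited in claim 2. It is purely illustrative: the class names, update rules, reward functions, and thresholds are hypothetical placeholders reconstructed from the claim language, not the applicant's disclosed implementation.

```python
# Illustrative toy sketch of the claimed loop (claims 1-2); all names are hypothetical.
import random

class FirstAgent:
    """Toy stand-in for the first agent function (the learned output value function)."""
    def __init__(self):
        self.value = 0.0
    def act(self, state):
        # choose a small random adjustment to the environment state
        return random.uniform(-0.1, 0.1)
    def refine(self, reward):
        # crude running-average update driven by whatever reward is supplied
        self.value += 0.1 * (reward - self.value)
    def performance(self):
        return self.value

class SecondAgent:
    """Toy stand-in for the second agent function, which gates the second reward."""
    def __init__(self):
        self.threshold = 0.5
    def decide(self, next_state):
        # claim 2: the first determining step yields a binary value
        return next_state < self.threshold
    def refine(self, performance):
        # step (iv): adjust in dependence on the first agent's performance (toy rule)
        self.threshold = 0.9 * self.threshold + 0.1 * (1.0 - performance)

def predetermined_reward_function(next_state):
    # hypothetical shaping reward ("second reward")
    return 0.5 * (1.0 - next_state)

def environment_step(state, action):
    next_state = min(1.0, max(0.0, state + action))
    first_reward = next_state            # "first reward" produced by the transition
    return next_state, first_reward

def train(initial_state=0.0, iterations=100):
    first_agent, second_agent = FirstAgent(), SecondAgent()
    state = initial_state
    for _ in range(iterations):
        # (i) implement the current state of the first agent function
        action = first_agent.act(state)
        next_state, first_reward = environment_step(state, action)
        # (ii) first determining step: should the second reward be used?
        if not second_agent.decide(next_state):
            # (iii) negative outcome: refine on the first reward alone
            first_agent.refine(first_reward)
        else:
            # positive outcome: compute the second reward and refine on both
            second_reward = predetermined_reward_function(next_state)
            first_agent.refine(first_reward + second_reward)
        # (iv) refine the second agent in dependence on the first agent's performance
        second_agent.refine(first_agent.performance())
        # (v) adopt the subsequent environmental state as the current one
        state = next_state
    # output the current state of the first agent function as the output value function
    return first_agent

if __name__ == "__main__":
    print(train().value)
```

In this toy reading, the `first_reward + second_reward` update also corresponds to claim 5's recitation of refining on the sum of the two rewards; the adaptive threshold is just one arbitrary way to depict step (iv).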
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that are mere instructions to apply (See MPEP 2106.05(f)) or insignificant extra-solution activity (See MPEP 2106.05(g)). The recitation of additional elements in claim 1 of generic processors are reciting mere instructions to apply language such that it amounts to no more than mere instructions to apply the exceptions. Furthermore, the “receiving …” and “outputting …” limitations of claim 1 are additional elements that correspond to insignificant extra-solution activity as mere data gathering. Therefore, the additional elements do not integrate the abstract ideas into a practical application. Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (I.e. the additional elements describe generic processors for applying the abstract ideas) or insignificant extra-solution activity (i.e. receiving and outputting/transmitting data). Furthermore, the “receiving …” and “outputting …” limitations are insignificant extra-solution activity that is well-understood, routine, and conventional according to MPEP 2106.05(d) (“The courts have recognized the following computer functions as well‐understood, routine, and conventional functions when they are claimed in a merely generic manner (e.g., at a high level of generality) or as insignificant extra-solution activity… i. Receiving or transmitting data over a network). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible. Regarding Claim 3, Claim 3 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1 Analysis: Claim 3 is directed to an apparatus, which is directed to a machine, one of the statutory categories. Step 2A Prong One Analysis: The limitations: “wherein the step of refining the second agent function is performed in dependence on an objective function, which comprises a negative cost element upon determining, on a respective iteration, that the determination of whether to use the second reward has a positive outcome” As drafted, under their broadest reasonable interpretations, cover mental processes (concepts performed in the human mind (including an observation, evaluation, judgement, opinion)) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)) and insignificant extra-solution activity (See MPEP 2106.05(g)). The above limitations in the context of this claim encompass refining the second agent function based on an objective function comprising a negative cost element based on determination of whether to use the second reward having a positive outcome (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can, when the determination of whether to use the second reward has a positive outcome, refine the second agent function in dependence on an objective function comprising a negative cost element). Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. 
In particular, the claim recites additional elements that are mere instructions to apply (See MPEP 2106.05(f)) or insignificant extra-solution activity (See MPEP 2106.05(g)). The recitation of additional elements in claim 1 of generic processors are reciting mere instructions to apply language such that it amounts to no more than mere instructions to apply the exceptions. Furthermore, the “receiving …” and “outputting …” limitations of claim 1 are additional elements that correspond to insignificant extra-solution activity as mere data gathering. Therefore, the additional elements do not integrate the abstract ideas into a practical application. Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (I.e. the additional elements describe generic processors for applying the abstract ideas) or insignificant extra-solution activity (i.e. receiving and outputting/transmitting data). Furthermore, the “receiving …” and “outputting …” limitations are insignificant extra-solution activity that is well-understood, routine, and conventional according to MPEP 2106.05(d) (“The courts have recognized the following computer functions as well‐understood, routine, and conventional functions when they are claimed in a merely generic manner (e.g., at a high level of generality) or as insignificant extra-solution activity… i. Receiving or transmitting data over a network). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible. Regarding Claim 4, Claim 4 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1 Analysis: Claim 4 is directed to an apparatus, which is directed to a machine, one of the statutory categories. Step 2A Prong One Analysis: The limitations: “wherein the step of refining the second agent function comprises a second determining step comprising: determining whether the subsequent environmental state formed on the respective iteration is in a set of relatively infrequently visited states” “wherein the step of refining the second agent function is performed in dependence on an objective function, which comprises a positive reward element in a condition where, on a respective iteration, that the second determining step has a positive outcome” As drafted, under their broadest reasonable interpretations, cover mental processes (concepts performed in the human mind (including an observation, evaluation, judgement, opinion)) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)) and insignificant extra-solution activity (See MPEP 2106.05(g)). 
The above limitations in the context of this claim encompass refining the second agent function based on determining whether the subsequent environmental state is in a set of relatively infrequently visited states (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can determine whether or not the subsequent environmental state is in a set of relatively infrequently visited states for refining the second agent function); and refining the second agent function based on an objective function comprising a positive reward element based on determination of whether the subsequent environmental state is in a set of relatively infrequently visited states (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can, when the determination of whether to use the second reward has a positive outcome, refine the second agent function in dependence on an objective function comprising a positive reward element). Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that are mere instructions to apply (See MPEP 2106.05(f)) or insignificant extra-solution activity (See MPEP 2106.05(g)). The recitation of additional elements in claim 1 of generic processors are reciting mere instructions to apply language such that it amounts to no more than mere instructions to apply the exceptions. Furthermore, the “receiving …” and “outputting …” limitations of claim 1 are additional elements that correspond to insignificant extra-solution activity as mere data gathering. Therefore, the additional elements do not integrate the abstract ideas into a practical application. Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (I.e. the additional elements describe generic processors for applying the abstract ideas) or insignificant extra-solution activity (i.e. receiving and outputting/transmitting data). Furthermore, the “receiving …” and “outputting …” limitations are insignificant extra-solution activity that is well-understood, routine, and conventional according to MPEP 2106.05(d) (“The courts have recognized the following computer functions as well‐understood, routine, and conventional functions when they are claimed in a merely generic manner (e.g., at a high level of generality) or as insignificant extra-solution activity… i. Receiving or transmitting data over a network). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible. Regarding Claim 5, Claim 5 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1 Analysis: Claim 5 is directed to an apparatus, which is directed to a machine, one of the statutory categories. 
Step 2A Prong One Analysis: The limitations: “in a condition where the outcome of the first determining step is positive, refine the first agent function in dependence on the sum of the first reward and the second reward” As drafted, under their broadest reasonable interpretations, cover mental processes (concepts performed in the human mind (including an observation, evaluation, judgement, opinion)) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)) and insignificant extra-solution activity (See MPEP 2106.05(g)). The above limitations in the context of this claim encompass refining the first agent function in dependence on the sum of the first and second rewards when the outcome of the first determining step is positive (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can, when the outcome of the first determining step is positive, refine the first agent function based on the sum of the first and second rewards). Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that are mere instructions to apply (See MPEP 2106.05(f)) or insignificant extra-solution activity (See MPEP 2106.05(g)). The limitations: “the one or more processors” As drafted, are additional elements that amount to no more than mere instructions to apply the exception for the abstract ideas. See MPEP 2106.05(f). In addition, the recitation of additional elements in claim 1 of generic processors are reciting mere instructions to apply language such that it amounts to no more than mere instructions to apply the exceptions. Furthermore, the “receiving …” and “outputting …” limitations of claim 1 are additional elements that correspond to insignificant extra-solution activity as mere data gathering. Therefore, the additional elements do not integrate the abstract ideas into a practical application. Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (I.e. the additional elements describe generic processors for applying the abstract ideas) or insignificant extra-solution activity (i.e. receiving and outputting/transmitting data). Furthermore, the “receiving …” and “outputting …” limitations are insignificant extra-solution activity that is well-understood, routine, and conventional according to MPEP 2106.05(d) (“The courts have recognized the following computer functions as well‐understood, routine, and conventional functions when they are claimed in a merely generic manner (e.g., at a high level of generality) or as insignificant extra-solution activity… i. Receiving or transmitting data over a network). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible. Regarding Claim 6, Claim 6 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1 Analysis: Claim 6 is directed to an apparatus, which is directed to a machine, one of the statutory categories. 
Step 2A Prong One Analysis: The limitations: “wherein the reward function is such that summing the first reward and the second reward preserves pursuit of the objective” As drafted, is part of the abstract idea of claim 5 of computing the second reward according to a predetermined reward function. The limitation of claim 6 further limits the limitation of claim 5 by further defining what the reward function comprises. The above limitation in the context of this claim encompasses when the first determining step has a positive outcome, computing the second reward according to a predetermined reward function, the reward function is such that summing the first reward and the second reward preserves pursuit of the objective, and refining the first agent function in dependence of the first and second rewards (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can, when the first determining step has a positive outcome, use a predetermined reward function, the reward function being such that summing the first reward and the second reward preserves pursuit of the objective, to compute the second reward and refine the first agent function in dependence of the first and second rewards); Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that are mere instructions to apply (See MPEP 2106.05(f)) or insignificant extra-solution activity (See MPEP 2106.05(g)). The recitation of additional elements in claim 5 of generic processors are reciting mere instructions to apply language such that it amounts to no more than mere instructions to apply the exceptions. Furthermore, the “receiving …” and “outputting …” limitations of claim 5 are additional elements that correspond to insignificant extra-solution activity as mere data gathering. Therefore, the additional elements do not integrate the abstract ideas into a practical application. Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (I.e. the additional elements describe generic processors for applying the abstract ideas) or insignificant extra-solution activity (i.e. receiving and outputting/transmitting data). Furthermore, the “receiving …” and “outputting …” limitations are insignificant extra-solution activity that is well-understood, routine, and conventional according to MPEP 2106.05(d) (“The courts have recognized the following computer functions as well‐understood, routine, and conventional functions when they are claimed in a merely generic manner (e.g., at a high level of generality) or as insignificant extra-solution activity… i. Receiving or transmitting data over a network). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible. Regarding Claim 7, Claim 7 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1 Analysis: Claim 7 is directed to an apparatus, which is directed to a machine, one of the statutory categories. 
Step 2A Prong One Analysis: The limitations: “compute the second reward only in a condition where the outcome of the first determining step is positive” As drafted, under their broadest reasonable interpretations, cover mental processes (concepts performed in the human mind (including an observation, evaluation, judgement, opinion)) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)) and insignificant extra-solution activity (See MPEP 2106.05(g)). The above limitations in the context of this claim encompass computing the second reward only when the outcome of the first determining step is positive (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can, when the outcome of the first determining step is positive, compute the second reward). Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that are mere instructions to apply (See MPEP 2106.05(f)) or insignificant extra-solution activity (See MPEP 2106.05(g)). The limitations: “the one or more processors” As drafted, are additional elements that amount to no more than mere instructions to apply the exception for the abstract ideas. See MPEP 2106.05(f). In addition, the recitation of additional elements in claim 1 of generic processors are reciting mere instructions to apply language such that it amounts to no more than mere instructions to apply the exceptions. Furthermore, the “receiving …” and “outputting …” limitations of claim 1 are additional elements that correspond to insignificant extra-solution activity as mere data gathering. Therefore, the additional elements do not integrate the abstract ideas into a practical application. Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (I.e. the additional elements describe generic processors for applying the abstract ideas) or insignificant extra-solution activity (i.e. receiving and outputting/transmitting data). Furthermore, the “receiving …” and “outputting …” limitations are insignificant extra-solution activity that is well-understood, routine, and conventional according to MPEP 2106.05(d) (“The courts have recognized the following computer functions as well‐understood, routine, and conventional functions when they are claimed in a merely generic manner (e.g., at a high level of generality) or as insignificant extra-solution activity… i. Receiving or transmitting data over a network). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible. Regarding Claim 8, Claim 8 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1 Analysis: Claim 8 is directed to an apparatus, which is directed to a machine, one of the statutory categories. 
Step 2A Prong One Analysis: The limitations: “wherein the first reward is determined in dependence on the subsequent environmental state” As drafted, under their broadest reasonable interpretations, cover mental processes (concepts performed in the human mind (including an observation, evaluation, judgement, opinion)) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)) and insignificant extra-solution activity (See MPEP 2106.05(g)). The above limitations in the context of this claim encompass determining the first reward in dependence on the subsequent environmental state (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can determine the first reward based on the subsequent environmental state). Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that are mere instructions to apply (See MPEP 2106.05(f)) or insignificant extra-solution activity (See MPEP 2106.05(g)). The recitation of additional elements in claim 1 of generic processors are reciting mere instructions to apply language such that it amounts to no more than mere instructions to apply the exceptions. Furthermore, the “receiving …” and “outputting …” limitations of claim 1 are additional elements that correspond to insignificant extra-solution activity as mere data gathering. Therefore, the additional elements do not integrate the abstract ideas into a practical application. Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (I.e. the additional elements describe generic processors for applying the abstract ideas) or insignificant extra-solution activity (i.e. receiving and outputting/transmitting data). Furthermore, the “receiving …” and “outputting …” limitations are insignificant extra-solution activity that is well-understood, routine, and conventional according to MPEP 2106.05(d) (“The courts have recognized the following computer functions as well‐understood, routine, and conventional functions when they are claimed in a merely generic manner (e.g., at a high level of generality) or as insignificant extra-solution activity… i. Receiving or transmitting data over a network). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible. Regarding Claim 9, Claim 9 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1 Analysis: Claim 9 is directed to an apparatus, which is directed to a machine, one of the statutory categories. 
Step 2A Prong One Analysis: The limitations: “form an output value function for achieving a predetermined objective by iteratively learning successive candidates for the output value function” “(i) in each iteration, a first reward dependent on an environmental state determined by a current state of the output value function” “(ii) in at least some iterations, a second reward formed by a second value function” “learn the second value function over successive iterations” As drafted, under their broadest reasonable interpretations, cover mental processes (concepts performed in the human mind (including an observation, evaluation, judgement, opinion)) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)). The above limitations in the context of this claim encompass forming an output value function for achieving a predetermined objective by iteratively learning successive candidates for the output value function (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can learn successive candidates for the output value function in order to form an output value function for achieving a predetermined objective); determining a first reward dependent on an environmental state based on a current state of the output value function (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can, based on a current state of the output value function, determine a first reward dependent on an environmental state); forming a second reward by using a second value function (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can use a second value function to form a second reward); and learning the second value function over successive iterations (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can learn the second value function over successive iterations). Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that are mere instructions to apply (See MPEP 2106.05(f)). The limitations: “one or more processors” As drafted, are additional elements that amount to no more than mere instructions to apply the exception for the abstract ideas. See MPEP 2106.05(f). Therefore, the additional elements do not integrate the abstract ideas into a practical application. Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (I.e. the additional elements describe generic processors for applying the abstract ideas). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible. Regarding Claim 10, Claim 10 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1 Analysis: Claim 10 is directed to an apparatus, which is directed to a machine, one of the statutory categories. 
Step 2A Prong One Analysis: The limitations: “wherein the subsequent environmental state is formed by a single iteration of the first agent function taking the current environmental state as input” As drafted, under their broadest reasonable interpretations, cover mental processes (concepts performed in the human mind (including an observation, evaluation, judgement, opinion)) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)). The above limitations in the context of this claim encompass forming the subsequent environmental state based on a single iteration of the first agent function taking the current environmental state as input (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can form the subsequent environmental state based on a single iteration of the first agent function taking the current environmental state as input). Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that are mere instructions to apply (See MPEP 2106.05(f)). The recitation of additional elements in claim 9 of generic processors are reciting mere instructions to apply language such that it amounts to no more than mere instructions to apply the exceptions. Therefore, the additional elements do not integrate the abstract ideas into a practical application. Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (I.e. the additional elements describe generic processors for applying the abstract ideas). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible. Regarding Claim 11, Claim 11 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1 Analysis: Claim 11 is directed to an apparatus, which is directed to a machine, one of the statutory categories. Step 2A Prong One Analysis: The limitations: “wherein the performance of the first agent function in meeting the predetermined objective is formed in dependence on the subsequent environmental state and/or the current environmental state” As drafted, under their broadest reasonable interpretations, cover mental processes (concepts performed in the human mind (including an observation, evaluation, judgement, opinion)) and mathematical concepts (mathematical relationships, mathematical formulas or equations, mathematical calculations) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)) and insignificant extra-solution activity (See MPEP 2106.05(g)). The above limitations in the context of this claim encompass forming the performance of the first agent function in meeting the predetermined objective based on the subsequent and/or current environmental state (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can form, in dependence on the subsequent and/or current environmental state, the performance of the first agent function in meeting the predetermined objective). Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. 
In particular, the claim recites additional elements that are mere instructions to apply (See MPEP 2106.05(f)). The recitation of additional elements in claim 9 of generic processors are reciting mere instructions to apply language such that it amounts to no more than mere instructions to apply the exceptions. Therefore, the additional elements do not integrate the abstract ideas into a practical application. Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (I.e. the additional elements describe generic processors for applying the abstract ideas). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible. Regarding Claim 12, Claim 12 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1 Analysis: Claim 12 is directed to a method, which is directed to a process, one of the statutory categories. Step 2A Prong One Analysis: The limitations: “forming an output value function” “(i) implementing a current state of the first agent function in dependence on a current environmental state to form a subsequent environmental state and a first reward” “(ii) a first determining step comprising determining, using the second agent function, whether to use a second reward” “(iii) in a condition where the first determining step has a negative outcome, refining the first agent function in dependence on the first reward” “otherwise in a condition where the first determining step has a positive outcome, computing the second reward according to a predetermined reward function and refining the first agent function in dependence on the first reward and the second reward” “(iv) refining the second agent function in dependence on a performance of the first agent function in meeting the predetermined objective” “(v) adopting the subsequent environmental state as the current environmental state” As drafted, under their broadest reasonable interpretations, cover mental processes (concepts performed in the human mind (including an observation, evaluation, judgement, opinion)) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)) and insignificant extra-solution activity (See MPEP 2106.05(g)). 
The above limitations in the context of this claim encompass forming an output value function (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can form an output value function); implementing a current state of the first agent function depending on a current environmental state to form a subsequent environmental state and a first reward (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can form a subsequent environmental state and a first reward by implementing a current state of the first agent function in dependence on a current environmental state); determining whether to use a second reward based on using the second agent function (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can use the second agent function to determine whether to use a second reward); refining the first agent function in dependence on the first reward when the first determining step has a negative outcome (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can, when the first determining step has a negative outcome, refine the first agent function in dependence on the first reward); computing the second reward according to a predetermined reward function and refining the first agent function in dependence on the first and second rewards when the first determining step has a positive outcome (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can, when the first determining step has a positive outcome, use a predetermined reward function to compute the second reward and refine the first agent function in dependence on the first and second rewards); refining the second agent function in dependence on the performance of the first agent function meeting the predetermined objective (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can, based on the performance of the first agent function in meeting the predetermined objective, refine the second agent function); and adopting the subsequent environmental state as the current environmental state (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can adopt the subsequent state as the current state). Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that are mere instructions to apply (See MPEP 2106.05(f)) or insignificant extra-solution activity (See MPEP 2106.05(g)). The limitations: “computer” As drafted, are additional elements that amount to no more than mere instructions to apply the exception for the abstract ideas. See MPEP 2106.05(f). The limitations: “receiving an initial environment state, an initial state of a first agent function and an initial state of a second agent function” “outputting the current state of the first agent function as the output value function” As drafted, are additional elements that correspond to insignificant extra-solution activity. In particular, the additional elements are merely directed towards mere data gathering. See MPEP 2106.05(g). Therefore, the additional elements do not integrate the abstract ideas into a practical application. 
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (I.e. the additional elements describe a generic computer for applying the abstract ideas) or insignificant extra-solution activity (i.e. receiving and outputting/transmitting data). Furthermore, the “receiving …” and “outputting …” limitations are insignificant extra-solution activity that is well-understood, routine, and conventional according to MPEP 2106.05(d) (“The courts have recognized the following computer functions as well‐understood, routine, and conventional functions when they are claimed in a merely generic manner (e.g., at a high level of generality) or as insignificant extra-solution activity… i. Receiving or transmitting data over a network). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible. Regarding Claim 13, Claim 13 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1 Analysis: Claim 13 is directed to a method, which is directed to a process, one of the statutory categories. Step 2A Prong One Analysis: The limitations: “forming an output value function for achieving a predetermined objective” “iteratively learning successive candidates for the output value function” “(i) in each iteration, a first reward dependent on an environmental state determined by a current state of the output value function” “(ii) in at least some iterations, a second reward formed by a second value function” “learning the second value function over successive iterations” As drafted, under their broadest reasonable interpretations, cover mental processes (concepts performed in the human mind (including an observation, evaluation, judgement, opinion)) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)). The above limitations in the context of this claim encompass forming an output value function for achieving a predetermined objective (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can form an output value function for achieving a predetermined objective); iteratively learning successive candidates for the output value function (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can learn successive candidates for the output value function); determining a first reward dependent on an environmental state based on a current state of the output value function (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can, based on a current state of the output value function, determine a first reward dependent on an environmental state); forming a second reward by using a second value function (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can use a second value function to form a second reward); and learning the second value function over successive iterations (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can learn the second value function over successive iterations). 
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that are mere instructions to apply (See MPEP 2106.05(f)) or insignificant extra-solution activity (See MPEP 2106.05(g)). The limitations: “computer” As drafted, are additional elements that amount to no more than mere instructions to apply the exception for the abstract ideas. See MPEP 2106.05(f). Therefore, the additional elements do not integrate the abstract ideas into a practical application. Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (I.e. the additional elements describe a generic computer for applying the abstract ideas). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible. Regarding Claim 14, Claim 14 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1 Analysis: Claim 14 is directed to an apparatus, which is directed to a machine, one of the statutory categories. Step 2A Prong One Analysis: The limitations: “process that input using a function outputted as an output value function by the apparatus of claim 1” As drafted, under their broadest reasonable interpretations, cover mental processes (concepts performed in the human mind (including an observation, evaluation, judgement, opinion)) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)) and insignificant extra-solution activity (See MPEP 2106.05(g)). The above limitations in the context of this claim encompass processing an input using a function outputted as an output value function (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can use a function output as an output value function from an apparatus to process an input). Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that are mere instructions to apply (See MPEP 2106.05(f)) or insignificant extra-solution activity (See MPEP 2106.05(g)). The limitations: “computer” As drafted, are additional elements that amount to no more than mere instructions to apply the exception for the abstract ideas. See MPEP 2106.05(f). The limitations: “receive an input” As drafted, are additional elements that correspond to insignificant extra-solution activity. In particular, the additional elements are merely directed towards mere data gathering. See MPEP 2106.05(g). In addition, the recitation of additional elements in claim 1 of generic processors are reciting mere instructions to apply language such that it amounts to no more than mere instructions to apply the exceptions. Furthermore, the “receiving …” and “outputting …” limitations of claim 1 are additional elements that correspond to insignificant extra-solution activity as mere data gathering. Therefore, the additional elements do not integrate the abstract ideas into a practical application. Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. 
As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (I.e. the additional elements describe a generic computer and processors for applying the abstract ideas) or insignificant extra-solution activity (i.e. receiving and outputting/transmitting data). Furthermore, the “receiving …”, “receive …”, and “outputting …” limitations are insignificant extra-solution activity that is well-understood, routine, and conventional according to MPEP 2106.05(d) (“The courts have recognized the following computer functions as well‐understood, routine, and conventional functions when they are claimed in a merely generic manner (e.g., at a high level of generality) or as insignificant extra-solution activity… i. Receiving or transmitting data over a network). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible. Regarding Claim 15, Claim 15 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1 Analysis: Claim 15 is directed to an apparatus, which is directed to a machine, one of the statutory categories. Step 2A Prong One Analysis: Please see the analysis of claim 14. The limitations of claim 15 are only additional elements to the abstract ideas of claim 14. Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that are mere instructions to apply (See MPEP 2106.05(f)) or insignificant extra-solution activity (See MPEP 2106.05(g)). The limitations: “computer” As drafted, are additional elements that amount to no more than mere instructions to apply the exception for the abstract ideas. See MPEP 2106.05(f). The limitations: “wherein the input is an input sensed from an environment in which the data processing apparatus is located” As drafted, is part of the insignificant extra-solution activity of claim 14 of receiving an input. The limitation of claim 15 further limits the limitation of claim 14 by further defining what the input comprises. In addition, the recitation of additional elements in claim 14 of a generic computer and processors are reciting mere instructions to apply language such that it amounts to no more than mere instructions to apply the exceptions. Furthermore, the “receiving …”, “receive …”, and “outputting …” limitations of claim 14 are additional elements that correspond to insignificant extra-solution activity as mere data gathering. Therefore, the additional elements do not integrate the abstract ideas into a practical application. Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (I.e. the additional elements describe a generic computer and processors for applying the abstract ideas) or insignificant extra-solution activity (i.e. receiving and outputting/transmitting data). 
Furthermore, the “receiving …”, “receive …”, and “outputting …” limitations are insignificant extra-solution activity that is well-understood, routine, and conventional according to MPEP 2106.05(d) (“The courts have recognized the following computer functions as well‐understood, routine, and conventional functions when they are claimed in a merely generic manner (e.g., at a high level of generality) or as insignificant extra-solution activity… i. Receiving or transmitting data over a network). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible. Claim Rejections - 35 USC § 103 In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status. The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made. The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows: 1. Determining the scope and contents of the prior art. 2. Ascertaining the differences between the prior art and the claims at issue. 3. Resolving the level of ordinary skill in the pertinent art. 4. Considering objective evidence present in the application indicating obviousness or nonobviousness. This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention. Claims 1-8, 12, and 14-15 are rejected under 35 U.S.C. 103 as being unpatentable over Kvernvik et al. (US 2020/0344682 A1) in view of Van Seijen et al. (US 2018/0165603 A1). Regarding Claim 1, Kvernvik et al. teaches a machine learning apparatus, the machine learning apparatus comprising one or more processors configured to: form an output value function for achieving a predetermined objective (Fig. 1; [0035]: "FIG. 1 shows a first wireless device 100 according to some embodiments herein. The first wireless device 100 is connected to a first wireless access point in a first wireless communications network. The first wireless communications network is operated by a first network operator. 
The first wireless device 100 comprises a processor 102 and a memory 104. The memory 104 contains instructions executable by the processor 102. The first wireless device 100 may be operative to perform the methods described herein" teaches a wireless device (apparatus) comprising a processor 102. Fig. 1; [0038]-[0041]: "the first wireless device 100 is operative to (e.g. adapted to) acquire a determination from a first reinforcement learning agent 106 of whether to roam … As noted above, the use of reinforcement learning agents to determine whether to roam may allow for improved decision making in a moving device … Use of a reinforcement learning agent allows decisions to be updated (e.g. through learning and updating a model associated with the first reinforcement learning agent) dynamically as the environment changes, based on previous decisions (or actions) performed by the first reinforcement learning agent … reinforcement learning is a type of machine learning process whereby a reinforcement learning agent (e.g. algorithm) is used to take decisions (e.g. perform actions) on a system to adjust the system according to an objective (which may, for example, comprise moving the system towards an optimal or preferred state of the system) … a reinforcement learning agent receives an observation from the environment in state S and selects an action to maximize the expected future reward r. Based on the expected future rewards, a value function V for each state can be calculated and an optimal policy rr that maximizes the long term value function can be derived" teaches that the wireless device (apparatus) includes a reinforcement learning (machine learning) agent to calculate (form) a value function V (output value function) to achieve an objective (predetermined objective)) by receiving an initial environment state, an initial state of a first agent function, and an initial state of a second agent function ([0041]-[0042]: "a reinforcement learning agent receives an observation from the environment in state S and selects an action to maximize the expected future reward r. Based on the expected future rewards, a value function V for each state can be calculated and an optimal policy rr that maximizes the long term value function can be derived … In the context of this disclosure, the first wireless device and its surroundings (e.g. the system that the first wireless device is within) comprises the “environment” in the state S. The state may comprise the location and/or direction of travel of the first wireless device that may be derived from current and past information about the first wireless device … “Actions” performed by the reinforcement learning agents comprise the decisions or determinations made by the reinforcement agents as to whether a wireless device should roam from a first wireless access point to a second wireless access point. Generally, the reinforcement learning agents herein receive feedback in the form of a reward or credit assignment every time they make a determination (e.g. every time they instigate an action). A reward is allocated depending on the goal of the system" teaches receiving an initial environmental state and an initial observations (states) for the reinforcement learning agents (initial states of agent functions) to make determinations for actions based on calculated rewards. 
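(Notation note: the quoted passages from Kvernvik et al. use standard reinforcement-learning shorthand, and the "optimal policy rr" appearing in the quotations looks like a text-extraction artifact for the policy symbol π. Written out in conventional textbook form, purely as an illustrative reconstruction and not as language from either reference, the value function and optimal policy relied on in this mapping are

V^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \;\middle|\; s_{t}=s\right], \qquad \pi^{*} \;=\; \arg\max_{\pi} V^{\pi}(s) \ \text{for every state } s,

where γ ∈ [0, 1) is the discount factor and r denotes the per-step reward.)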
[0125]: "Observations of the environment comprise the observations of the “state”, as explained previously, and also comprise the components that make up the reward function (for example, the reward function or reward may be calculated based on the numerical values of the observations of the environment or numerical values representing the state). The reinforcement learning agent 1012 sits in an application 1010 in the cloud and receives state and reward information" teaches that the initial observations of the environment make up the initial state and initial components of the reward function for the reinforcement learning agents (e.g. initial states of the agent functions). [0074]: "In some examples, as will be discussed in more detail below, the first reinforcement learning agent shares a reward function with (e.g. is rewarded in the same way as) a second reinforcement learning agent, the second reinforcement learning agent being associated with a second wireless device" teaches that the reinforcement learning agents include a first reinforcement learning agent (includes first agent function) and a second reinforcement learning agent (includes second agent function) that share a reward function (e.g. initial observed states are used for the first and second agent functions)); iteratively perform the steps of: (i) implementing a current state of the first agent function in dependence on a current environmental state to form a subsequent environmental state and a first reward ([0054]-[0055]: "the determination may be based on (e.g. the input parameters or the current state input to the first reinforcement learning agent may comprise) an indication of a location of the first wireless device … As described above, when a reinforcement learning agent makes a determination (e.g. action), the first reinforcement learning agent receives a reward, based on the change that that action had on the system. In some embodiments, after an action (such as roaming) has been performed, the method may further comprise sending one or more observations (e.g. such as the connectivity or one or more quality metrics associated with the wireless access point that is serving the first wireless device) to the first reinforcement learning agent so that a reward may be determined from the observations. In some embodiments, the first reinforcement learning agent receives a reward by means of a parameter (e.g. a numerical value) indicative of the reward" teaches the first reinforcement learning agent (e.g. including the first agent function) taking a current state input (current environmental state) and determining an action to be performed, then the first reinforcement learning agent receives observations of the environment after performing the action (subsequent environmental state) and a reward based on the change the performed action had on the system (first reward). [0039]: "Use of a reinforcement learning agent allows decisions to be updated (e.g. through learning and updating a model associated with the first reinforcement learning agent) dynamically as the environment changes, based on previous decisions (or actions) performed by the first reinforcement learning agent. This means that as new cells (or wireless access points) are added or existing cells change or are updated, the decision making process may be automatically updated with no human intervention. 
This may ensure that optimal connectivity is achieved with minimal roaming, even under changing conditions" teaches that the reinforcement learning agents dynamically update (e.g. iteratively) the action decisions as the environment changes based on previous decisions through learning and updating a model associated with the reinforcement learning agents); (iv) refining the second agent function in dependence on a performance of the first agent function in meeting the predetermined objective ([0112]-[0114]: "the first reward function may be updated (or defined) through a machine learning process. For example, a machine learning algorithm may be used to determine the most appropriate groupings and/or the most appropriate reward function for wireless devices in a group according to the effect that different values of rewards have on the roaming behavior … many types of machine learning processes may be used to update the first reward function in this manner, including but not limited to the use of unsupervised methods such as clustering (e.g. k-means may be performed on the characteristics of each device) or supervised methods (e.g. such as the use of neural networks), if labelled data is available … The skilled person will appreciate that the teachings above may be applied to more than one group of wireless devices, each group having a different reward function. For example, in some embodiments, the method 800 may further comprise allocating a parameter indicative of a reward to a third reinforcement learning agent based on an action determined by the third reinforcement learning agent for a third wireless device, wherein the third wireless device is part of a second group of wireless devices. In this embodiment, allocating a parameter indicative of a reward to a third reinforcement learning agent may comprise allocating a parameter indicative of a reward using a second reward function, the second reward function being different to the first reward function … the second group of wireless devices may comprise any one of the types of groups of wireless devices listed above for the first group of wireless devices. In this way, rewards may efficiently be allocated to wireless devices in each group to achieve the optimal connectivity according to the needs/requirements of wireless devices in each group" teaches that a second reward function (second agent function) for a second group of wireless devices (e.g. second reinforcement agent) may be updated (refined) through a machine learning process along with (in dependence on) a first reward for a first group of wireless devices (first reinforcement agent) to efficiently allocate rewards for the wireless devices in each group to achieve the objective)); and (v) adopting the subsequent environmental state as the current environmental state ([0055]: "when a reinforcement learning agent makes a determination (e.g. action), the first reinforcement learning agent receives a reward, based on the change that that action had on the system. In some embodiments, after an action (such as roaming) has been performed, the method may further comprise sending one or more observations (e.g. such as the connectivity or one or more quality metrics associated with the wireless access point that is serving the first wireless device) to the first reinforcement learning agent so that a reward may be determined from the observations. In some embodiments, the first reinforcement learning agent receives a reward by means of a parameter (e.g. 
a numerical value) indicative of the reward" teaches the first reinforcement learning agent (e.g. including the first agent function) taking a current state input (current environmental state) and determining an action to be performed, then the first reinforcement learning agent receives observations of the environment after performing the action (subsequent environmental state) (e.g. the subsequent environmental state formed based on the performed action has become the observed current environmental state)); and subsequently: outputting the current state of the first agent function as the output value function ([0041]-[0042]: "a reinforcement learning agent receives an observation from the environment in state S and selects an action to maximize the expected future reward r. Based on the expected future rewards, a value function V for each state can be calculated and an optimal policy rr that maximizes the long term value function can be derived … In the context of this disclosure, the first wireless device and its surroundings (e.g. the system that the first wireless device is within) comprises the “environment” in the state S. The state may comprise the location and/or direction of travel of the first wireless device that may be derived from current and past information about the first wireless device … “Actions” performed by the reinforcement learning agents comprise the decisions or determinations made by the reinforcement agents as to whether a wireless device should roam from a first wireless access point to a second wireless access point. Generally, the reinforcement learning agents herein receive feedback in the form of a reward or credit assignment every time they make a determination (e.g. every time they instigate an action). A reward is allocated depending on the goal of the system" teaches that the state of the reinforcement learning agent (current state of the first agent function) is calculated (output) as a value function V (output value function)). Kvernvik et al. does not appear to explicitly teach (ii) a first determining step comprising determining, using the second agent function, whether to use a second reward; (iii) in a condition where that first determining step has a negative outcome, refining the first agent function in dependence on the first reward; and otherwise in a condition where the first determining step has a positive outcome, computing the second reward according to a predetermined reward function and refining the first agent function in dependence on the first reward and the second reward. However, Van Seijen et al. teaches (ii) a first determining step comprising determining, using the second agent function, whether to use a second reward (Fig. 3; [0080]-[0081]: "As illustrated, the environment actions e (as illustrated, e1 through en) of the agents i can be fed into an aggregator function ƒ (as illustrated, ƒ). The aggregator function ƒ maps the environment actions en to an action aflat (as illustrated aflat). From the input space γ, each agent can receive a subset of the input space xi (as illustrated, x.sup.1 through xn). Formally, state space xi of an agent i is a projection of Y:=xflat×C1× . . . ×Cn onto a subspace of Y, such as: xi=σi(Y). … Additionally, each agent can have its own reward function, ri: xi× ai × xi -> R, and a discount factor γi: xi × ai × xi -> [0, 1], and can aim to find a policy πi: xi × ai -> [0,1] that maximizes the return based on these functions. 
In an example, Πi is defined to be the space of all policies for agent i" teaches that each agent determines its own policy based on the inputs to try to maximize the reward function, each policy being based on computing a value [0,1] to determine whether to use the policy or not, meaning that the reward for each policy is being determined to be used or not (e.g. the policy for the second agent (second agent function) is determined to be used based on computing a value [0,1], meaning the reward for the second agent (second reward) is determined whether to be used based on the policy of the second agent (second agent function)). [0073]: "Each policy π has a corresponding action-value function, qπ(x, a), which gives the expected value of the return G.sub.t conditioned on the state xϵX and action aϵA" teaches that each policy corresponds to an action-value function for the agent); (iii) in a condition where that first determining step has a negative outcome, refining the first agent function in dependence on the first reward (Fig. 3; [0081]-[0082]: "Additionally, each agent can have its own reward function, ri: xi× ai × xi -> R, and a discount factor γi: xi × ai × xi -> [0, 1], and can aim to find a policy πi: xi × ai -> [0,1] that maximizes the return based on these functions. In an example, Πi is defined to be the space of all policies for agent I … Given a learning method that converges to the optimal policy on a single-agent MDP task, applying this method independently to each of the agents of the SoC model, the overall policy of the SoC model converges to a fixed point" teaches that each agent can use a learning method to determine its optimal policy (e.g. refine its agent function) based on maximizing the reward (e.g. in dependence on the reward) and that the learning method can be applied to each agent independently (e.g. doesn't depend on rewards from other agents) (i.e. the first agent policy (first agent function) uses a learning method to converge (refine) the optimal policy based on the first reward without using a second reward from a second agent)); and otherwise in a condition where the first determining step has a positive outcome, computing the second reward according to a predetermined reward function and refining the first agent function in dependence on the first reward and the second reward (Fig. 2; Fig. 3; [0073]-[0075]: "Actions a are taken at discrete time steps according to policy π, which maps states to actions. For example, actions a may be taken at discrete time steps t=0, 1, 2, . . . according to a policy π: X×A ->[0,1], which defines for each action the selection probability conditioned on the state. Each policy π has a corresponding action-value function, qπ(x, a), which gives the expected value of the return Gt conditioned on the state xϵX and action aϵA: qπ(x, a)=E{Gt|Xt=x, At=a, π}. A goal is to maximize the discounted sum of rewards, also referred to as the return: Gt:=Σk−1∞γk−1Rt+k … FIG. 2 illustrates an example SoC model for taking actions with respect to an environment (illustrated as Environment). From the perspective of the environment, the SoC model can act no different from flat agent: the model takes an action A (as illustrated, A) with respect to the environment and can receive a state X (as illustrated, X) of the environment. But beyond this perspective, the illustrated SoC model includes two agents illustrated as Agent 1 and Agent 2. An example task can be expanded into a system of communicating agents as follows. 
For each agent i (as illustrated, Agent 1 and Agent 2), an environment action-set Bi is defined (as illustrated, B1 and B2), as well as a communication action-set Ci (as illustrated, C1 and C2), and a learning objective. The learning objective can be defined by a reward function, ri, plus a discount factor, γi. An action-mapping function, ƒ: B1× . . . ×Bn -> a, which maps the joint environment-action space to an action of the flat agent, is also defined (as illustrated, ƒ). The agents share a common state-space Y (as illustrated, the dashed ellipse marked with Y) including the state-space of the flat agent plus the joint communication actions: Y:=x×C1× . . . ×Cn … At time t, each agent i, observes state Yt:=(Xt, ct−11, . . . , ct−1n)ϵY. At each time t, each agent i can also select environment action Bti and communication action cti ϵ ci, according to policy πi: Y -> Bi × Ci. Action at=ƒ(Bti, . . . Btn) is fed to the environment, which responds with an updated state xt+1. The environment also produces a reward Rt+1. In some examples, this reward is only used to measure the overall performance of the SoC model. For learning, each agent i uses its own reward function, ri: Y × Bi × Ci × Y -> R, to compute overall reward, Rt+1i=ri(Yt, Bt.i, cti, Yt+1)" teaches that learning (refining) the policy corresponding to an action-value function (agent function) is based on (in dependence on) the sum of rewards for the agents, wherein each agent computes its own reward (i.e. the second agent computes a second reward) and the learning for each agent is based on a reward function that computes an overall reward (i.e. the first agent function is refined based on the sum of the first reward from the first agent and the second reward from the second agent when it is determined to use the second reward)). Kvernvik et al. and Van Seijen et al. are analogous to the claimed invention because they are directed to the implementation of reinforcement learning for value functions. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate (ii) a first determining step comprising determining, using the second agent function, whether to use a second reward; (iii) in a condition where that first determining step has a negative outcome, refining the first agent function in dependence on the first reward; and otherwise in a condition where the first determining step has a positive outcome, computing the second reward according to a predetermined reward function and refining the first agent function in dependence on the first reward and the second reward as taught by Van Seijen et al. to the disclosed invention of Kvernvik et al. One of ordinary skill in the art would have been motivated to make this modification "because each component typically depends only on a subset of all features, the overall value function is much smoother and can be more easily approximated by a low-dimensional representation, enabling more effective learning" (Van Seijen et al. [0010]). Regarding Claim 2, Kvernvik et al. in view of Van Seijen et al. teaches the machine learning apparatus of claim 1. In addition, Van Seijen et al. further teaches wherein the first determining step comprises computing a binary value representing whether or not to use the second reward (Fig. 3; [0080]-[0081]: "As illustrated, the environment actions e (as illustrated, e1 through en) of the agents i can be fed into an aggregator function ƒ (as illustrated, ƒ). 
The aggregator function ƒ maps the environment actions en to an action aflat (as illustrated aflat). From the input space γ, each agent can receive a subset of the input space xi (as illustrated, x.sup.1 through xn). Formally, state space xi of an agent i is a projection of Y:=xflat×C1× . . . ×Cn onto a subspace of Y, such as: xi=σi(Y). … Additionally, each agent can have its own reward function, ri: xi× ai × xi -> R, and a discount factor γi: xi × ai × xi -> [0, 1], and can aim to find a policy πi: xi × ai -> [0,1] that maximizes the return based on these functions. In an example, Πi is defined to be the space of all policies for agent i" teaches that each agent determines its own policy based on the inputs to try to maximize the reward function, each policy being based on computing a binary value [0,1] to determine whether to use the policy or not, meaning that the reward for each policy is being determined to be used or not (e.g. the policy for the second agent (second agent function) is determined to be used based on computing a binary value [0,1], meaning the reward for the second agent (second reward) is determined whether to be used based on the policy of the second agent (second agent function))). Kvernvik et al. and Van Seijen et al. are analogous to the claimed invention because they are directed to the implementation of reinforcement learning for value functions. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein the first determining step comprises computing a binary value representing whether or not to use the second reward as taught by Van Seijen et al. to the disclosed invention of Kvernvik et al. One of ordinary skill in the art would have been motivated to make this modification "because each component typically depends only on a subset of all features, the overall value function is much smoother and can be more easily approximated by a low-dimensional representation, enabling more effective learning" (Van Seijen et al. [0010]). Regarding Claim 3, Kvernvik et al. in view of Van Seijen et al. teaches the machine learning apparatus of claim 1. In addition, Kvernvik et al. further teaches wherein the step of refining the second agent function is performed in dependence on an objective function, which comprises a negative cost element upon determining, on a respective iteration, that the determination of whether to use the second reward has a positive outcome ([0102]: "the step of allocating 802 may comprise allocating a parameter indicative of a negative reward (e.g. the first reinforcement learning agent receives negative feedback) when i) when the first wireless device roams to the second wireless access point in the second network ii) roaming to the second wireless access point decreases the connectivity of the first wireless device or iii) roaming leads to a loss of connectivity of the first wireless device and/or iv) when an inter-network operator handover procedure is performed" teaches that a parameter indicative of a negative reward (negative cost element) is allocated to the reward function when it is determined to use a second wireless device (e.g. when it is determined to use a second reward)). Regarding Claim 4, Kvernvik et al. in view of Van Seijen et al. teaches the machine learning apparatus of claim 1. In addition, Kvernvik et al. 
further teaches wherein the step of refining the second agent function comprises a second determining step comprising: determining whether the subsequent environmental state formed on the respective iteration is in a set of relatively infrequently visited states ([0105]: "a reward function may indicate that a reward of: i) “−0.1” should be allocated to the first reinforcement agent each time that a determination (e.g. action) results in roaming of the first wireless device; ii) “−1” if a transfer of service from one wireless access point to another results in the first wireless device losing coverage (e.g. if the first wireless device enters a black hole); iii) “−0.01” every time service is transferred from one wireless access point to another; and iv)+4 when the first wireless device reaches its destination. In this example, loss of connectivity is therefore the most highly penalized action. The skilled person will appreciate that these values are merely examples however and that the first reward function may comprise any other combination of rewards and reward values, the rewards and relative reward values being tuned according to the (optimization) goal" teaches that the reward function for a wireless device (e.g. second agent function for second reinforcement agent) is determined whether the determined action results in a state (subsequent environmental state) that has arrived at its destination (relatively infrequently visited state) (i.e. a state where it has reached its destination is relatively infrequent when compared to all other states that are at its initial state or in transit)), and wherein the step of refining the second agent function is performed in dependence on an objective function, which comprises a positive reward element in a condition where, on a respective iteration, that the second determining step has a positive outcome ([0105]: "a reward function may indicate that a reward of: i) “−0.1” should be allocated to the first reinforcement agent each time that a determination (e.g. action) results in roaming of the first wireless device; ii) “−1” if a transfer of service from one wireless access point to another results in the first wireless device losing coverage (e.g. if the first wireless device enters a black hole); iii) “−0.01” every time service is transferred from one wireless access point to another; and iv)+4 when the first wireless device reaches its destination. In this example, loss of connectivity is therefore the most highly penalized action. The skilled person will appreciate that these values are merely examples however and that the first reward function may comprise any other combination of rewards and reward values, the rewards and relative reward values being tuned according to the (optimization) goal" teaches that the reward function for a wireless device (e.g. second agent function for second reinforcement agent) when it is determined that the action results in a state (subsequent environmental state) that has arrived at its destination (i.e. second determining step has a positive outcome) has a positive reward element). Regarding Claim 5, Kvernvik et al. in view of Van Seijen et al. teaches the machine learning apparatus of claim 1. In addition, Van Seijen et al. further teaches wherein the one or more processors are configured to, in a condition where the outcome of the first determining step is positive, refine the first agent function in dependence on the sum of the first reward and the second reward (Fig. 2; Fig. 
3; [0073]-[0075]: "Actions a are taken at discrete time steps according to policy π, which maps states to actions. For example, actions a may be taken at discrete time steps t=0, 1, 2, . . . according to a policy π: X×A ->[0,1], which defines for each action the selection probability conditioned on the state. Each policy π has a corresponding action-value function, qπ(x, a), which gives the expected value of the return Gt conditioned on the state xϵX and action aϵA: qπ(x, a)=E{Gt|Xt=x, At=a, π}. A goal is to maximize the discounted sum of rewards, also referred to as the return: Gt:=Σk−1∞γk−1Rt+k … FIG. 2 illustrates an example SoC model for taking actions with respect to an environment (illustrated as Environment). From the perspective of the environment, the SoC model can act no different from flat agent: the model takes an action A (as illustrated, A) with respect to the environment and can receive a state X (as illustrated, X) of the environment. But beyond this perspective, the illustrated SoC model includes two agents illustrated as Agent 1 and Agent 2. An example task can be expanded into a system of communicating agents as follows. For each agent i (as illustrated, Agent 1 and Agent 2), an environment action-set Bi is defined (as illustrated, B1 and B2), as well as a communication action-set Ci (as illustrated, C1 and C2), and a learning objective. The learning objective can be defined by a reward function, ri, plus a discount factor, γi. An action-mapping function, ƒ: B1× . . . ×Bn -> a, which maps the joint environment-action space to an action of the flat agent, is also defined (as illustrated, ƒ). The agents share a common state-space Y (as illustrated, the dashed ellipse marked with Y) including the state-space of the flat agent plus the joint communication actions: Y:=x×C1× . . . ×Cn … At time t, each agent i, observes state Yt:=(Xt, ct−11, . . . , ct−1n)ϵY. At each time t, each agent i can also select environment action Bti and communication action cti ϵ ci, according to policy πi: Y -> Bi × Ci. Action at=ƒ(Bti, . . . Btn) is fed to the environment, which responds with an updated state xt+1. The environment also produces a reward Rt+1. In some examples, this reward is only used to measure the overall performance of the SoC model. For learning, each agent i uses its own reward function, ri: Y × Bi × Ci × Y -> R, to compute overall reward, Rt+1i=ri(Yt, Bt.i, cti, Yt+1)" teaches that learning (refining) the policy corresponding to an action-value function (agent function) is based on (in dependence on) the sum of rewards for the agents, wherein the learning for each agent is based on a reward function that computes an overall reward (i.e. the first agent function is refined based on the sum of the first reward from the first agent and the second reward from the second agent when it is determined to use the second reward)). Kvernvik et al. and Van Seijen et al. are analogous to the claimed invention because they are directed to the implementation of reinforcement learning for value functions. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein the one or more processors are configured to, in a condition where the outcome of the first determining step is positive, refine the first agent function in dependence on the sum of the first reward and the second reward as taught by Van Seijen et al. to the disclosed invention of Kvernvik et al. 
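For reference, the sketch below restates the claimed loop — steps (i) through (v) of claim 1, together with the claim 5 refinement on the sum of the first and second rewards — as illustrative Python. It is one possible reading of the claim language as characterized in this Office Action, not code from the application or from either cited reference; the names env, first_agent, second_agent, shaped_reward, and objective_metric are hypothetical.

def train(env, first_agent, second_agent, shaped_reward, objective_metric, num_iterations):
    """Illustrative sketch only; all collaborator objects are hypothetical."""
    state = env.initial_state()  # initial environment state
    for _ in range(num_iterations):
        # (i) implement the current state of the first agent function on the
        #     current environmental state to obtain a subsequent state and a first reward
        action = first_agent.act(state)
        next_state, first_reward = env.step(action)

        # (ii) first determining step: use the second agent function to decide
        #      whether a second (shaping) reward should be used on this iteration
        use_second_reward = second_agent.decide(state, next_state)

        if not use_second_reward:
            # (iii) negative outcome: refine the first agent function on the first reward alone
            first_agent.update(state, action, next_state, first_reward)
        else:
            # positive outcome: compute the second reward from the predetermined
            # reward function and refine on the sum of both rewards (cf. claim 5)
            second_reward = shaped_reward(state, action, next_state)
            first_agent.update(state, action, next_state, first_reward + second_reward)

        # (iv) refine the second agent function in dependence on how well the
        #      first agent function is meeting the predetermined objective
        second_agent.update(objective_metric(first_agent, env))

        # (v) adopt the subsequent environmental state as the current environmental state
        state = next_state

    # subsequently: output the current state of the first agent function
    # as the output value function
    return first_agent.value_function()

Under this reading, the second agent function acts as a learned gate on when a shaping reward is added, and the first agent function's value estimate is what is ultimately output.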
One of ordinary skill in the art would have been motivated to make this modification "because each component typically depends only on a subset of all features, the overall value function is much smoother and can be more easily approximated by a low-dimensional representation, enabling more effective learning" (Van Seijen et al. [0010]). Regarding Claim 6, Kvernvik et al. in view of Van Seijen et al. teaches the machine learning apparatus of claim 5. In addition, Van Seijen et al. further teaches wherein the reward function is such that summing the first reward and the second reward preserves pursuit of the objective (Fig. 2; Fig. 3; [0073]-[0075]: "Actions a are taken at discrete time steps according to policy π, which maps states to actions. For example, actions a may be taken at discrete time steps t=0, 1, 2, . . . according to a policy π: X×A ->[0,1], which defines for each action the selection probability conditioned on the state. Each policy π has a corresponding action-value function, qπ(x, a), which gives the expected value of the return Gt conditioned on the state xϵX and action aϵA: qπ(x, a)=E{Gt|Xt=x, At=a, π}. A goal is to maximize the discounted sum of rewards, also referred to as the return: Gt:=Σk−1∞γk−1Rt+k … FIG. 2 illustrates an example SoC model for taking actions with respect to an environment (illustrated as Environment). From the perspective of the environment, the SoC model can act no different from flat agent: the model takes an action A (as illustrated, A) with respect to the environment and can receive a state X (as illustrated, X) of the environment. But beyond this perspective, the illustrated SoC model includes two agents illustrated as Agent 1 and Agent 2. An example task can be expanded into a system of communicating agents as follows. For each agent i (as illustrated, Agent 1 and Agent 2), an environment action-set Bi is defined (as illustrated, B1 and B2), as well as a communication action-set Ci (as illustrated, C1 and C2), and a learning objective. The learning objective can be defined by a reward function, ri, plus a discount factor, γi. An action-mapping function, ƒ: B1× . . . ×Bn -> a, which maps the joint environment-action space to an action of the flat agent, is also defined (as illustrated, ƒ). The agents share a common state-space Y (as illustrated, the dashed ellipse marked with Y) including the state-space of the flat agent plus the joint communication actions: Y:=x×C1× . . . ×Cn … At time t, each agent i, observes state Yt:=(Xt, ct−11, . . . , ct−1n)ϵY. At each time t, each agent i can also select environment action Bti and communication action cti ϵ ci, according to policy πi: Y -> Bi × Ci. Action at=ƒ(Bti, . . . Btn) is fed to the environment, which responds with an updated state xt+1. The environment also produces a reward Rt+1. In some examples, this reward is only used to measure the overall performance of the SoC model. For learning, each agent i uses its own reward function, ri: Y × Bi × Ci × Y -> R, to compute overall reward, Rt+1i=ri(Yt, Bt.i, cti, Yt+1)" teaches that goal (objective) of the policy corresponding to an action-value function (agent function) is based on maximizing the sum of rewards for the agents, wherein the learning objective for each agent is based on a reward function that computes an overall reward (i.e. the pursuit of the learning objective is based on the sum of the first reward from the first agent and the second reward from the second agent)). Kvernvik et al. and Van Seijen et al. 
are analogous to the claimed invention because they are directed to the implementation of reinforcement learning for value functions. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein the reward function is such that summing the first reward and the second reward preserves pursuit of the objective as taught by Van Seijen et al. to the disclosed invention of Kvernvik et al. One of ordinary skill in the art would have been motivated to make this modification "because each component typically depends only on a subset of all features, the overall value function is much smoother and can be more easily approximated by a low-dimensional representation, enabling more effective learning" (Van Seijen et al. [0010]). Regarding Claim 7, Kvernvik et al. in view of Van Seijen et al. teaches the machine learning apparatus of claim 1. In addition, Van Seijen et al. further teaches wherein the one or more processors are configured to, on each iteration, compute the second reward only in a condition where the outcome of the first determining step is positive (Fig. 3; [0080]-[0081]: "As illustrated, the environment actions e (as illustrated, e1 through en) of the agents i can be fed into an aggregator function ƒ (as illustrated, ƒ). The aggregator function ƒ maps the environment actions en to an action aflat (as illustrated aflat). From the input space γ, each agent can receive a subset of the input space xi (as illustrated, x.sup.1 through xn). Formally, state space xi of an agent i is a projection of Y:=xflat×C1× . . . ×Cn onto a subspace of Y, such as: xi=σi(Y). … Additionally, each agent can have its own reward function, ri: xi× ai × xi -> R, and a discount factor γi: xi × ai × xi -> [0, 1], and can aim to find a policy πi: xi × ai -> [0,1] that maximizes the return based on these functions. In an example, Πi is defined to be the space of all policies for agent i" teaches that each agent determines its own policy based on the inputs to try to maximize the reward function, each policy being based on computing a value [0,1] to determine whether to use the policy or not, meaning that the reward for each policy is only computed when the policy is being used (e.g. the policy for the second agent (second agent function) is determined to be used based on computing a value [0,1], meaning the reward for the second agent (second reward) is computed only when the policy of the second agent (second agent function) is used (i.e. first determining step is positive))). Kvernvik et al. and Van Seijen et al. are analogous to the claimed invention because they are directed to the implementation of reinforcement learning for value functions. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein the one or more processors are configured to, on each iteration, compute the second reward only in a condition where the outcome of the first determining step is positive as taught by Van Seijen et al. to the disclosed invention of Kvernvik et al. One of ordinary skill in the art would have been motivated to make this modification "because each component typically depends only on a subset of all features, the overall value function is much smoother and can be more easily approximated by a low-dimensional representation, enabling more effective learning" (Van Seijen et al. [0010]). Regarding Claim 8, Kvernvik et al. in view of Van Seijen et al. 
teaches the machine learning apparatus of claim 1. In addition, Kvernvik et al. further teaches wherein the first reward is determined in dependence on the subsequent environmental state ([0054]-[0055]: "the determination may be based on (e.g. the input parameters or the current state input to the first reinforcement learning agent may comprise) an indication of a location of the first wireless device … As described above, when a reinforcement learning agent makes a determination (e.g. action), the first reinforcement learning agent receives a reward, based on the change that that action had on the system. In some embodiments, after an action (such as roaming) has been performed, the method may further comprise sending one or more observations (e.g. such as the connectivity or one or more quality metrics associated with the wireless access point that is serving the first wireless device) to the first reinforcement learning agent so that a reward may be determined from the observations. In some embodiments, the first reinforcement learning agent receives a reward by means of a parameter (e.g. a numerical value) indicative of the reward" teaches that the reward for the first reinforcement learning agent (e.g. the first reward) is determined based on observations of the environment after performing the action (subsequent environmental state)). Regarding Claim 12, Kvernvik et al. teaches a computer-implemented machine learning method for forming an output value function (Fig. 1; Fig. 2; [0050]: "FIG. 2 illustrates a method 200 that may be performed by a first wireless device, such as the first wireless device 100 described with respect to FIG. 1. In a first step 202, the method comprises acquiring a determination from a first reinforcement learning agent of whether to roam" teaches a method performed by a wireless device (computer) using a reinforcement learning (machine learning) agent. Fig. 1; [0038]-[0041]: "the first wireless device 100 is operative to (e.g. adapted to) acquire a determination from a first reinforcement learning agent 106 of whether to roam … As noted above, the use of reinforcement learning agents to determine whether to roam may allow for improved decision making in a moving device … Use of a reinforcement learning agent allows decisions to be updated (e.g. through learning and updating a model associated with the first reinforcement learning agent) dynamically as the environment changes, based on previous decisions (or actions) performed by the first reinforcement learning agent … reinforcement learning is a type of machine learning process whereby a reinforcement learning agent (e.g. algorithm) is used to take decisions (e.g. perform actions) on a system to adjust the system according to an objective (which may, for example, comprise moving the system towards an optimal or preferred state of the system) … a reinforcement learning agent receives an observation from the environment in state S and selects an action to maximize the expected future reward r. 
Based on the expected future rewards, a value function V for each state can be calculated and an optimal policy rr that maximizes the long term value function can be derived" teaches that the wireless device (apparatus) includes a reinforcement learning (machine learning) agent to calculate (form) a value function V (output value function) to achieve an objective (predetermined objective)), the method comprising: receiving an initial environment state, an initial state of a first agent function and an initial state of a second agent function ([0041]-[0042]: "a reinforcement learning agent receives an observation from the environment in state S and selects an action to maximize the expected future reward r. Based on the expected future rewards, a value function V for each state can be calculated and an optimal policy rr that maximizes the long term value function can be derived … In the context of this disclosure, the first wireless device and its surroundings (e.g. the system that the first wireless device is within) comprises the “environment” in the state S. The state may comprise the location and/or direction of travel of the first wireless device that may be derived from current and past information about the first wireless device … “Actions” performed by the reinforcement learning agents comprise the decisions or determinations made by the reinforcement agents as to whether a wireless device should roam from a first wireless access point to a second wireless access point. Generally, the reinforcement learning agents herein receive feedback in the form of a reward or credit assignment every time they make a determination (e.g. every time they instigate an action). A reward is allocated depending on the goal of the system" teaches receiving an initial environmental state and an initial observations (states) for the reinforcement learning agents (initial states of agent functions) to make determinations for actions based on calculated rewards. [0125]: "Observations of the environment comprise the observations of the “state”, as explained previously, and also comprise the components that make up the reward function (for example, the reward function or reward may be calculated based on the numerical values of the observations of the environment or numerical values representing the state). The reinforcement learning agent 1012 sits in an application 1010 in the cloud and receives state and reward information" teaches that the initial observations of the environment make up the initial state and initial components of the reward function for the reinforcement learning agents (e.g. initial states of the agent functions). [0074]: "In some examples, as will be discussed in more detail below, the first reinforcement learning agent shares a reward function with (e.g. is rewarded in the same way as) a second reinforcement learning agent, the second reinforcement learning agent being associated with a second wireless device" teaches that the reinforcement learning agents include a first reinforcement learning agent (includes first agent function) and a second reinforcement learning agent (includes second agent function) that share a reward function (e.g. initial observed states are used for the first and second agent functions)); iteratively performing the steps of: (i) implementing a current state of the first agent function in dependence on a current environmental state to form a subsequent environmental state and a first reward ([0054]-[0055]: "the determination may be based on (e.g. 
the input parameters or the current state input to the first reinforcement learning agent may comprise) an indication of a location of the first wireless device … As described above, when a reinforcement learning agent makes a determination (e.g. action), the first reinforcement learning agent receives a reward, based on the change that that action had on the system. In some embodiments, after an action (such as roaming) has been performed, the method may further comprise sending one or more observations (e.g. such as the connectivity or one or more quality metrics associated with the wireless access point that is serving the first wireless device) to the first reinforcement learning agent so that a reward may be determined from the observations. In some embodiments, the first reinforcement learning agent receives a reward by means of a parameter (e.g. a numerical value) indicative of the reward" teaches the first reinforcement learning agent (e.g. including the first agent function) taking a current state input (current environmental state) and determining an action to be performed, then the first reinforcement learning agent receives observations of the environment after performing the action (subsequent environmental state) and a reward based on the change the performed action had on the system (first reward). [0039]: "Use of a reinforcement learning agent allows decisions to be updated (e.g. through learning and updating a model associated with the first reinforcement learning agent) dynamically as the environment changes, based on previous decisions (or actions) performed by the first reinforcement learning agent. This means that as new cells (or wireless access points) are added or existing cells change or are updated, the decision making process may be automatically updated with no human intervention. This may ensure that optimal connectivity is achieved with minimal roaming, even under changing conditions" teaches that the reinforcement learning agents dynamically update (e.g. iteratively) the action decisions as the environment changes based on previous decisions through learning and updating a model associated with the reinforcement learning agents); (iv) refining the second agent function in dependence on a performance of the first agent function in meeting the predetermined objective ([0112]-[0114]: "the first reward function may be updated (or defined) through a machine learning process. For example, a machine learning algorithm may be used to determine the most appropriate groupings and/or the most appropriate reward function for wireless devices in a group according to the effect that different values of rewards have on the roaming behavior … many types of machine learning processes may be used to update the first reward function in this manner, including but not limited to the use of unsupervised methods such as clustering (e.g. k-means may be performed on the characteristics of each device) or supervised methods (e.g. such as the use of neural networks), if labelled data is available … The skilled person will appreciate that the teachings above may be applied to more than one group of wireless devices, each group having a different reward function. 
For example, in some embodiments, the method 800 may further comprise allocating a parameter indicative of a reward to a third reinforcement learning agent based on an action determined by the third reinforcement learning agent for a third wireless device, wherein the third wireless device is part of a second group of wireless devices. In this embodiment, allocating a parameter indicative of a reward to a third reinforcement learning agent may comprise allocating a parameter indicative of a reward using a second reward function, the second reward function being different to the first reward function … the second group of wireless devices may comprise any one of the types of groups of wireless devices listed above for the first group of wireless devices. In this way, rewards may efficiently be allocated to wireless devices in each group to achieve the optimal connectivity according to the needs/requirements of wireless devices in each group" teaches that a second reward function (second agent function) for a second group of wireless devices (e.g. second reinforcement agent) may be updated (refined) through a machine learning process along with (in dependence on) a first reward for a first group of wireless devices (first reinforcement agent) to efficiently allocate rewards for the wireless devices in each group to achieve the objective)); and (v) adopting the subsequent environmental state as the current environmental state ([0055]: "when a reinforcement learning agent makes a determination (e.g. action), the first reinforcement learning agent receives a reward, based on the change that that action had on the system. In some embodiments, after an action (such as roaming) has been performed, the method may further comprise sending one or more observations (e.g. such as the connectivity or one or more quality metrics associated with the wireless access point that is serving the first wireless device) to the first reinforcement learning agent so that a reward may be determined from the observations. In some embodiments, the first reinforcement learning agent receives a reward by means of a parameter (e.g. a numerical value) indicative of the reward" teaches the first reinforcement learning agent (e.g. including the first agent function) taking a current state input (current environmental state) and determining an action to be performed, then the first reinforcement learning agent receives observations of the environment after performing the action (subsequent environmental state) (e.g. the subsequent environmental state formed based on the performed action has become the observed current environmental state)); and subsequently: outputting the current state of the first agent function as the output value function ([0041]-[0042]: "a reinforcement learning agent receives an observation from the environment in state S and selects an action to maximize the expected future reward r. Based on the expected future rewards, a value function V for each state can be calculated and an optimal policy π that maximizes the long term value function can be derived … In the context of this disclosure, the first wireless device and its surroundings (e.g. the system that the first wireless device is within) comprises the “environment” in the state S. The state may comprise the location and/or direction of travel of the first wireless device that may be derived from current and past information about the first wireless device … “Actions” performed by the reinforcement learning agents comprise the decisions or determinations made by the reinforcement agents as to whether a wireless device should roam from a first wireless access point to a second wireless access point. Generally, the reinforcement learning agents herein receive feedback in the form of a reward or credit assignment every time they make a determination (e.g. every time they instigate an action). A reward is allocated depending on the goal of the system" teaches that the state of the reinforcement learning agent (current state of the first agent function) is calculated (output) as a value function V (output value function)). Kvernvik et al. does not appear to explicitly teach (ii) a first determining step comprising determining, using the second agent function, whether to use a second reward; (iii) in a condition where the first determining step has a negative outcome, refining the first agent function in dependence on the first reward; and otherwise in a condition where the first determining step has a positive outcome, computing the second reward according to a predetermined reward function and refining the first agent function in dependence on the first reward and the second reward. However, Van Seijen et al. teaches (ii) a first determining step comprising determining, using the second agent function, whether to use a second reward (Fig. 3; [0080]-[0081]: "As illustrated, the environment actions e (as illustrated, e^1 through e^n) of the agents i can be fed into an aggregator function ƒ (as illustrated, ƒ). The aggregator function ƒ maps the environment actions e^n to an action a_flat (as illustrated, a_flat). From the input space Y, each agent can receive a subset of the input space x^i (as illustrated, x^1 through x^n). Formally, state space x^i of an agent i is a projection of Y := x_flat × C^1 × … × C^n onto a subspace of Y, such as: x^i = σ^i(Y). … Additionally, each agent can have its own reward function, r^i: x^i × a^i × x^i → ℝ, and a discount factor γ^i: x^i × a^i × x^i → [0, 1], and can aim to find a policy π^i: x^i × a^i → [0, 1] that maximizes the return based on these functions. In an example, Π^i is defined to be the space of all policies for agent i" teaches that each agent determines its own policy based on the inputs to try to maximize the reward function, each policy being based on computing a value [0, 1] to determine whether to use the policy or not, meaning that the reward for each policy is being determined to be used or not (e.g. the policy for the second agent (second agent function) is determined to be used based on computing a value [0, 1], meaning the reward for the second agent (second reward) is determined whether to be used based on the policy of the second agent (second agent function)). [0073]: "Each policy π has a corresponding action-value function, q_π(x, a), which gives the expected value of the return G_t conditioned on the state x ∈ X and action a ∈ A" teaches that each policy corresponds to an action-value function for the agent); (iii) in a condition where the first determining step has a negative outcome, refining the first agent function in dependence on the first reward (Fig. 3; [0081]-[0082]: "Additionally, each agent can have its own reward function, r^i: x^i × a^i × x^i → ℝ, and a discount factor γ^i: x^i × a^i × x^i → [0, 1], and can aim to find a policy π^i: x^i × a^i → [0, 1] that maximizes the return based on these functions. In an example, Π^i is defined to be the space of all policies for agent i … Given a learning method that converges to the optimal policy on a single-agent MDP task, applying this method independently to each of the agents of the SoC model, the overall policy of the SoC model converges to a fixed point" teaches that each agent can use a learning method to determine its optimal policy (e.g. refine its agent function) based on maximizing the reward (e.g. in dependence on the reward) and that the learning method can be applied to each agent independently (e.g. doesn't depend on rewards from other agents) (i.e. the first agent policy (first agent function) uses a learning method to converge (refine) the optimal policy based on the first reward without using a second reward from a second agent)); and otherwise in a condition where the first determining step has a positive outcome, computing the second reward according to a predetermined reward function and refining the first agent function in dependence on the first reward and the second reward (Fig. 2; Fig. 3; [0073]-[0075]: "Actions a are taken at discrete time steps according to policy π, which maps states to actions. For example, actions a may be taken at discrete time steps t = 0, 1, 2, … according to a policy π: X × A → [0, 1], which defines for each action the selection probability conditioned on the state. Each policy π has a corresponding action-value function, q_π(x, a), which gives the expected value of the return G_t conditioned on the state x ∈ X and action a ∈ A: q_π(x, a) = E{G_t | X_t = x, A_t = a, π}. A goal is to maximize the discounted sum of rewards, also referred to as the return: G_t := Σ_{k=1}^{∞} γ^{k−1} R_{t+k} … FIG. 2 illustrates an example SoC model for taking actions with respect to an environment (illustrated as Environment). From the perspective of the environment, the SoC model can act no different from flat agent: the model takes an action A (as illustrated, A) with respect to the environment and can receive a state X (as illustrated, X) of the environment. But beyond this perspective, the illustrated SoC model includes two agents illustrated as Agent 1 and Agent 2. An example task can be expanded into a system of communicating agents as follows. For each agent i (as illustrated, Agent 1 and Agent 2), an environment action-set B^i is defined (as illustrated, B^1 and B^2), as well as a communication action-set C^i (as illustrated, C^1 and C^2), and a learning objective. The learning objective can be defined by a reward function, r^i, plus a discount factor, γ^i. An action-mapping function, ƒ: B^1 × … × B^n → a, which maps the joint environment-action space to an action of the flat agent, is also defined (as illustrated, ƒ). The agents share a common state-space Y (as illustrated, the dashed ellipse marked with Y) including the state-space of the flat agent plus the joint communication actions: Y := x × C^1 × … × C^n … At time t, each agent i observes state Y_t := (X_t, c_{t−1}^1, …, c_{t−1}^n) ∈ Y. At each time t, each agent i can also select environment action B_t^i and communication action c_t^i ∈ C^i, according to policy π^i: Y → B^i × C^i. Action a_t = ƒ(B_t^1, …, B_t^n) is fed to the environment, which responds with an updated state x_{t+1}. The environment also produces a reward R_{t+1}. In some examples, this reward is only used to measure the overall performance of the SoC model. For learning, each agent i uses its own reward function, r^i: Y × B^i × C^i × Y → ℝ, to compute overall reward, R_{t+1}^i = r^i(Y_t, B_t^i, c_t^i, Y_{t+1})" teaches that learning (refining) the policy corresponding to an action-value function (agent function) is based on (in dependence on) the sum of rewards for the agents, wherein each agent computes its own reward (i.e. the second agent computes a second reward) and the learning for each agent is based on a reward function that computes an overall reward (i.e. the first agent function is refined based on the sum of the first reward from the first agent and the second reward from the second agent when it is determined to use the second reward)). Kvernvik et al. and Van Seijen et al. are analogous to the claimed invention because they are directed to the implementation of reinforcement learning for value functions. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate (ii) a first determining step comprising determining, using the second agent function, whether to use a second reward; (iii) in a condition where the first determining step has a negative outcome, refining the first agent function in dependence on the first reward; and otherwise in a condition where the first determining step has a positive outcome, computing the second reward according to a predetermined reward function and refining the first agent function in dependence on the first reward and the second reward as taught by Van Seijen et al. to the disclosed invention of Kvernvik et al. One of ordinary skill in the art would have been motivated to make this modification "because each component typically depends only on a subset of all features, the overall value function is much smoother and can be more easily approximated by a low-dimensional representation, enabling more effective learning" (Van Seijen et al. [0010]). Regarding Claim 14, Kvernvik et al. in view of Van Seijen et al. teaches the apparatus of claim 1. In addition, Kvernvik et al. further teaches a computer-implemented data processing apparatus configured to receive an input and process that input using a function outputted as an output value function (Fig. 1; [0035]: "FIG. 1 shows a first wireless device 100 according to some embodiments herein. The first wireless device 100 is connected to a first wireless access point in a first wireless communications network. The first wireless communications network is operated by a first network operator. The first wireless device 100 comprises a processor 102 and a memory 104. The memory 104 contains instructions executable by the processor 102. The first wireless device 100 may be operative to perform the methods described herein" teaches a wireless device (apparatus) comprising a processor 102. Fig. 1; [0038]-[0041]: "the first wireless device 100 is operative to (e.g. adapted to) acquire a determination from a first reinforcement learning agent 106 of whether to roam … As noted above, the use of reinforcement learning agents to determine whether to roam may allow for improved decision making in a moving device … Use of a reinforcement learning agent allows decisions to be updated (e.g.
through learning and updating a model associated with the first reinforcement learning agent) dynamically as the environment changes, based on previous decisions (or actions) performed by the first reinforcement learning agent … reinforcement learning is a type of machine learning process whereby a reinforcement learning agent (e.g. algorithm) is used to take decisions (e.g. perform actions) on a system to adjust the system according to an objective (which may, for example, comprise moving the system towards an optimal or preferred state of the system) … a reinforcement learning agent receives an observation from the environment in state S and selects an action to maximize the expected future reward r. Based on the expected future rewards, a value function V for each state can be calculated and an optimal policy rr that maximizes the long term value function can be derived" teaches that the wireless device (apparatus) includes a reinforcement learning (machine learning) agent that receives an observation of the environment in state S (receives an input) and calculates a value function for the state (processes the input using an output value function)). Regarding Claim 15, Kvernvik et al. in view of Van Seijen et al. teaches the computer-implemented data processing apparatus of claim 14. In addition, Kvernvik et al. further teaches wherein the input is an input sensed from an environment in which the data processing apparatus is located (Fig. 1; [0038]-[0042]: "the first wireless device 100 is operative to (e.g. adapted to) acquire a determination from a first reinforcement learning agent 106 of whether to roam … As noted above, the use of reinforcement learning agents to determine whether to roam may allow for improved decision making in a moving device … Use of a reinforcement learning agent allows decisions to be updated (e.g. through learning and updating a model associated with the first reinforcement learning agent) dynamically as the environment changes, based on previous decisions (or actions) performed by the first reinforcement learning agent … reinforcement learning is a type of machine learning process whereby a reinforcement learning agent (e.g. algorithm) is used to take decisions (e.g. perform actions) on a system to adjust the system according to an objective (which may, for example, comprise moving the system towards an optimal or preferred state of the system) … a reinforcement learning agent receives an observation from the environment in state S and selects an action to maximize the expected future reward r. Based on the expected future rewards, a value function V for each state can be calculated and an optimal policy rr that maximizes the long term value function can be derived … In the context of this disclosure, the first wireless device and its surroundings (e.g. the system that the first wireless device is within) comprises the “environment” in the state S. The state may comprise the location and/or direction of travel of the first wireless device that may be derived from current and past information about the first wireless device … “Actions” performed by the reinforcement learning agents comprise the decisions or determinations made by the reinforcement agents as to whether a wireless device should roam from a first wireless access point to a second wireless access point. Generally, the reinforcement learning agents herein receive feedback in the form of a reward or credit assignment every time they make a determination (e.g. every time they instigate an action). 
A reward is allocated depending on the goal of the system" teaches that the wireless device (apparatus) includes a reinforcement learning (machine learning) agent that receives an observation of the environment in state S (receives an input), the input being sensed from an environment the wireless device (apparatus) is located. Fig. 1; [0045]: "Turning back to the first wireless device 100, the first wireless device may comprise a device capable, configured, arranged and/or operable to communicate wirelessly with network nodes and/or other wireless devices … Particular examples of such machines or devices are sensors" teaches that the wireless device (apparatus) can comprise sensors). Claims 9-11 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Kvernvik et al. (US 2020/0344682 A1) in view of Li et al. (US 10,789,810 B1). Regarding Claim 9, Kvernvik et al. teaches a machine learning apparatus, the machine learning apparatus comprising one or more processors configured to: form an output value function for achieving a predetermined objective (Fig. 1; [0035]: "FIG. 1 shows a first wireless device 100 according to some embodiments herein. The first wireless device 100 is connected to a first wireless access point in a first wireless communications network. The first wireless communications network is operated by a first network operator. The first wireless device 100 comprises a processor 102 and a memory 104. The memory 104 contains instructions executable by the processor 102. The first wireless device 100 may be operative to perform the methods described herein" teaches a wireless device (apparatus) comprising a processor 102. Fig. 1; [0038]-[0041]: "the first wireless device 100 is operative to (e.g. adapted to) acquire a determination from a first reinforcement learning agent 106 of whether to roam … As noted above, the use of reinforcement learning agents to determine whether to roam may allow for improved decision making in a moving device … Use of a reinforcement learning agent allows decisions to be updated (e.g. through learning and updating a model associated with the first reinforcement learning agent) dynamically as the environment changes, based on previous decisions (or actions) performed by the first reinforcement learning agent … reinforcement learning is a type of machine learning process whereby a reinforcement learning agent (e.g. algorithm) is used to take decisions (e.g. perform actions) on a system to adjust the system according to an objective (which may, for example, comprise moving the system towards an optimal or preferred state of the system) … a reinforcement learning agent receives an observation from the environment in state S and selects an action to maximize the expected future reward r. Based on the expected future rewards, a value function V for each state can be calculated and an optimal policy rr that maximizes the long term value function can be derived" teaches that the wireless device (apparatus) includes a reinforcement learning (machine learning) agent to calculate (form) a value function V (output value function) to achieve an objective (predetermined objective)) by iteratively learning successive candidates for the output value function ([0039]-[0041]: "As noted above, the use of reinforcement learning agents to determine whether to roam may allow for improved decision making in a moving device … Use of a reinforcement learning agent allows decisions to be updated (e.g. 
through learning and updating a model associated with the first reinforcement learning agent) dynamically as the environment changes, based on previous decisions (or actions) performed by the first reinforcement learning agent … reinforcement learning is a type of machine learning process whereby a reinforcement learning agent (e.g. algorithm) is used to take decisions (e.g. perform actions) on a system to adjust the system according to an objective (which may, for example, comprise moving the system towards an optimal or preferred state of the system) … a reinforcement learning agent receives an observation from the environment in state S and selects an action to maximize the expected future reward r. Based on the expected future rewards, a value function V for each state can be calculated and an optimal policy rr that maximizes the long term value function can be derived" teaches that the reinforcement learning agents dynamically update (e.g. iteratively) the action decisions as the environment changes based on previous decisions through learning and updating a model associated with the reinforcement learning agents, the reinforcement learning agents calculating a value function V (output value function) for each environmental state) in dependence on: (i) in each iteration, a first reward dependent on an environmental state determined by a current state of the output value function ([0054]-[0055]: "the determination may be based on (e.g. the input parameters or the current state input to the first reinforcement learning agent may comprise) an indication of a location of the first wireless device … As described above, when a reinforcement learning agent makes a determination (e.g. action), the first reinforcement learning agent receives a reward, based on the change that that action had on the system. In some embodiments, after an action (such as roaming) has been performed, the method may further comprise sending one or more observations (e.g. such as the connectivity or one or more quality metrics associated with the wireless access point that is serving the first wireless device) to the first reinforcement learning agent so that a reward may be determined from the observations. In some embodiments, the first reinforcement learning agent receives a reward by means of a parameter (e.g. a numerical value) indicative of the reward" teaches the first reinforcement learning agent taking a current state input (current environmental state) and determining an action to be performed, then the first reinforcement learning agent receives observations of the environment after performing the action (subsequent environmental state) and a reward based on the change the performed action had on the system (first reward). [0039]-[0041]: "As noted above, the use of reinforcement learning agents to determine whether to roam may allow for improved decision making in a moving device … Use of a reinforcement learning agent allows decisions to be updated (e.g. through learning and updating a model associated with the first reinforcement learning agent) dynamically as the environment changes, based on previous decisions (or actions) performed by the first reinforcement learning agent … reinforcement learning is a type of machine learning process whereby a reinforcement learning agent (e.g. algorithm) is used to take decisions (e.g. 
perform actions) on a system to adjust the system according to an objective (which may, for example, comprise moving the system towards an optimal or preferred state of the system) … a reinforcement learning agent receives an observation from the environment in state S and selects an action to maximize the expected future reward r. Based on the expected future rewards, a value function V for each state can be calculated and an optimal policy rr that maximizes the long term value function can be derived" teaches that the reinforcement learning agents dynamically update (e.g. iteratively) the action decisions as the environment changes based on previous decisions through learning and updating a model associated with the reinforcement learning agents, the reinforcement learning agents calculating a value function V (output value function) for the current environmental state). Kvernvik et al. does not appear to explicitly teach (ii) in at least some iterations, a second reward formed by a second value function; and learn the second value function over successive iterations. However, Li et al. teaches (ii) in at least some iterations, a second reward formed by a second value function (Fig. 5; Col. 18, line 30 - Col. 19, line 4: "The apparatus 500 can correspond to the embodiments described above, and the apparatus 500 includes the following: a first obtaining module 501 for obtaining, in a current iteration of a plurality of iterations, an action selection policy of a current state in the current iteration, wherein the action selection policy specifies a respective probability of selecting an action among a plurality of possible actions in the current state, wherein the current state results from a previous action taken by the execution device in a previous state, and each action of the plurality of possible actions leads to a respective next state if performed by the execution device when the execution device is in the current state … a fourth computing module 506 for computing a second reward for the current state based on the respective first rewards for the actions and the action selection policy of the current state in the next iteration; a determining module 508 for determining an action selection policy of the previous state in the next iteration based on the second reward for the current state" teaches a second reward for the current state being computed (formed) based on an action selection policy (second value function)); and learn the second value function over successive iterations (Fig. 5; Col. 18, line 30 - Col. 
19, line 4: "The apparatus 500 can correspond to the embodiments described above, and the apparatus 500 includes the following: a first obtaining module 501 for obtaining, in a current iteration of a plurality of iterations, an action selection policy of a current state in the current iteration, wherein the action selection policy specifies a respective probability of selecting an action among a plurality of possible actions in the current state, wherein the current state results from a previous action taken by the execution device in a previous state, and each action of the plurality of possible actions leads to a respective next state if performed by the execution device when the execution device is in the current state … a fourth computing module 506 for computing a second reward for the current state based on the respective first rewards for the actions and the action selection policy of the current state in the next iteration; a determining module 508 for determining an action selection policy of the previous state in the next iteration based on the second reward for the current state" teaches the action selection policy (second value function) being updated for a next iteration (e.g. learned over successive iterations) based on the second reward). Kvernvik et al. is analogous to the claimed invention because it is directed to the implementation of reinforcement learning for value functions. Li et al. is analogous to the claimed invention because it is directed to determining action policies for an execution device in an environment. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate (ii) in at least some iterations, a second reward formed by a second value function; and learn the second value function over successive iterations as taught by Li et al. to the disclosed invention of Kvernvik et al. One of ordinary skill in the art would have been motivated to make this modification to "improve computational efficiency and reduce the computational load of the CFR algorithm in finding the best strategies of the real-world scenarios modeled by the IIG" (Li et al. Col. 20, lines 53-55). Regarding Claim 10, Kvernvik et al. in view of Li et al. teaches the machine learning apparatus of claim 9. In addition, Kvernvik et al. further teaches wherein the subsequent environmental state is formed by a single iteration of the first agent function taking the current environmental state as input ([0054]-[0055]: "the determination may be based on (e.g. the input parameters or the current state input to the first reinforcement learning agent may comprise) an indication of a location of the first wireless device … As described above, when a reinforcement learning agent makes a determination (e.g. action), the first reinforcement learning agent receives a reward, based on the change that that action had on the system. In some embodiments, after an action (such as roaming) has been performed, the method may further comprise sending one or more observations (e.g. such as the connectivity or one or more quality metrics associated with the wireless access point that is serving the first wireless device) to the first reinforcement learning agent so that a reward may be determined from the observations. In some embodiments, the first reinforcement learning agent receives a reward by means of a parameter (e.g. a numerical value) indicative of the reward" teaches the first reinforcement learning agent (e.g. 
including the first agent function) taking a current state input (current environmental state) and determining an action to be performed, then the first reinforcement learning agent receives observations of the environment after performing the action (subsequent environmental state) (e.g. subsequent environmental state is formed based on current state being input to first agent)). Regarding Claim 11, Kvernvik et al. in view of Li et al. teaches the machine learning apparatus of claim 9. In addition, Kvernvik et al. further teaches wherein the performance of the first agent function in meeting the predetermined objective is formed in dependence on the subsequent environmental state and/or the current environmental state ([0054]-[0055]: "the determination may be based on (e.g. the input parameters or the current state input to the first reinforcement learning agent may comprise) an indication of a location of the first wireless device … As described above, when a reinforcement learning agent makes a determination (e.g. action), the first reinforcement learning agent receives a reward, based on the change that that action had on the system. In some embodiments, after an action (such as roaming) has been performed, the method may further comprise sending one or more observations (e.g. such as the connectivity or one or more quality metrics associated with the wireless access point that is serving the first wireless device) to the first reinforcement learning agent so that a reward may be determined from the observations. In some embodiments, the first reinforcement learning agent receives a reward by means of a parameter (e.g. a numerical value) indicative of the reward" teaches the first reinforcement learning agent (e.g. including the first agent function) taking a current state input (current environmental state) and determining an action to be performed, then the first reinforcement learning agent receives observations of the environment after performing the action (subsequent environmental state) and a reward based on the change the performed action had on the system (first reward). [0040]: "The reinforcement learning agent receives a reward based on whether the action changes the system in compliance with the objective (e.g. towards the preferred state), or against the objective (e.g. further away from the preferred state). The reinforcement learning agent therefore adjusts parameters in the system with the goal of maximizing the rewards received" teaches that the reward is based on whether the action changes the system in compliance with an objective (e.g. the reward for the agent is dependent on if the action of moving from the current state to the subsequent state moves towards a preferred state (predetermined objective))). Regarding Claim 13, Kvernvik et al. teaches a computer implemented machine learning method for forming an output value function for achieving a predetermined objective (Fig. 1; Fig. 2; [0050]: "FIG. 2 illustrates a method 200 that may be performed by a first wireless device, such as the first wireless device 100 described with respect to FIG. 1. In a first step 202, the method comprises acquiring a determination from a first reinforcement learning agent of whether to roam" teaches a method performed by a wireless device (computer) using a reinforcement learning (machine learning) agent. Fig. 1; [0038]-[0041]: "the first wireless device 100 is operative to (e.g. 
adapted to) acquire a determination from a first reinforcement learning agent 106 of whether to roam … As noted above, the use of reinforcement learning agents to determine whether to roam may allow for improved decision making in a moving device … Use of a reinforcement learning agent allows decisions to be updated (e.g. through learning and updating a model associated with the first reinforcement learning agent) dynamically as the environment changes, based on previous decisions (or actions) performed by the first reinforcement learning agent … reinforcement learning is a type of machine learning process whereby a reinforcement learning agent (e.g. algorithm) is used to take decisions (e.g. perform actions) on a system to adjust the system according to an objective (which may, for example, comprise moving the system towards an optimal or preferred state of the system) … a reinforcement learning agent receives an observation from the environment in state S and selects an action to maximize the expected future reward r. Based on the expected future rewards, a value function V for each state can be calculated and an optimal policy rr that maximizes the long term value function can be derived" teaches that the wireless device (apparatus) includes a reinforcement learning (machine learning) agent to calculate (form) a value function V (output value function) to achieve an objective (predetermined objective)), the method comprising: iteratively learning successive candidates for the output value function ([0039]-[0041]: "As noted above, the use of reinforcement learning agents to determine whether to roam may allow for improved decision making in a moving device … Use of a reinforcement learning agent allows decisions to be updated (e.g. through learning and updating a model associated with the first reinforcement learning agent) dynamically as the environment changes, based on previous decisions (or actions) performed by the first reinforcement learning agent … reinforcement learning is a type of machine learning process whereby a reinforcement learning agent (e.g. algorithm) is used to take decisions (e.g. perform actions) on a system to adjust the system according to an objective (which may, for example, comprise moving the system towards an optimal or preferred state of the system) … a reinforcement learning agent receives an observation from the environment in state S and selects an action to maximize the expected future reward r. Based on the expected future rewards, a value function V for each state can be calculated and an optimal policy rr that maximizes the long term value function can be derived" teaches that the reinforcement learning agents dynamically update (e.g. iteratively) the action decisions as the environment changes based on previous decisions through learning and updating a model associated with the reinforcement learning agents, the reinforcement learning agents calculating a value function V (output value function) for each environmental state) in dependence on: (i) in each iteration, a first reward dependent on an environmental state determined by a current state of the output value function ([0054]-[0055]: "the determination may be based on (e.g. the input parameters or the current state input to the first reinforcement learning agent may comprise) an indication of a location of the first wireless device … As described above, when a reinforcement learning agent makes a determination (e.g. 
action), the first reinforcement learning agent receives a reward, based on the change that that action had on the system. In some embodiments, after an action (such as roaming) has been performed, the method may further comprise sending one or more observations (e.g. such as the connectivity or one or more quality metrics associated with the wireless access point that is serving the first wireless device) to the first reinforcement learning agent so that a reward may be determined from the observations. In some embodiments, the first reinforcement learning agent receives a reward by means of a parameter (e.g. a numerical value) indicative of the reward" teaches the first reinforcement learning agent taking a current state input (current environmental state) and determining an action to be performed, then the first reinforcement learning agent receives observations of the environment after performing the action (subsequent environmental state) and a reward based on the change the performed action had on the system (first reward). [0039]-[0041]: "As noted above, the use of reinforcement learning agents to determine whether to roam may allow for improved decision making in a moving device … Use of a reinforcement learning agent allows decisions to be updated (e.g. through learning and updating a model associated with the first reinforcement learning agent) dynamically as the environment changes, based on previous decisions (or actions) performed by the first reinforcement learning agent … reinforcement learning is a type of machine learning process whereby a reinforcement learning agent (e.g. algorithm) is used to take decisions (e.g. perform actions) on a system to adjust the system according to an objective (which may, for example, comprise moving the system towards an optimal or preferred state of the system) … a reinforcement learning agent receives an observation from the environment in state S and selects an action to maximize the expected future reward r. Based on the expected future rewards, a value function V for each state can be calculated and an optimal policy rr that maximizes the long term value function can be derived" teaches that the reinforcement learning agents dynamically update (e.g. iteratively) the action decisions as the environment changes based on previous decisions through learning and updating a model associated with the reinforcement learning agents, the reinforcement learning agents calculating a value function V (output value function) for the current environmental state). Kvernvik et al. does not appear to explicitly teach (ii) in at least some iterations, a second reward formed by a second value function; and learning the second value function over successive iterations. However, Li et al. teaches (ii) in at least some iterations, a second reward formed by a second value function (Fig. 5; Col. 18, line 30 - Col. 
19, line 4: "The apparatus 500 can correspond to the embodiments described above, and the apparatus 500 includes the following: a first obtaining module 501 for obtaining, in a current iteration of a plurality of iterations, an action selection policy of a current state in the current iteration, wherein the action selection policy specifies a respective probability of selecting an action among a plurality of possible actions in the current state, wherein the current state results from a previous action taken by the execution device in a previous state, and each action of the plurality of possible actions leads to a respective next state if performed by the execution device when the execution device is in the current state … a fourth computing module 506 for computing a second reward for the current state based on the respective first rewards for the actions and the action selection policy of the current state in the next iteration; a determining module 508 for determining an action selection policy of the previous state in the next iteration based on the second reward for the current state" teaches a second reward for the current state being computed (formed) based on an action selection policy (second value function)); and learning the second value function over successive iterations (Fig. 5; Col. 18, line 30 - Col. 19, line 4: "The apparatus 500 can correspond to the embodiments described above, and the apparatus 500 includes the following: a first obtaining module 501 for obtaining, in a current iteration of a plurality of iterations, an action selection policy of a current state in the current iteration, wherein the action selection policy specifies a respective probability of selecting an action among a plurality of possible actions in the current state, wherein the current state results from a previous action taken by the execution device in a previous state, and each action of the plurality of possible actions leads to a respective next state if performed by the execution device when the execution device is in the current state … a fourth computing module 506 for computing a second reward for the current state based on the respective first rewards for the actions and the action selection policy of the current state in the next iteration; a determining module 508 for determining an action selection policy of the previous state in the next iteration based on the second reward for the current state" teaches the action selection policy (second value function) being updated for a next iteration (e.g. learned over successive iterations) based on the second reward). Kvernvik et al. is analogous to the claimed invention because it is directed to the implementation of reinforcement learning for value functions. Li et al. is analogous to the claimed invention because it is directed to determining action policies for an execution device in an environment. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate (ii) in at least some iterations, a second reward formed by a second value function; and learning the second value function over successive iterations as taught by Li et al. to the disclosed invention of Kvernvik et al. One of ordinary skill in the art would have been motivated to make this modification to "improve computational efficiency and reduce the computational load of the CFR algorithm in finding the best strategies of the real-world scenarios modeled by the IIG" (Li et al. Col. 20, lines 53-55). 
Conclusion Any inquiry concerning this communication or earlier communications from the examiner should be directed to BRIAN J HALES whose telephone number is (571)272-0878. The examiner can normally be reached M-F 9:00am - 5:00pm. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kamran Afshar can be reached at (571) 272-7796. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /BRIAN J HALES/Examiner, Art Unit 2125 /KAMRAN AFSHAR/Supervisory Patent Examiner, Art Unit 2125
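For reference while reviewing the analysis above, the following hedged sketch shows the overall shape of the independent-claim limitation that the rejection reads onto the Kvernvik/Van Seijen combination: on every iteration the first agent function is refined with a first reward from the environment, and only when a second agent function so indicates is a second reward computed by a predetermined reward function and folded into the update. All identifiers are hypothetical; this is not code from the application or from either cited reference.

```python
# Hypothetical sketch of the claimed two-reward refinement; identifiers are
# illustrative and are not taken from the application or the cited references.
from collections import defaultdict
import random

GAMMA = 0.9           # discount factor
ALPHA = 0.1           # learning rate
GATE_THRESHOLD = 0.5  # second agent function "approves" the second reward above this


def predetermined_reward(state, next_state):
    """Stand-in for the claim's predetermined reward function (second reward)."""
    return 1.0 if next_state > state else 0.0


def toy_env_step(state):
    """Toy environment: move up or stay, return the next state and a first reward."""
    next_state = min(state + random.choice([0, 1]), 10)
    return next_state, random.random()


def learn(episodes=200):
    v1 = defaultdict(float)  # first agent function -> eventual output value function
    v2 = defaultdict(float)  # second agent function, learned alongside
    state = 0
    for _ in range(episodes):
        next_state, first_reward = toy_env_step(state)     # (i) first reward
        if v2[state] > GATE_THRESHOLD:                      # (ii) first determining step
            # positive outcome: compute the second reward per the predetermined
            # reward function and refine on both rewards
            target = first_reward + predetermined_reward(state, next_state)
        else:                                               # (iii) negative outcome
            target = first_reward                           # refine on first reward only
        v1[state] += ALPHA * (target + GAMMA * v1[next_state] - v1[state])
        v2[state] += ALPHA * (first_reward + GAMMA * v2[next_state] - v2[state])
        state = next_state                                  # (v) adopt subsequent state
    return dict(v1)                                         # output value function
```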

Prosecution Timeline

Aug 04, 2023
Application Filed
Apr 02, 2026
Non-Final Rejection — §101, §103, §112 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12572788
WEIGHT CONFIRMATION METHOD FOR AN ANALOG SYNAPTIC DEVICE OF AN ARTIFICIAL NEURAL NETWORK
2y 5m to grant · Granted Mar 10, 2026
Patent 12547910
DISTRIBUTING STRUCTURE RISK ASSESSMENT USING INFORMATION DISTRIBUTION STATIONS
2y 5m to grant · Granted Feb 10, 2026
Patent 12493796
USING GENERATIVE ADVERSARIAL NETWORKS TO CONSTRUCT REALISTIC COUNTERFACTUAL EXPLANATIONS FOR MACHINE LEARNING MODELS
2y 5m to grant · Granted Dec 09, 2025
Patent 12475369
BUILDING AND EXECUTING DEEP LEARNING-BASED DATA PIPELINES
2y 5m to grant · Granted Nov 18, 2025
Patent 12450468
PHYSICS AUGMENTED NEURAL NETWORKS CONFIGURED FOR OPERATING IN ENVIRONMENTS THAT MIX ORDER AND CHAOS
2y 5m to grant · Granted Oct 21, 2025
Study what changed to get past this examiner, based on the 5 most recent grants.

Prosecution Projections

1-2
Expected OA Rounds
77%
Grant Probability
99%
With Interview (+32.0%)
4y 0m
Median Time to Grant
Low
PTA Risk
Based on 84 resolved cases by this examiner. Grant probability derived from career allow rate.
