DETAILED ACTION
This action is responsive to the amendment filed on 10/28/2025. Claims 1-19 are pending in the case. Claims 1-9 are currently amended. Claims 1, 5, and 6 are independent claims.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
Acknowledgment is made of applicant’s claim for foreign priority based on international patent application no. PCT/JP2020/011465 filed on 03/16/2020.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 1-19 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Regarding claim 1, the claim recites “the determined control” on line 13. The claim also recites “the control”. It is unclear if applicant is attempting to recite a new claim element or attempting to refer to a previously recited claim element. Thus, the claim is indefinite for failing to particularly point out and distinctly claim the subject matter. For examination purposes this claim element has been interpreted to mean “the control”, referring to a previously recited claim element.
Further, the claim recites “the determined difficulty” on lines 17-18. The claim also recites “difficulty to be set to the target system” and “difficulty of the control”. It is unclear if applicant is attempting to recite a new claim element or attempting to refer to a previously recited claim element. Thus, the claim is indefinite for failing to particularly point out and distinctly claim the subject matter. For examination purposes this claim element has been interpreted to mean “the difficulty to be set to the target system”, referring to a previously recited claim element.
Further, claim 1 recites: “an original evaluation” on line 17. The claim also recites: “a plurality of original evaluations of states” on lines 12-13. It is unclear if applicant is attempting to recite a new claim element or attempting to refer to a previously recited claim element. Thus, the claim is indefinite for failing to particularly point out and distinctly claim the subject matter. For examination purposes, this claim element has been interpreted to mean “an original evaluation of the plurality of original evaluations of states”.
Claims 2-3 are rejected as being dependent upon a rejected base claim without curing any of the deficiencies.
Regarding claim 4, the claim recites: “the number of processing times” on lines 3-4. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, this limitation has been interpreted to mean “a number of processing times”.
Regarding claim 5, the claim recites “the determined difficulty” on lines 11, 14, and 17. The claim also recites “difficulty to be set to the target system” and “difficulty of the control”. It is unclear if applicant is attempting to recite a new claim element or attempting to refer to a previously recited claim element. Thus, the claim is indefinite for failing to particularly point out and distinctly claim the subject matter. For examination purposes this claim element has been interpreted to mean “the difficulty to be set to the target system”, referring to a previously recited claim element.
Further, claim 5 recites “an original evaluation” on line 14. The claim also recites: “a plurality of original evaluations of states” on lines 9-10. It is unclear if applicant is attempting to recite a new claim element or attempting to refer to a previously recited claim element. Thus, the claim is indefinite for failing to particularly point out and distinctly claim the subject matter. For examination purposes, this claim element has been interpreted to mean “an original evaluation of the plurality of original evaluations of states”.
Regarding claim 6, the claim recites “the determined difficulty” on lines 14, 18, and 20. The claim also recites “difficulty to be set to the target system” and “difficulty of the control”. It is unclear if applicant is attempting to recite a new claim element or attempting to refer to a previously recited claim element. Thus, the claim is indefinite for failing to particularly point out and distinctly claim the subject matter. For examination purposes this claim element has been interpreted to mean “the difficulty to be set to the target system”, referring to a previously recited claim element.
Further, claim 6 recites “an original evaluation” on line 17. The claim also recites: “a plurality of original evaluations of states” on lines 12-13. It is unclear if applicant is attempting to recite a new claim element or attempting to refer to a previously recited claim element. Thus, the claim is indefinite for failing to particularly point out and distinctly claim the subject matter. For examination purposes, this claim element has been interpreted to mean “an original evaluation of the plurality of original evaluations of states”.
Claim 7 is rejected as being dependent upon a rejected base claim without curing any of the deficiencies.
Regarding claim 8, the claim recites “the number of processing times” on lines 3-4. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, this limitation has been interpreted to mean “a number of processing times”.
Regarding claim 9, the claim recites “the number of processing times” on lines 3-4. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, this limitation has been interpreted to mean “a number of processing times”.
Claim 10 is rejected as being dependent upon a rejected base claim without curing any of the deficiencies.
Regarding claim 11, the claim recites: “the determined difficulty” on line 3. The parent claim recites “difficulty to be set to the target system” and “difficulty of the control”. It is unclear if applicant is attempting to recite a new claim element or attempting to refer to a previously recited claim element. For examination purposes this claim element has been interpreted to mean “the difficulty to be set to the target system”, referring to a previously recited claim element.
Regarding claim 12, the claim recites: “the determined difficulty” on line 4. The parent claim recites “difficulty to be set to the target system” and “difficulty of the control”. It is unclear if applicant is attempting to recite a new claim element or attempting to refer to a previously recited claim element. For examination purposes this claim element has been interpreted to mean “the difficulty to be set to the target system”, referring to a previously recited claim element.
Claim 13 is rejected as being dependent upon a rejected base claim without curing any of the deficiencies.
Regarding claim 14, the claim recites: “the determined difficulty” on lines 3-4. The parent claim recites “difficulty to be set to the target system” and “difficulty of the control”. It is unclear if applicant is attempting to recite a new claim element or attempting to refer to a previously recited claim element. For examination purposes this claim element has been interpreted to mean “the difficulty to be set to the target system”, referring to a previously recited claim element.
Regarding claim 15, the claim recites: “the determined difficulty” on line 4. The parent claim recites “difficulty to be set to the target system” and “difficulty of the control”. It is unclear if applicant is attempting to recite a new claim element or attempting to refer to a previously recited claim element. For examination purposes this claim element has been interpreted to mean “the difficulty to be set to the target system”, referring to a previously recited claim element.
Regarding claim 16, the term “small” is a relative term which renders the claim indefinite. The term “small” is not defined by the claim, the specification does not provide a standard for ascertaining the requisite degree, and one of ordinary skill in the art would not be reasonably apprised of the scope of the invention. The "decrease rate" has been rendered indefinite by the use of "small".
Further, the term “large” in claim 16 is a relative term which renders the claim indefinite. The term “large” is not defined by the claim, the specification does not provide a standard for ascertaining the requisite degree, and one of ordinary skill in the art would not be reasonably apprised of the scope of the invention. The "increase rate" is rendered indefinite by the use of "large".
Further, the term “low” is a relative term which renders the claim indefinite. The term “low” is not defined by the claim, the specification does not provide a standard for ascertaining the requisite degree, and one of ordinary skill in the art would not be reasonably apprised of the scope of the invention. The "learning progress" has been rendered indefinite by the use of the term "low".
Regarding claim 17, the term “larger” is a relative term which renders the claim indefinite. The term “larger” is not defined by the claim, the specification does not provide a standard for ascertaining the requisite degree, and one of ordinary skill in the art would not be reasonably apprised of the scope of the invention. The "decrease rate" has been rendered indefinite by the use of "larger".
Further, the term “smaller” in claim 17 is a relative term which renders the claim indefinite. The term “smaller” is not defined by the claim, the specification does not provide a standard for ascertaining the requisite degree, and one of ordinary skill in the art would not be reasonably apprised of the scope of the invention. The "increase rate" is rendered indefinite by the use of "smaller".
Further, the term “high” in claim 17 is a relative term which renders the claim indefinite. The term “high” is not defined by the claim, the specification does not provide a standard for ascertaining the requisite degree, and one of ordinary skill in the art would not be reasonably apprised of the scope of the invention. The "learning progress" is rendered indefinite by the use of "high".
Claims 18-19 are rejected as being dependent upon a rejected base claim without curing any of the deficiencies.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-19 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Regarding claim 1:
Step 1 Statutory Category: Claim 1 is directed to a device, which falls under one of the four statutory categories.
Step 2A Prong 1 Judicial Exception: Claim 1 recites, in part, “determine control to be applied to the target system and difficulty to be set to the target system using observation information which is obtainable from the individual devices of the robot and numerical values of elements for determining difficulty of the control, according to the policy”. This limitation, under the broadest reasonable interpretation, covers the recitation of a mental process that can practically be performed in the human mind, with or without the use of a physical aid such as pen and paper (including an observation, evaluation, judgment, opinion), in this case a judgment. See MPEP § 2106.04(a)(2)(III). Further, the claim recites “calculate learning progress of the policy using a plurality of original evaluations of states before and after transition of the target system and the determined control, according to the control and the difficulty to be set to the target system”. This limitation, under the broadest reasonable interpretation, covers the recitation of a mathematical calculation: “a claim that recites a mathematical calculation, when the claim is given its broadest reasonable interpretation in light of the specification, will be considered as falling within the "mathematical concepts" grouping. A mathematical calculation is a mathematical operation (such as multiplication) or an act of calculating using mathematical methods to determine a variable or number”. See MPEP § 2106.04(a)(2)(I)(C). Further, the claim recites: “calculate a moving average or a moving standard deviation of a plurality of accumulated non-adjusted rewards as the learning progress”. This limitation, under the broadest reasonable interpretation, likewise covers the recitation of a mathematical calculation. See MPEP § 2106.04(a)(2)(I)(C). Further, the claim recites “calculate revised evaluation using an original evaluation, the determined difficulty, and the learning progress”. This limitation, under the broadest reasonable interpretation, likewise covers the recitation of a mathematical calculation. See MPEP § 2106.04(a)(2)(I)(C). Further, the claim recites “update the policy using the observation information, the control, the difficulty to be set to the target system, and the revised evaluation”. This limitation, under the broadest reasonable interpretation, covers the recitation of mathematical concepts. See MPEP § 2106.04(a)(2)(I).
Step 2A Prong 2 Integration into a Practical Application: This judicial exception is not integrated into a practical application. In particular, the claim recites: “a learning device”, “a memory storing software instructions”, and “one or more processors configured to execute the software instructions”. These additional elements amount to no more than adding the words “apply it” (or an equivalent) to the judicial exception, mere instructions to implement an abstract idea on a computer, or mere use of a computer in its ordinary capacity as a tool to perform an existing process. See MPEP § 2106.05(f). Further, the claim recites: “a robot system controlling individual devices of a robot”. This additional element generally links the use of the judicial exception to a particular technological environment or field of use. See MPEP § 2106.05(h).
Step 2B Significantly More: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional elements “a learning device”, “a memory storing software instructions”, and “one or more processors configured to execute the software instructions” amount to no more than adding the words “apply it” (or an equivalent) to the judicial exception, mere instructions to implement an abstract idea on a computer, or mere use of a computer in its ordinary capacity as a tool to perform an existing process; such elements cannot provide an inventive concept. See MPEP § 2106.05(f). Further, the additional element “a robot system controlling individual devices of a robot” generally links the use of the judicial exception to a particular technological environment or field of use; such an element likewise cannot provide an inventive concept. See MPEP § 2106.05(h). The claim is not patent eligible.
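Solely to illustrate the generic character of the recited moving-average/moving-standard-deviation calculation of the kind described in MPEP § 2106.04(a)(2)(I)(C), the computation can be expressed with elementary arithmetic. The following sketch is illustrative only; the function and variable names are hypothetical and do not appear in the claims.

```python
from collections import deque
import statistics

def learning_progress(rewards, window=10):
    """Moving average and moving (population) standard deviation over the
    most recent `window` accumulated, non-adjusted rewards."""
    recent = deque(rewards, maxlen=window)  # retain only the last `window` values
    avg = statistics.fmean(recent)
    std = statistics.pstdev(recent) if len(recent) > 1 else 0.0
    return avg, std
```

Each step is an ordinary arithmetic operation (summation, division, square root) of the sort that can be carried out with pen and paper for any modest window size.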
Regarding claim 2, the rejection of claim 1 is incorporated, and further, the claim recites, “determine the control and the difficulty to be set to the target system further using the learning progress”. This limitation recites mental processes in addition to those identified in the rejection of the parent claim; thus, the claim recites a judicial exception.
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible.
Regarding claim 3, the rejection of claim 1 is incorporated, and further, the claim recites: “determine the difficulty to be set to the target system, using the learning progress and the difficulty to be set to the target system”. This limitation is a continuation of the “determine control to be applied to the target system and difficulty to be set to the target system using observation information which is obtainable from the individual devices of the robot and numerical values of elements for determining difficulty of the control, according to the policy” limitation identified as an abstract idea in the rejection of the parent claim. Thus, the claim recites a judicial exception.
The claim also recites, “the one or more processors are further configured to execute the software instructions to”. This additional element amounts to no more than adding the words “apply it” (or an equivalent) to the judicial exception, mere instructions to implement an abstract idea on a computer, or mere use of a computer in its ordinary capacity as a tool to perform an existing process; such an element cannot provide an inventive concept. See MPEP § 2106.05(f). The claim is not patent eligible.
Regarding claim 4, the rejection of claim 1 is incorporated, and further, the claim recites: “repeatedly execute a calculation process of the difficulty to be set to the target system until the number of processing times exceeds a threshold value”. This limitation recites mathematical concepts in addition to those identified in the rejection of claim 1; thus, the claim recites a judicial exception.
The claim also recites, “the one or more processors are further configured to execute the software instructions to”. This additional element amounts to no more than adding the words “apply it” (or an equivalent) to the judicial exception, mere instructions to implement an abstract idea on a computer, or mere use of a computer in its ordinary capacity as a tool to perform an existing process; such an element cannot provide an inventive concept. See MPEP § 2106.05(f). The claim is not patent eligible.
Regarding claim 5:
Step 1 Statutory Category: Claim 5 is directed to a method, which falls under one of the four statutory categories.
Step 2A Prong 1 Judicial Exception: Claim 5 recites, in part, “determining control to be applied to the target system and difficulty to be set to the target system using observation information which is obtainable from the individual devices of the robot and numerical values of elements for determining difficulty of the control, according to the policy”. This limitation, under the broadest reasonable interpretation, covers the recitation of a mental process that can practically be performed in the human mind, with or without the use of a physical aid such as pen and paper (including an observation, evaluation, judgment, opinion), in this case a judgment. See MPEP § 2106.04(a)(2)(III). Further, the claim recites “calculating learning progress of the policy using a plurality of original evaluations of states before and after transition of the target system and the control, according to the control and the determined difficulty”. This limitation, under the broadest reasonable interpretation, covers the recitation of a mathematical calculation: “a claim that recites a mathematical calculation, when the claim is given its broadest reasonable interpretation in light of the specification, will be considered as falling within the "mathematical concepts" grouping. A mathematical calculation is a mathematical operation (such as multiplication) or an act of calculating using mathematical methods to determine a variable or number”. See MPEP § 2106.04(a)(2)(I)(C). Further, the claim recites: “calculating a moving average or a moving standard deviation of a plurality of accumulated non-adjusted rewards as the learning progress”. This limitation, under the broadest reasonable interpretation, likewise covers the recitation of a mathematical calculation. See MPEP § 2106.04(a)(2)(I)(C). Further, the claim recites “calculating revised evaluation using an original evaluation, the determined difficulty, and the learning progress”. This limitation, under the broadest reasonable interpretation, likewise covers the recitation of a mathematical calculation. See MPEP § 2106.04(a)(2)(I)(C). Further, the claim recites “updating the policy using the observation information, the control, the determined difficulty, and the revised evaluation”. This limitation, under the broadest reasonable interpretation, covers the recitation of mathematical concepts. See MPEP § 2106.04(a)(2)(I).
Step 2A Prong 2 Integration into a Practical Application: This judicial exception is not integrated into a practical application. In particular, the claim recites: “a processor”. This additional element amounts to no more than adding the words “apply it” (or an equivalent) to the judicial exception, mere instructions to implement an abstract idea on a computer, or mere use of a computer in its ordinary capacity as a tool to perform an existing process. See MPEP § 2106.05(f). Further, the claim recites: “a robot system controlling individual devices that make up a robot”. This additional element generally links the use of the judicial exception to a particular technological environment or field of use. See MPEP § 2106.05(h).
Step 2B Significantly More: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element “a processor” amounts to no more than adding the words “apply it” (or an equivalent) to the judicial exception, mere instructions to implement an abstract idea on a computer, or mere use of a computer in its ordinary capacity as a tool to perform an existing process; such an element cannot provide an inventive concept. See MPEP § 2106.05(f). Further, the additional element “a robot system controlling individual devices that make up a robot” generally links the use of the judicial exception to a particular technological environment or field of use; such an element likewise cannot provide an inventive concept. See MPEP § 2106.05(h). The claim is not patent eligible.
Regarding claim 6:
Step 1 Statutory Category: Claim 6 is directed to a machine, which falls under one of the four statutory categories.
Step 2A Prong 1 Judicial Exception: Claim 6 recites, in part, “a process of determining control to be applied to the target system and difficulty to be set to the target system using observation information which is obtainable from the individual devices of the robot and numerical values of elements for determining difficulty of the control, according to the policy”. This limitation, under the broadest reasonable interpretation, covers the recitation of a mental process that can practically be performed in the human mind, with or without the use of a physical aid such as pen and paper (including an observation, evaluation, judgment, opinion), in this case a judgment. See MPEP § 2106.04(a)(2)(III). Further, the claim recites “a process of calculating learning progress of the policy using a plurality of original evaluations of states before and after transition of the target system and the control, according to the control and the determined difficulty”. This limitation, under the broadest reasonable interpretation, covers the recitation of a mathematical calculation: “a claim that recites a mathematical calculation, when the claim is given its broadest reasonable interpretation in light of the specification, will be considered as falling within the "mathematical concepts" grouping. A mathematical calculation is a mathematical operation (such as multiplication) or an act of calculating using mathematical methods to determine a variable or number”. See MPEP § 2106.04(a)(2)(I)(C). Further, the claim recites: “a process of calculating a moving average or a moving standard deviation of a plurality of accumulated non-adjusted rewards as the learning progress”. This limitation, under the broadest reasonable interpretation, likewise covers the recitation of a mathematical calculation. See MPEP § 2106.04(a)(2)(I)(C). Further, the claim recites “a process of calculating revised evaluation using an original evaluation, the determined difficulty, and the learning progress”. This limitation, under the broadest reasonable interpretation, likewise covers the recitation of a mathematical calculation. See MPEP § 2106.04(a)(2)(I)(C). Further, the claim recites “a process of updating the policy using the observation information, the control, the determined difficulty, and the revised evaluation”. This limitation, under the broadest reasonable interpretation, covers the recitation of mathematical concepts. See MPEP § 2106.04(a)(2)(I).
Step 2A Prong 2 Integration into a Practical Application: This judicial exception is not integrated into a practical application. In particular, the claim recites: “a non-transitory computer readable recording medium storing a learning program” and “a computer”. These additional elements amount to no more than adding the words “apply it” (or an equivalent) to the judicial exception, mere instructions to implement an abstract idea on a computer, or mere use of a computer in its ordinary capacity as a tool to perform an existing process. See MPEP § 2106.05(f). Further, the claim recites: “a robot system controlling individual devices that make up a robot”. This additional element generally links the use of the judicial exception to a particular technological environment or field of use. See MPEP § 2106.05(h).
Step 2B Significantly More: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional elements “a non-transitory computer readable recording medium storing a learning program” and “a computer” amount to no more than adding the words “apply it” (or an equivalent) to the judicial exception, mere instructions to implement an abstract idea on a computer, or mere use of a computer in its ordinary capacity as a tool to perform an existing process; such elements cannot provide an inventive concept. See MPEP § 2106.05(f). Further, the additional element “a robot system controlling individual devices that make up a robot” generally links the use of the judicial exception to a particular technological environment or field of use; such an element likewise cannot provide an inventive concept. See MPEP § 2106.05(h). The claim is not patent eligible.
Regarding claim 7, the rejection of claim 2 is incorporated, and further, the claim recites: “determine the difficulty to be set to the target system, using the learning progress and the difficulty to be set to the target system”. This limitation is a continuation of the “determine control to be applied to the target system and difficulty to be set to the target system using observation information which is obtainable from the individual devices of the robot and numerical values of elements for determining difficulty of the control, according to the policy” limitation identified as an abstract idea in the rejection of the parent claim. Thus, the claim recites a judicial exception.
The claim also recites, “the one or more processors are further configured to execute the software instructions to”. This additional element amounts to no more than adding the words “apply it” (or an equivalent) to the judicial exception, mere instructions to implement an abstract idea on a computer, or mere use of a computer in its ordinary capacity as a tool to perform an existing process; such an element cannot provide an inventive concept. See MPEP § 2106.05(f). The claim is not patent eligible.
Regarding claim 8, the rejection of claim 2 is incorporated, and further, the claim recites: “repeatedly execute a calculation process of the difficulty to be set to the target system until the number of processing times exceeds a threshold value”. This limitation recites mathematical concepts in addition to those identified in the rejection of claim 1 and thus recites a judicial exception.
The claim also recites, “the one or more processors are further configured to execute the software instructions to”. This limitation is an additional element that amounts to adding the words “apply it” (or an equivalent) to the judicial exception, mere instructions to implement an abstract idea on a computer, or mere use of a computer in its ordinary capacity as a tool to perform an existing process. See MPEP § 2106.05(f). Such elements cannot provide an inventive concept. The claim is not patent eligible.
Regarding claim 9, the rejection of claim 3 is incorporated, and further, the claim recites: “repeatedly execute a calculation process of the difficulty to be set to the target system until the number of processing times exceeds a threshold value”. This limitation recites mathematical concepts in addition to those identified in the rejection of claim 1 and thus recites a judicial exception.
The claim also recites, “the one or more processors are further configured to execute the software instructions to”. This limitation is an additional element that amounts to adding the words “apply it” (or an equivalent) to the judicial exception, mere instructions to implement an abstract idea on a computer, or mere use of a computer in its ordinary capacity as a tool to perform an existing process. See MPEP § 2106.05(f). Such elements cannot provide an inventive concept. The claim is not patent eligible.
Regarding claim 10, the rejection of claim 1 is incorporated, and further, the claim recites: “simultaneously determine and output the control and the difficulty via the policy”. This limitation is a continuation of the “determine control to be applied to the target system and difficulty to be set to the target system using observation information which is obtainable from the individual devices of the robot and numerical values of elements for determining difficulty of the control, according to the policy” limitation identified as an abstract idea in the rejection of the parent claim. Thus, the claim recites a judicial exception.
The claim also recites, “the one or more processors are further configured to execute the software instructions to”. This limitation is an additional element that amounts to adding the words “apply it” (or an equivalent) to the judicial exception, mere instructions to implement an abstract idea on a computer, or mere use of a computer in its ordinary capacity as a tool to perform an existing process. See MPEP § 2106.05(f). Such elements cannot provide an inventive concept. The claim is not patent eligible.
Regarding claim 11, the rejection of claim 1 is incorporated, and further, the claim recites: “use an extended observation including the determined difficulty and the learning progress of the policy, in addition to the observation information, as an input to the policy”. This limitation is a continuation of the “determine control to be applied to the target system and difficulty to be set to the target system using observation information which is obtainable from the individual devices of the robot and numerical values of elements for determining difficulty of the control, according to the policy” limitation identified as an abstract idea in the rejection of the parent claim. Thus, the claim recites a judicial exception.
The claim also recites, “the one or more processors are further configured to execute the software instructions to”. This limitation is an additional element that amounts to adding the words “apply it” (or an equivalent) to the judicial exception, mere instructions to implement an abstract idea on a computer, or mere use of a computer in its ordinary capacity as a tool to perform an existing process. See MPEP § 2106.05(f). Such elements cannot provide an inventive concept. The claim is not patent eligible.
Regarding claim 12, the rejection of claim 1 is incorporated, and further, the claim recites: “calculate the revised evaluation by increasing or decreasing the original evaluation according to the determined difficulty and the calculated learning progress”. This limitation is a continuation of the “calculate revised evaluation using an original evaluation, the determined difficulty, and the learning progress” limitation identified as an abstract idea in the rejection of the parent claim. Thus, the claim recites a judicial exception.
The claim also recites, “the one or more processors are further configured to execute the software instructions to”. This limitation is an additional element that amounts to adding the words “apply it” (or an equivalent) to the judicial exception, mere instructions to implement an abstract idea on a computer, or mere use of a computer in its ordinary capacity as a tool to perform an existing process. See MPEP § 2106.05(f). Such elements cannot provide an inventive concept. The claim is not patent eligible.
Regarding claim 13, the rejection of claim 11 is incorporated, and further, the claim recites: “generate an extended action including the control and the difficulty as elements”. This limitation recites mathematical concepts in addition to those identified in the rejection of the parent claim and thus recites a judicial exception.
The claim also recites, “the one or more processors are further configured to execute the software instructions to”. This limitation is an additional element that amounts to adding the words “apply it” (or an equivalent) to the judicial exception, mere instructions to implement an abstract idea on a computer, or mere use of a computer in its ordinary capacity as a tool to perform an existing process. See MPEP § 2106.05(f). Such elements cannot provide an inventive concept. The claim is not patent eligible.
Regarding claim 14, the rejection of claim 11 is incorporated, and further, the claim recites: “use an extended observation obtained by concatenating the observation information, the determined difficulty, and the learning progress as an input to the policy”. This limitation is a continuation of the “determine control to be applied to the target system and difficulty to be set to the target system using observation information which is obtainable from the individual devices of the robot and numerical values of elements for determining difficulty of the control, according to the policy” limitation identified as an abstract idea in the rejection of the parent claim. Thus, the claim recites a judicial exception.
The claim also recites, “the one or more processors are further configured to execute the software instructions to”. This limitation is an additional element that amounts to adding the words “apply it” (or an equivalent) to the judicial exception, mere instructions to implement an abstract idea on a computer, or mere use of a computer in its ordinary capacity as a tool to perform an existing process. See MPEP § 2106.05(f). Such elements cannot provide an inventive concept. The claim is not patent eligible.
Regarding claim 15, the rejection of claim 12 is incorporated, and further, the claim recites: “calculate the revised evaluation by multiplying the original evaluation by an adjustment coefficient corresponding to the determined difficulty and the calculated learning progress”. This limitation is a continuation of the “calculate revised evaluation using an original evaluation, the determined difficulty, and the learning progress” limitation identified as an abstract idea in the rejection of the parent claim. Thus, the claim recites a judicial exception.
The claim also recites, “the one or more processors are further configured to execute the software instructions to”. This limitation is an additional element that amounts to adding the words “apply it” (or an equivalent) to the judicial exception, mere instructions to implement an abstract idea on a computer, or mere use of a computer in its ordinary capacity as a tool to perform an existing process. See MPEP § 2106.05(f). Such elements cannot provide an inventive concept. The claim is not patent eligible.
Regarding claim 16, the rejection of claim 15 is incorporated, and further, the claim recites: “the adjustment coefficient is set such that a small decrease rate or a large increase rate is applied regardless of the difficulty when the learning progress is low”. This limitation is a continuation of the “calculate the revised evaluation by multiplying the original evaluation by an adjustment coefficient corresponding to the determined difficulty and the calculated learning progress” limitation identified as an abstract idea in the rejection of the parent claim; thus, the claim recites a judicial exception.
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible.
Regarding claim 17, the rejection of claim 15 is incorporated, and further, the claim recites: “the adjustment coefficient is set such that a larger decrease rate or a smaller increase rate is applied as the difficulty is lower when the learning progress is high”. This limitation is a continuation of the “calculate the revised evaluation by multiplying the original evaluation by an adjustment coefficient corresponding to the determined difficulty and the calculated learning progress” limitation identified as an abstract idea in the rejection of the parent claim; thus, the claim recites a judicial exception.
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible.
Regarding claim 18, the rejection of claim 10 is incorporated, and further, the claim recites: “the difficulty is converted into an environmental parameter that controls state transition characteristics of the target system”. This limitation recites mathematical concepts in addition to those identified in the rejection of the parent claim; thus, the claim recites a judicial exception.
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible.
Regarding claim 19, the rejection of claim 1 is incorporated, and further, the claim recites: “control the target system using the learned policy”. The claim also recites, “the one or more processors are further configured to execute the software instructions to”. Each of these limitations is an additional element that amounts to adding the words “apply it” (or an equivalent) to the judicial exception, mere instructions to implement an abstract idea on a computer, or mere use of a computer in its ordinary capacity as a tool to perform an existing process. See MPEP § 2106.05(f). Such elements cannot provide an inventive concept. The claim is not patent eligible.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-3, 5-7, and 10-19 are rejected under 35 U.S.C. 103 as being unpatentable over Fabisch et al., “Accounting for Task-Difficulty in Active Multi-Task Robot Control Learning,” 05/01/2015, https://link.springer.com/article/10.1007/s13218-015-0363-2, hereinafter referred to as “Fabisch,” in view of Bello et al., U.S. Patent Application Publication No. 2020/0057941, hereinafter referred to as “Bello.”
Regarding claim 1, Fabisch teaches A learning device learning a policy that determines how to control a target system that is a robot system controlling individual devices of a robot (Fabisch, Page 369, Abstract, Lines 1-7, “Contextual policy search is a reinforcement learning approach for multi-task learning in the context of robot control learning. It can be used to learn versatilely applicable skills that generalize over a range of tasks specified by a context vector. In this work, we combine contextual policy search with ideas from active learning for selecting the task in which the next trial will be performed”; Fabisch, Page 374, Section 4.3, Line 1, “We use the simulated arm of the robot Artemis”; The “arm of the robot Artemis” is considered to be the “robot system” and the joints are considered to be the “individual devices of a robot”), comprising: a memory storing software instructions, and one or more processors configured to execute the software instructions (Fabisch, Page 374, Paragraph 2, Lines 1-2, “We perform 50 runs with 10,000 episodes. The learning curves are displayed in Fig. 2b”; It would be obvious to a person of ordinary skill in the art that this must be performed on a computer, which provides evidence for a memory, software instructions, and one or more processors) to
determine control to be applied to the target system and difficulty to be set to the target system using observation information which is obtainable from the individual devices of the robot and numerical values of elements for determining difficulty of the control, according to the policy (Fabisch, Page 371, Section 3.2, Line 1, “We are given a set of observations D = {s_i, r_i}_{i=1}^n”; Fabisch, Page 371, Paragraph 2, Lines 1-4, “We are interested in inferring the function V from observations D, i.e., learn an estimate V̂ of V. One natural constraint on the estimate is that V̂(s_i) ≥ r_i, i.e., V̂ shall be an upper boundary on D”; The “upper boundary” is considered to be the “difficulty to be set to the target system”; Fabisch, Page 372, Paragraph 2, Lines 7-10, “We propose now a new method, the positive upper boundary support vector estimation (PUBSVE), for learning such a model of the upper boundary”; Fabisch, Page 374, Section 4.2, Lines 1-6, “The catapult benchmark problem has been introduced by da Silva et al. [15]. It is shown in Fig. 3a. The goal is to learn an upper-level policy that generates appropriate actions a_i, consisting of the angle θ_i ∈ [0, π/2] and velocity v_i ∈ [5, 10] of the catapult’s shot, such that a specific target position, the context s_i ∈ [2, 10], is hit”; The “actions” are considered to be the “control[s]”; Fabisch, Page 371, Section 3.1, Paragraph 2, Lines 1-2, “Samples (s, a, r(s, a)) are required for contextual policy search”);
calculate learning progress of the policy using a plurality of original evaluations of states before and after transition of the target system and the determined control, according to the control and the difficulty to be set to the target system (Fabisch, Page 375, Section 5, Paragraph 2, Lines 1-8, “Some of the evaluated intrinsic reward heuristics use a baseline to estimate the learning progress so that it is comparable between different contexts. This baseline can be an estimate of the upper boundary of the reward in the corresponding context; for instance, the monotonic progress heuristic computes the learning progress depending on the maximum of the previous rewards b as max(0, r_{s,a} − b). We can now replace b by V̂(s)”; Because the learning progress is calculated based on “previous rewards” and the rewards are considered to be the “original evaluations”, it is considered to be “before and after transition”);
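Examiner’s note: for illustration only, the monotonic progress heuristic quoted above, max(0, r_{s,a} − b), may be sketched as follows. This is not code from Fabisch; the function and variable names are hypothetical.

```python
# Illustration of the monotonic progress heuristic max(0, r - b): learning
# progress is the improvement of the current reward over the baseline b,
# where b is the maximum of the rewards previously obtained in the context.
def monotonic_progress(reward, previous_rewards):
    b = max(previous_rewards) if previous_rewards else 0.0
    return max(0.0, reward - b)  # progress is only counted when improving
```

For example, with previous rewards 0.2 and 0.5, a new reward of 0.7 yields a learning progress of about 0.2, while a new reward of 0.3 yields no progress.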
…
calculate revised evaluation using an original evaluation, the determined difficulty, and the learning progress (Fabisch, Page 372, Section 3.3, Lines 1-5, “In this section, we describe how PUBSVE allows to normalize the obtained reward: for a given context s, we map the PUBSVE prediction V̂(s) onto 1 and the context’s typical reward level r̃(s) onto 0. This can be achieved via r̄_i = (r_i − r̃(s_i)) / (V̂(s_i) − r̃(s_i))”; r̄_i is considered to be the “revised evaluation”; Fabisch, Page 371, Section 3.1, Paragraph 3, Lines 5-8, “We can then use these normalized rewards for weighting experience samples in the policy update (see Sect. 4) and to identify contexts in which we can make greater learning progress (see Sect. 5)”; Because the learning progress is used to select the context, “s”, and the context is used in calculating the “revised evaluation”, the learning progress is used in calculating the “revised evaluation”); and
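Examiner’s note: for illustration only, the quoted normalization is a linear rescaling that maps V̂(s_i) onto 1 and r̃(s_i) onto 0. A minimal sketch follows, with hypothetical names and assuming V̂(s_i) > r̃(s_i); it is not code from Fabisch.

```python
# Linear rescaling of a reward r so that the estimated upper boundary
# v_hat = V̂(s_i) maps to 1 and the typical reward level r_typ = r̃(s_i)
# maps to 0, i.e. r_bar = (r - r_typ) / (v_hat - r_typ).
def normalize_reward(r, v_hat, r_typ):
    assert v_hat > r_typ, "upper boundary must exceed the typical reward level"
    return (r - r_typ) / (v_hat - r_typ)
```

For example, with v_hat = 1.0 and r_typ = 0.5, a reward of 0.75 normalizes to 0.5, a reward of 0.5 to 0, and a reward of 1.0 to 1.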
update the policy using the observation information, the control, the difficulty to be set to the target system, and the revised evaluation (Fabisch, Page 371, Section 3.1, Paragraph 3, Lines 5-8, “We can then use these normalized rewards for weighting experience samples in the policy update (see Sect. 4) and to identify contexts in which we can make greater learning progress (see Sect. 5)”; The “normalized rewards” are considered to be the “revised evaluation[s]”, further, the revised evaluations are calculated using the difficulty set to the target system, and the control; Fabisch, Page 372, Section 4, Lines 1-7, “Contextual policy search methods like C-REPS perform a search through the space of policies where updates of the policy are done such that one moves in the direction of increasing expected return while, at the same time, bounding the loss of information (measured using relative entropy) between the observed data distribution and the data distribution generated by the new policy”; see also Fabisch, Page 372-373, Section 4).
Fabisch does not explicitly teach calculate a moving average or a moving standard deviation of a plurality of accumulated non-adjusted rewards as the learning progress.
Bello teaches calculate a moving average or a moving standard deviation of a plurality of accumulated non-adjusted rewards as the learning progress (Bello, Paragraph 0078, “the system trains the controller neural network to maximize the expected reward using a policy gradient technique. For example, the policy gradient technique can be a REINFORCE technique or a Proximal Policy Optimization (PPO) technique. For either technique, the system can use the exponential moving average of previous rewards as a baseline in order to stabilize the training”).
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention, to have modified the learning progress taught by Fabisch to include calculating a moving average of accumulated non-adjusted rewards as the learning progress, as taught by Bello. The motivation to do so would have been that using a moving average of rewards stabilizes the training of the model (Bello, Paragraph 0078).
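Examiner’s note: for illustration only, the exponential moving average of previous rewards described in the cited passage of Bello may be sketched as follows. The decay value and names are hypothetical choices, not taken from Bello.

```python
# Exponential moving average of previous rewards, usable as a baseline
# to stabilize policy-gradient training (cf. Bello, Paragraph 0078).
def update_ema_baseline(baseline, reward, decay=0.9):
    if baseline is None:  # the first reward initializes the baseline
        return reward
    return decay * baseline + (1.0 - decay) * reward

baseline = None
for r in [1.0, 0.0, 1.0]:
    baseline = update_ema_baseline(baseline, r)
```

After the three rewards above, the baseline is approximately 0.91, reflecting mostly the older rewards with a small contribution from the newest one.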
Regarding claim 2, the rejection of claim 1 is incorporated, and further, the proposed combination teaches wherein the one or more processors are further configured to execute the software instructions to determine the control and the difficulty to be set to the target system further using the learning progress (Fabisch, Page 375, Section 5, Paragraph 2, Lines 1-8, “Some of the evaluated intrinsic reward heuristics use a baseline to estimate the learning progress so that it is comparable between different contexts. This baseline can be an estimate of the upper boundary of the reward in the corresponding context; for instance, the monotonic progress heuristic computes the learning progress depending on the maximum of the previous rewards b as max(0, r_{s,a} − b). We can now replace b by V̂(s)”; The learning progress is used to select the context, which is used to determine the control and the difficulty).
Regarding claim 3, the rejection of claim 1 is incorporated, and further, the proposed combination teaches the one or more processors are further configured to execute the software instructions to determine the difficulty to be set to the target system, using the learning progress and the difficulty to be set to the target system (Fabisch, Page 371, Section 3.2, Line 1, “We are given a set of observations D = {s_i, r_i}_{i=1}^n”; Fabisch, Page 371, Paragraph 2, Lines 1-4, “We are interested in inferring the function V from observations D, i.e., learn an estimate V̂ of V. One natural constraint on the estimate is that V̂(s_i) ≥ r_i, i.e., V̂ shall be an upper boundary on D”; The “upper boundary” is considered to be the “difficulty to be set to the target system”; Fabisch, Page 372, Paragraph 2, Lines 7-10, “We propose now a new method, the positive upper boundary support vector estimation (PUBSVE), for learning such a model of the upper boundary”; Fabisch, Page 375, Section 5, Paragraph 2, Lines 1-8, “Some of the evaluated intrinsic reward heuristics use a baseline to estimate the learning progress so that it is comparable between different contexts. This baseline can be an estimate of the upper boundary of the reward in the corresponding context; for instance, the monotonic progress heuristic computes the learning progress depending on the maximum of the previous rewards b as max(0, r_{s,a} − b). We can now replace b by V̂(s)”; The learning progress is used to select the context, which is used to determine the difficulty).
Regarding claim 5, Fabisch teaches A learning method, implemented by a processor, learning a policy that determines how to control a target system that is a robot system controlling individual devices that make up a robot (Fabisch, Page 369, Abstract, Lines 1-7, “Contextual policy search is a reinforcement learning approach for multi-task learning in the context of robot control learning. It can be used to learn versatilely applicable skills that generalize over a range of tasks specified by a context vector. In this work, we combine contextual policy search with ideas from active learning for selecting the task in which the next trial will be performed”; Fabisch, Page 374, Section 4.3, Line 1, “We use the simulated arm of the robot Artemis”; The “arm of the robot Artemis” is considered to be the “robot system” and the joints are considered to be the “individual devices of a robot”; Fabisch, Page 374, Paragraph 2, Lines 1-2, “We perform 50 runs with 10,000 episodes. The learning curves are displayed in Fig. 2b”; It would be obvious to a person of ordinary skill in the art that this must be performed on a computer, which provides evidence for a processor), comprising
determining control to be applied to the target system and difficulty to be set to the target system using observation information which is obtainable from the individual devices of the robot and numerical values of elements for determining difficulty of the control, according to the policy (Fabisch, Page 371, Section 3.2, Line 1, “We are given a set of observations D = {s_i, r_i}_{i=1}^n”; Fabisch, Page 371, Paragraph 2, Lines 1-4, “We are interested in inferring the function V from observations D, i.e., learn an estimate V̂ of V. One natural constraint on the estimate is that V̂(s_i) ≥ r_i, i.e., V̂ shall be an upper boundary on D”; The “upper boundary” is considered to be the “difficulty to be set to the target system”; Fabisch, Page 372, Paragraph 2, Lines 7-10, “We propose now a new method, the positive upper boundary support vector estimation (PUBSVE), for learning such a model of the upper boundary”; Fabisch, Page 374, Section 4.2, Lines 1-6, “The catapult benchmark problem has been introduced by da Silva et al. [15]. It is shown in Fig. 3a. The goal is to learn an upper-level policy that generates appropriate actions a_i, consisting of the angle θ_i ∈ [0, π/2] and velocity v_i ∈ [5, 10] of the catapult’s shot, such that a specific target position, the context s_i ∈ [2, 10], is hit”; The “actions” are considered to be the “control[s]”; Fabisch, Page 371, Section 3.1, Paragraph 2, Lines 1-2, “Samples (s, a, r(s, a)) are required for contextual policy search”);
calculating learning progress of the policy using a plurality of original evaluations of states before and after transition of the target system and the control, according to the control and the determined difficulty (Fabisch, Page 375, Section 5, Paragraph 2, Lines 1-8, “Some of the evaluated intrinsic reward heuristics use a baseline to estimate the learning progress so that it is comparable between different contexts. This baseline can be an estimate of the upper boundary of the reward in the corresponding context; for instance, the monotonic progress heuristic computes the learning progress depending on the maximum of the previous rewards b as max(0, r_{s,a} − b). We can now replace b by V̂(s)”; Because the learning progress is calculated based on “previous rewards” and the rewards are considered to be the “original evaluations”, it is considered to be “before and after transition”);
…
calculating revised evaluation using an original evaluation, the determined difficulty, and the learning progress (Fabisch, Page 372, Section 3.3, Lines 1-5, “In this section, we describe how PUBSVE allows to normalize the obtained reward: for a given context s, we map the PUBSVE prediction V̂(s) onto 1 and the context’s typical reward level r̃(s) onto 0. This can be achieved via r̄_i = (r_i − r̃(s_i)) / (V̂(s_i) − r̃(s_i))”; r̄_i is considered to be the “revised evaluation”; Fabisch, Page 371, Section 3.1, Paragraph 3, Lines 5-8, “We can then use these normalized rewards for weighting experience samples in the policy update (see Sect. 4) and to identify contexts in which we can make greater learning progress (see Sect. 5)”; Because the learning progress is used to select the context, “s”, and the context is used in calculating the “revised evaluation”, the learning progress is used in calculating the “revised evaluation”); and
updating the policy using the observation information, the control, the determined difficulty, and the revised evaluation (Fabisch, Page 371, Section 3.1, Paragraph 3, Lines 5-8, “We can then use these normalized rewards for weighting experience samples in the policy update (see Sect. 4) and to identify contexts in which we can make greater learning progress (see Sect. 5)”; The “normalized rewards” are considered to be the “revised evaluation[s]”, further, the revised evaluations are calculated using the difficulty set to the target system, and the control; Fabisch, Page 372, Section 4, Lines 1-7, “Contextual policy search methods like C-REPS perform a search through the space of policies where updates of the policy are done such that one moves in the direction of increasing expected return while, at the same time, bounding the loss of information (measured using relative entropy) between the observed data distribution and the data distribution generated by the new policy”; see also Fabisch, Page 372-373, Section 4).
Fabisch does not explicitly teach calculating a moving average or a moving standard deviation of a plurality of accumulated non-adjusted rewards as the learning progress.
Bello teaches calculating a moving average or a moving standard deviation of a plurality of accumulated non-adjusted rewards as the learning progress (Bello, Paragraph 0078, “the system trains the controller neural network to maximize the expected reward using a policy gradient technique. For example, the policy gradient technique can be a REINFORCE technique or a Proximal Policy Optimization (PPO) technique. For either technique, the system can use the exponential moving average of previous rewards as a baseline in order to stabilize the training”).
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention, to have modified the learning progress taught by Fabisch to include calculating a moving average of accumulated non-adjusted rewards as the learning progress, as taught by Bello. The motivation to do so would have been that using a moving average of rewards stabilizes the training of the model (Bello, Paragraph 0078).
Regarding claim 6, Fabisch teaches A non-transitory computer readable recording medium storing a learning program for learning a policy that determines how to control a target system that is a robot system controlling individual devices that make up a robot (Fabisch, Page 369, Abstract, Lines 1-7, “Contextual policy search is a reinforcement learning approach for multi-task learning in the context of robot control learning. It can be used to learn versatilely applicable skills that generalize over a range of tasks specified by a context vector. In this work, we combine contextual policy search with ideas from active learning for selecting the task in which the next trial will be performed”; Fabisch, Page 374, Section 4.3, Line 1, “We use the simulated arm of the robot Artemis”; The “arm of the robot Artemis” is considered to be the “robot system” and the joints are considered to be the “individual devices of a robot”; Fabisch, Page 374, Paragraph 2, Lines 1-2, “We perform 50 runs with 10,000 episodes. The learning curves are displayed in Fig. 2b”; It would be obvious to a person of ordinary skill in the art that this must be performed on a computer, which provides evidence for a non-transitory computer readable recording medium, and a learning program), wherein the learning program causes a computer to execute:
a process of determining control to be applied to the target system and difficulty to be set to the target system using observation information which is obtainable from the individual devices of the robot and numerical values of elements for determining difficulty of the control, according to the policy (Fabisch, Page 371, Section 3.2, Line 1, “We are given a set of observations D = {s_i, r_i}_{i=1}^{n}”; Fabisch, Page 371, Paragraph 2, Lines 1-4, “We are interested in inferring the function V from observations D, i.e., learn an estimate V̂ of V. One natural constraint on the estimate is that V̂(s_i) ≥ r_i, i.e., V̂ shall be an upper boundary on D”; The “upper boundary” is considered to be the “difficulty to be set to the target system”; Fabisch, Page 372, Paragraph 2, Lines 7-10, “We propose now a new method, the positive upper boundary support vector estimation (PUBSVE), for learning such a model of the upper boundary”; Fabisch, Page 374, Section 4.2, Lines 1-6, “The catapult benchmark problem has been introduced by da Silva et al. [15]. It is shown in Fig. 3a. The goal is to learn an upper-level policy that generates appropriate actions a_i, consisting of the angle θ_i ∈ [0, π/2] and velocity v_i ∈ [5, 10] of the catapult’s shot, such that a specific target position, the context s_i ∈ [2, 10], is hit”; The “actions” are considered to be the “control[s]”; Fabisch, Page 371, Section 3.1, Paragraph 2, Lines 1-2, “Samples (s, a, r(s, a)) are required for contextual policy search”);
a process of calculating learning progress of the policy using a plurality of original evaluations of states before and after transition of the target system and the control, according to the control and the determined difficulty (Fabisch, Page 375, Section 5, Paragraph 2, Lines 1-8, “Some of the evaluated intrinsic reward heuristics use a baseline to estimate the learning progress so that it is comparable between different contexts. This baseline can be an estimate of the upper boundary of the reward in the corresponding context; for instance, the monotonic progress heuristic computes the learning progress depending on the maximum of the previous rewards b as max(0, r_{s,a} - b). We can now replace b by V̂(s)”; Because the learning progress is calculated based on “previous rewards” and the rewards are considered to be the “original evaluations”, it is considered to be “before and after transition”);
…
a process of calculating revised evaluation using an original evaluation, the determined difficulty, and the learning progress (Fabisch, Page 372, Section 3.3, Lines 1-5, “In this section, we describe how PUBSVE allows to normalize the obtained reward: for a given context s, we map the PUBSVE prediction V̂(s) onto 1 and the context’s typical reward level r̃(s) onto 0. This can be achieved via r_i^- = (r_i - r̃(s_i)) / (V̂(s_i) - r̃(s_i))”; r_i^- is considered to be the “revised evaluation”; Fabisch, Page 371, Section 3.1, Paragraph 3, Lines 5-8, “We can then use these normalized rewards for weighting experience samples in the policy update (see Sect. 4) and to identify contexts in which we can make greater learning progress (see Sect. 5)”; Because the learning progress is used to select the context, “s”, and the context is used in calculating the “revised evaluation”, the learning progress is used in calculating the “revised evaluation”); and
a process of updating the policy using the observation information, the control, the determined difficulty, and the revised evaluation (Fabisch, Page 371, Section 3.1, Paragraph 3, Lines 5-8, “We can then use these normalized rewards for weighting experience samples in the policy update (see Sect. 4) and to identify contexts in which we can make greater learning progress (see Sect. 5)”; The “normalized rewards” are considered to be the “revised evaluation[s]”; further, the revised evaluations are calculated using the difficulty set to the target system and the control; Fabisch, Page 372, Section 4, Lines 1-7, “Contextual policy search methods like C-REPS perform a search through the space of policies where updates of the policy are done such that one moves in the direction of increasing expected return while, at the same time, bounding the loss of information (measured using relative entropy) between the observed data distribution and the data distribution generated by the new policy”; see also Fabisch, Pages 372-373, Section 4).
Fabisch does not explicitly teach a process of calculating a moving average or a moving standard deviation of a plurality of accumulated non-adjusted rewards as the learning progress.
Bello teaches a process of calculating a moving average or a moving standard deviation of a plurality of accumulated non-adjusted rewards as the learning progress (Bello, Paragraph 0078, “the system trains the controller neural network to maximize the expected reward using a policy gradient technique. For example, the policy gradient technique can be a REINFORCE technique or a Proximal Policy Optimization (PPO) technique. For either technique, the system can use the exponential moving average of previous rewards as a baseline in order to stabilize the training”).
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention, to have modified the learning progress taught by Fabisch to include calculating a moving average of accumulated non-adjusted rewards as the learning progress as taught by Bello. The motivation to do so would have been that using a moving average of rewards stabilizes the training of the model (Bello, Paragraph 0078).
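For context, the exponential-moving-average baseline quoted from Bello can be illustrated with a minimal Python sketch. This is an illustrative example only, not code from either reference; the decay constant, the zero initialization, and the reward values are assumptions.

```python
def ema_baseline(rewards, decay=0.9):
    """Exponential moving average of previous rewards, usable as a
    baseline to stabilize policy-gradient training (cf. Bello, Par. 0078).
    The baseline is initialized at zero for simplicity."""
    baseline = 0.0
    baselines = []
    for r in rewards:
        baseline = decay * baseline + (1.0 - decay) * r
        baselines.append(baseline)
    return baselines

# Advantage of each trial: reward minus the running baseline.
rewards = [1.0, 2.0, 3.0]
bases = ema_baseline(rewards)
advantages = [r - b for r, b in zip(rewards, bases)]
```

Subtracting such a slowly varying baseline from the raw reward reduces the variance of the policy-gradient estimate, which is the stabilization effect the quoted passage describes.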
Regarding claim 7, the rejection of claim 2 is incorporated, and further, the proposed combination teaches wherein the one or more processors are further configured to execute the software instructions to determine the difficulty to be set to the target system, using the learning progress and the difficulty to be set to the target system (Fabisch, Page 371, Section 3.2, Line 1, “We are given a set of observations D = {s_i, r_i}_{i=1}^{n}”; Fabisch, Page 371, Paragraph 2, Lines 1-4, “We are interested in inferring the function V from observations D, i.e., learn an estimate V̂ of V. One natural constraint on the estimate is that V̂(s_i) ≥ r_i, i.e., V̂ shall be an upper boundary on D”; The “upper boundary” is considered to be the “difficulty to be set to the target system”; Fabisch, Page 372, Paragraph 2, Lines 7-10, “We propose now a new method, the positive upper boundary support vector estimation (PUBSVE), for learning such a model of the upper boundary”; Fabisch, Page 375, Section 5, Paragraph 2, Lines 1-8, “Some of the evaluated intrinsic reward heuristics use a baseline to estimate the learning progress so that it is comparable between different contexts. This baseline can be an estimate of the upper boundary of the reward in the corresponding context; for instance, the monotonic progress heuristic computes the learning progress depending on the maximum of the previous rewards b as max(0, r_{s,a} - b). We can now replace b by V̂(s)”; The learning progress is used to select the context, which is used to determine the difficulty).
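The monotonic progress heuristic quoted from Fabisch can be sketched as follows. This is an illustration of the cited formula only, not code from the reference; the reward and baseline values are assumptions.

```python
def learning_progress(reward, baseline):
    """Monotonic progress heuristic: max(0, r(s, a) - b), where the
    baseline b may be the maximum of previous rewards or a learned
    upper-bound estimate V-hat(s) (cf. Fabisch, Sect. 5)."""
    return max(0.0, reward - baseline)

# A reward above the best previous reward counts as progress;
# a reward below it counts as zero progress (never negative).
progress_up = learning_progress(reward=0.8, baseline=0.5)
progress_none = learning_progress(reward=0.4, baseline=0.5)
```

Clamping at zero makes the heuristic monotonic: only improvements over the baseline register as learning progress.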
Regarding claim 10, the rejection of claim 1 is incorporated, and further, the proposed combination teaches wherein the one or more processors are configured to execute the software instructions to simultaneously determine and output the control and the difficulty via the policy (Fabisch, Page 371, Section 3.1, Paragraph 2, Lines 1-4, “Samples (s, a, r(s, a)) are required for contextual policy search to update the estimate of p, i.e. we perform trials by testing parameters a in context s and observing the corresponding reward r(s, a)”; Fabisch, Page 372, Section 3.3, Paragraph 2, Lines 1-8, “Since the PUBSVE needs to be updated once new samples arrive, an incremental training procedure is desirable. For this, instead of training the PUBSVE on the whole set D = {s_i, r_i}_{i=1}^{n}, we can update it incrementally [20], where we forget every old example (s_i, r_i) except the support vectors s_i with α_i > 0, collect new samples, and use the new samples and the retained support vectors to update V̂”).
Regarding claim 11, the rejection of claim 1 is incorporated, and further, the proposed combination teaches wherein the one or more processors are configured to execute the software instructions to use an extended observation including the determined difficulty and the learning progress of the policy, in addition to the observation information, as an input to the policy (Fabisch, Page 373, Paragraph 3, Lines 13-18, “we suggest to use a softmax for training set selection instead of the maximum, which means that each of the samples (s_i, a_i, r_i) that we observed will be selected with probability p_{s_i, a_i, r_i} = exp(τ r_i^-) / Σ_j exp(τ r_j^-), where τ = 10 in our experiments and r_i^- is the normalized reward (see Sect. 3.3)”; Fabisch, Page 372, Section 3.3, Lines 1-5, “In this section, we describe how PUBSVE allows to normalize the obtained reward: for a given context s, we map the PUBSVE prediction V̂(s) onto 1 and the context’s typical reward level r̃(s) onto 0. This can be achieved via r_i^- = (r_i - r̃(s_i)) / (V̂(s_i) - r̃(s_i))”; “V̂(s_i)” is used to calculate “r_i^-” and “s_i” is determined by the learning progress, thus because the training set is determined based on these values, they are considered “an input to the policy”).
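The normalization and softmax training-set selection quoted from Fabisch can be sketched together in Python. This is an illustrative example only; the reward values, typical reward level, and upper bound are assumptions, and only τ = 10 is taken from the quoted passage.

```python
import math

def normalize_reward(r, r_typical, v_upper):
    """Map the context's typical reward level onto 0 and the
    upper-bound prediction onto 1 (cf. Fabisch, Sect. 3.3):
    r_i^- = (r_i - r~(s_i)) / (V^(s_i) - r~(s_i))."""
    return (r - r_typical) / (v_upper - r_typical)

def softmax_selection_probs(normalized_rewards, tau=10.0):
    """Probability of selecting each sample (s_i, a_i, r_i) for the
    training set (cf. Fabisch, Sect. 5, with tau = 10)."""
    weights = [math.exp(tau * r) for r in normalized_rewards]
    total = sum(weights)
    return [w / total for w in weights]

# Two hypothetical trials in a context with typical reward 0.2 and
# estimated upper bound 1.0; the better trial is selected more often.
r_norm = [normalize_reward(0.9, 0.2, 1.0), normalize_reward(0.5, 0.2, 1.0)]
probs = softmax_selection_probs(r_norm)
```

The softmax with a moderately large τ concentrates selection probability on samples whose normalized reward is close to the estimated upper bound, while still occasionally selecting weaker samples.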
Regarding claim 12, the rejection of claim 1 is incorporated, and further, the proposed combination teaches wherein the one or more processors are configured to execute the software instructions to calculate the revised evaluation by increasing or decreasing the original evaluation according to the determined difficulty and the calculated learning progress (Fabisch, Page 372, Section 3.3, Lines 1-5, “In this section, we describe how PUBSVE allows to normalize the obtained reward: for a given context s, we map the PUBSVE prediction V̂(s) onto 1 and the context’s typical reward level r̃(s) onto 0. This can be achieved via r_i^- = (r_i - r̃(s_i)) / (V̂(s_i) - r̃(s_i))”; “V̂(s_i) - r̃(s_i)” is considered to perform the increasing or decreasing and, because it includes “V̂(s_i)”, and “s_i” is determined using the learning progress, the original evaluation is increased or decreased according to the determined difficulty and the learning progress).
Regarding claim 13, the rejection of claim 11 is incorporated, and further, the proposed combination teaches wherein the one or more processors are configured to execute the software instructions to generate an extended action including the control and the difficulty as elements (Fabisch, Page 373, Paragraph 3, Lines 13-18, “we suggest to use a softmax for training set selection instead of the maximum, which means that each of the samples (s_i, a_i, r_i) that we observed will be selected with probability p_{s_i, a_i, r_i} = exp(τ r_i^-) / Σ_j exp(τ r_j^-), where τ = 10 in our experiments and r_i^- is the normalized reward (see Sect. 3.3)”; Fabisch, Page 372, Section 3.3, Lines 1-5, “In this section, we describe how PUBSVE allows to normalize the obtained reward: for a given context s, we map the PUBSVE prediction V̂(s) onto 1 and the context’s typical reward level r̃(s) onto 0. This can be achieved via r_i^- = (r_i - r̃(s_i)) / (V̂(s_i) - r̃(s_i))”; “V̂(s_i)” is used to calculate “r_i^-”, and “s_i” is determined by the learning progress, thus because the training set of samples is determined based on these values, each sample is considered an “extended action” and the control and difficulty are used in their selection and thus are considered “elements” of them).
Regarding claim 14, the rejection of claim 11 is incorporated, and further, the proposed combination teaches wherein the one or more processors are configured to execute the software instructions to use an extended observation obtained by concatenating the observation information, the determined difficulty, and the learning progress as an input to the policy (Fabisch, Page 373, Paragraph 3, Lines 13-18, “we suggest to use a softmax for training set selection instead of the maximum, which means that each of the samples (s_i, a_i, r_i) that we observed will be selected with probability p_{s_i, a_i, r_i} = exp(τ r_i^-) / Σ_j exp(τ r_j^-), where τ = 10 in our experiments and r_i^- is the normalized reward (see Sect. 3.3)”; Fabisch, Page 372, Section 3.3, Lines 1-5, “In this section, we describe how PUBSVE allows to normalize the obtained reward: for a given context s, we map the PUBSVE prediction V̂(s) onto 1 and the context’s typical reward level r̃(s) onto 0. This can be achieved via r_i^- = (r_i - r̃(s_i)) / (V̂(s_i) - r̃(s_i))”; “concatenating” is given its plain meaning under the broadest reasonable interpretation, which includes a connected series of events; “V̂(s_i)” is used to calculate “r_i^-” and “s_i” is determined by the learning progress, thus because the training set is determined based on these values, they are considered “an input to the policy”).
Regarding claim 15, the rejection of claim 12 is incorporated, and further, the proposed combination teaches wherein the one or more processors are configured to execute the software instructions to calculate the revised evaluation by multiplying the original evaluation by an adjustment coefficient corresponding to the determined difficulty and the calculated learning progress (Fabisch, Page 372, Section 3.3, Lines 1-5, “In this section, we describe how PUBSVE allows to normalize the obtained reward: for a given context s, we map the PUBSVE prediction V̂(s) onto 1 and the context’s typical reward level r̃(s) onto 0. This can be achieved via r_i^- = (r_i - r̃(s_i)) / (V̂(s_i) - r̃(s_i))”; r_i^- is considered to be the “revised evaluation”, and “V̂(s_i) - r̃(s_i)” is considered to be the adjustment coefficient; because “s_i” is determined using the learning progress, the adjustment coefficient corresponds to the learning progress).
Regarding claim 16, the rejection of claim 15 is incorporated, and further, the proposed combination teaches wherein the adjustment coefficient is set such that a small decrease rate or a large increase rate is applied regardless of the difficulty when the learning progress is low (Fabisch, Page 372, Section 3.3, Lines 1-5, “In this section, we describe how PUBSVE allows to normalize the obtained reward: for a given context s, we map the PUBSVE prediction V̂(s) onto 1 and the context’s typical reward level r̃(s) onto 0. This can be achieved via r_i^- = (r_i - r̃(s_i)) / (V̂(s_i) - r̃(s_i))”; r_i^- is considered to be the “revised evaluation”, and “V̂(s_i) - r̃(s_i)” is considered to be the adjustment coefficient; because the learning progress is calculated using the rewards, and the rewards are present in the adjustment coefficient, the increase/decrease rate is affected when the learning progress is low).
Regarding claim 17, the rejection of claim 15 is incorporated, and further, the proposed combination teaches wherein the adjustment coefficient is set such that a larger decrease rate or a smaller increase rate is applied as the difficulty is lower when the learning progress is high (Fabisch, Page 372, Section 3.3, Lines 1-5, “In this section, we describe how PUBSVE allows to normalize the obtained reward: for a given context s, we map the PUBSVE prediction V̂(s) onto 1 and the context’s typical reward level r̃(s) onto 0. This can be achieved via r_i^- = (r_i - r̃(s_i)) / (V̂(s_i) - r̃(s_i))”; r_i^- is considered to be the “revised evaluation”, and “V̂(s_i) - r̃(s_i)” is considered to be the adjustment coefficient; because the learning progress is calculated using the rewards, and the rewards are present in the adjustment coefficient, the increase/decrease rate is affected when the learning progress is high).
Regarding claim 18, the rejection of claim 10 is incorporated, and further, the proposed combination teaches wherein the difficulty is converted into an environmental parameter that controls state transition characteristics of the target system (Fabisch, Page 372, Section 3.3, Lines 1-5, “In this section, we describe how PUBSVE allows to normalize the obtained reward: for a given context s, we map the PUBSVE prediction V̂(s) onto 1 and the context’s typical reward level r̃(s) onto 0. This can be achieved via r_i^- = (r_i - r̃(s_i)) / (V̂(s_i) - r̃(s_i))”; r_i^- is considered to be the “revised evaluation” and is also considered an “environmental parameter”, which is used in training data selection, thus affecting “state transition characteristics of the target system” because selecting different training data will affect the learning of the policy).
Regarding claim 19, the rejection of claim 1 is incorporated, and further, the proposed combination teaches wherein the one or more processors are further configured to execute the software instructions to control the target system using the learned policy (Fabisch, Page 374, Section 4.2, Lines 1-6, “The catapult benchmark problem has been introduced by da Silva et al. [15]. It is shown in Fig. 3a. The goal is to learn an upper-level policy that generates appropriate actions a_i, consisting of the angle θ_i ∈ [0, π/2] and velocity v_i ∈ [5, 10] of the catapult’s shot, such that a specific target position, the context s_i ∈ [2, 10], is hit”).
Claims 4 and 8-9 are rejected under 35 U.S.C. 103 as being unpatentable over Fabisch in view of Bello in further view of Ivanovic et al., BaRC: Backward Reachability Curriculum for Robotic Reinforcement Learning, 09/17/2018, https://arxiv.org/pdf/1806.06161, hereinafter referred to as “Ivanovic”.
Regarding claim 4, the rejection of claim 1 is incorporated, and further, the proposed combination teaches the one or more processors are further configured to execute the software instructions to repeatedly execute a calculation process of the difficulty to be set to the target system (Fabisch, Page 372, Section 3.3, Paragraph 2, Lines 1-3, “Since the PUBSVE needs to be updated once new samples arrive, an incremental training procedure is desirable”).
The proposed combination does not explicitly teach repeatedly executing a calculation process until the number of processing times exceeds a threshold value.
Ivanovic teaches repeatedly executing a calculation process until the number of processing times exceeds a threshold value (Ivanovic, Page 3, Section IV, Paragraph 4, Lines 3-5, “The horizon T used in the BRS computation controls how much the task difficulty increases between stages of the curriculum”; Ivanovic, Page 4, Algorithm 1, Line 7, “starts_set ← EXPANDBACKWARDS(starts, M, T)”; Because “starts_set” is only calculated for each iteration of the for loop on line 6, a person of ordinary skill in the art would recognize that once “i” exceeds the termination condition, the difficulty will not be calculated).
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention, to modify the difficulty calculation of the proposed combination to stop when the number of processing times exceeds a threshold value, as taught by Ivanovic. The motivation to do so would have been to reduce computational complexity by computing the difficulty only when necessary (Ivanovic, Page 4, Paragraph 2; see also Ivanovic, Page 4, Algorithm 1).
Regarding claim 8, the rejection of claim 2 is incorporated, and further, the proposed combination teaches the one or more processors are further configured to execute the software instructions to repeatedly execute a calculation process of the difficulty to be set to the target system (Fabisch, Page 372, Section 3.3, Paragraph 2, Lines 1-3, “Since the PUBSVE needs to be updated once new samples arrive, an incremental training procedure is desirable”).
The proposed combination does not explicitly teach repeatedly executing a calculation process until the number of processing times exceeds a threshold value.
Ivanovic teaches repeatedly executing a calculation process until the number of processing times exceeds a threshold value (Ivanovic, Page 3, Section IV, Paragraph 4, Lines 3-5, “The horizon T used in the BRS computation controls how much the task difficulty increases between stages of the curriculum”; Ivanovic, Page 4, Algorithm 1, Line 7, “starts_set ← EXPANDBACKWARDS(starts, M, T)”; Because “starts_set” is only calculated for each iteration of the for loop on line 6, a person of ordinary skill in the art would recognize that once “i” exceeds the termination condition, the difficulty will not be calculated).
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention, to modify the difficulty calculation of the proposed combination to stop when the number of processing times exceeds a threshold value, as taught by Ivanovic. The motivation to do so would have been to reduce computational complexity by computing the difficulty only when necessary (Ivanovic, Page 4, Paragraph 2; see also Ivanovic, Page 4, Algorithm 1).
Regarding claim 9, the rejection of claim 3 is incorporated, and further, the proposed combination teaches the one or more processors are further configured to execute the software instructions to repeatedly execute a calculation process of the difficulty to be set to the target system (Fabisch, Page 372, Section 3.3, Paragraph 2, Lines 1-3, “Since the PUBSVE needs to be updated once new samples arrive, an incremental training procedure is desirable”).
The proposed combination does not explicitly teach repeatedly executing a calculation process until the number of processing times exceeds a threshold value.
Ivanovic teaches repeatedly executing a calculation process until the number of processing times exceeds a threshold value (Ivanovic, Page 3, Section IV, Paragraph 4, Lines 3-5, “The horizon T used in the BRS computation controls how much the task difficulty increases between stages of the curriculum”; Ivanovic, Page 4, Algorithm 1, Line 7, “starts_set ← EXPANDBACKWARDS(starts, M, T)”; Because “starts_set” is only calculated for each iteration of the for loop on line 6, a person of ordinary skill in the art would recognize that once “i” exceeds the termination condition, the difficulty will not be calculated).
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention, to modify the difficulty calculation of the proposed combination to stop when the number of processing times exceeds a threshold value, as taught by Ivanovic. The motivation to do so would have been to reduce computational complexity by computing the difficulty only when necessary (Ivanovic, Page 4, Paragraph 2; see also Ivanovic, Page 4, Algorithm 1).
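The iteration-bounded repetition mapped onto Ivanovic’s Algorithm 1 can be sketched as follows. The stage count and the difficulty-update callable are hypothetical placeholders for illustration, not code from the reference.

```python
def run_curriculum(num_stages, compute_difficulty):
    """Repeat the difficulty calculation once per curriculum stage;
    once the stage counter exceeds the threshold (the loop bound),
    no further difficulty is computed (cf. Ivanovic, Algorithm 1)."""
    difficulties = []
    for stage in range(num_stages):  # loop terminates at the threshold
        difficulties.append(compute_difficulty(stage))
    return difficulties

# Hypothetical schedule in which difficulty grows with the stage index.
schedule = run_curriculum(3, lambda stage: stage * 0.5)
```

Bounding the loop by a fixed stage count is what guarantees the difficulty calculation is not executed after the threshold is reached.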
Response to Arguments
Applicant’s arguments regarding the 35 U.S.C. 101 rejections of the claims have been fully considered but are unpersuasive.
Applicant first argues, on pages 12-13, paragraph 2 of the response, that the claims are patent eligible. Examiner respectfully disagrees. Applicant specifically points to amended claim 1, “determine control to be applied to the target system and difficulty to be set to the target system using observation information which is obtainable from the individual devices of the robot and numerical values of elements for determining difficulty of the control, according to the policy; calculate learning progress of the policy using a plurality of original evaluations of states before and after transition of the target system and the determined control, according to the control and the difficulty to be set to the target system; calculate a moving average or a moving standard deviation of a plurality of accumulated non-adjusted rewards as the learning progress; calculate revised evaluation using an original evaluation, the determined difficulty, and the learning progress; and update the policy using the observation information, the control, the difficulty to be set to the target system, and the revised evaluation.” Applicant makes no specific claim as to how these limitations integrate the judicial exception into a practical application. Each of these limitations is analyzed in detail in the updated 35 U.S.C. 101 rejection seen above, and none was determined to integrate the judicial exception into a practical application.
Applicant next argues, on page 13 of the response, that amended claim 1 provides an inventive concept and does not simply append well-understood, routine, or conventional activities. Applicant makes no specific claim as to which limitations, or how any limitations individually or in combination, amount to significantly more than the abstract idea. Each limitation of claim 1 is analyzed in detail in the updated 35 U.S.C. 101 rejection seen above, and none was determined to amount to significantly more than the abstract idea.
Applicant's arguments regarding the remainder of the claims rely upon the arguments asserted with respect to the independent claims, and are thus unpersuasive.
Applicant’s arguments regarding the 35 U.S.C. 102 rejections of the claims have been fully considered but are unpersuasive.
Applicant’s arguments with respect to claim 1 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Applicant's arguments regarding the remainder of the claims rely upon the arguments asserted with respect to the independent claims, and are thus unpersuasive.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MOLLY CLARKE SIPPEL whose telephone number is (571)272-3270. The examiner can normally be reached Monday - Friday, 7:30 a.m. - 4:30 p.m. ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki can be reached at (571)272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/M.C.S./ Examiner, Art Unit 2122
/KAKALI CHAKI/ Supervisory Patent Examiner, Art Unit 2122