Prosecution Insights
Last updated: April 19, 2026
Application No. 18/079,123

PROVIDING TRAINED REINFORCEMENT LEARNING SYSTEMS

Non-Final OA: §101, §102, §103, §112

Filed: Dec 12, 2022
Examiner: SIPPEL, MOLLY CLARKE
Art Unit: 2122
Tech Center: 2100 — Computer Architecture & Software
Assignee: Massachusetts Institute of Technology
OA Round: 1 (Non-Final)

Grant Probability: 50% (Moderate)
Estimated OA Rounds: 1-2
Estimated Time to Grant: 3y 7m
Grant Probability with Interview: 99%

Examiner Intelligence

Career Allow Rate: 50% (grants 50% of resolved cases; 7 granted / 14 resolved; -5.0% vs TC avg)
Interview Lift: +58.3% (strong; allowance rate for resolved cases with an interview vs. without)
Typical Timeline: 3y 7m avg prosecution; 25 applications currently pending
Career History: 39 total applications across all art units

Statute-Specific Performance

§101: 33.8% (-6.2% vs TC avg)
§103: 32.0% (-8.0% vs TC avg)
§102: 9.8% (-30.2% vs TC avg)
§112: 23.6% (-16.4% vs TC avg)

Tech Center averages are estimates; figures based on career data from 14 resolved cases.
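If the "vs TC avg" deltas are percentage-point differences (an assumption; the page does not say whether they are points or ratios), the implied Tech Center average can be recovered by subtraction and comes out to 40.0% for every statute. A quick check in Python:

# Assumes point deltas (rate - TC avg), not relative percentages.
rates  = {"§101": 33.8, "§103": 32.0, "§102": 9.8, "§112": 23.6}
deltas = {"§101": -6.2, "§103": -8.0, "§102": -30.2, "§112": -16.4}
tc_avg = {k: round(rates[k] - deltas[k], 1) for k in rates}
print(tc_avg)  # {'§101': 40.0, '§103': 40.0, '§102': 40.0, '§112': 40.0}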

Office Action

Rejections asserted: §101, §102, §103, §112
DETAILED ACTION

This action is responsive to the application filed on 12/12/2022. Claims 1-20 are pending in the case. Claims 1, 8, and 15 are independent claims.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement

The information disclosure statement filed 12/12/2022 fails to comply with 37 CFR 1.98(a)(2), which requires a legible copy of each cited foreign patent document; each non-patent literature publication or that portion which caused it to be listed; and all other information or that portion which caused it to be listed. It has been placed in the application file, but the information referred to therein has not been considered. The foreign patent documents 109242099, 110351561, and 113893539 have been stricken through and not considered because the kind codes provided in the citations do not match the kind codes of the documents submitted. The non-patent literature document titled “Policy gradient methods for the noisy linear quadratic regulator over a finite horizon” has been stricken through and not considered because the date provided in the citation does not match the date of the document submitted. All other references are being considered by the examiner.

The information disclosure statement (IDS) submitted on 09/16/2024 is being considered by the examiner.

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b):

(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 1-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.

Regarding claim 1, the claim recites “the RL model” in lines 3-4. There is insufficient antecedent basis for this limitation in the claim. The claim also recites “a reinforcement learning (RL) system” in line 1. It is unclear if applicant is attempting to refer to the previously recited claim element or if applicant is attempting to recite a new claim element. For examination purposes, this limitation has been interpreted to mean “a RL model”, reciting a new claim element. Examiner notes that “the RL model” is repeated throughout the remainder of the claims and, if the first instance is amended as indicated by Examiner, the remainder would have antecedent basis.

Regarding claim 2, the claim recites “the logarithmic loss function” on lines 4-5. The claim also recites “a logarithmic loss function for the RL model” on lines 2-3, and the parent claim recites “a logarithmic loss function for the RL model” on lines 5-6. It is unclear which of the claim elements applicant is attempting to refer to, or if the previously recited claim elements are intended to be the same claim element.
For examination purposes, both recitations of “a logarithmic loss function for the RL model” are considered to be the same claim element, to which “the logarithmic loss function” on lines 4-5 refers.

Regarding claim 3, the claim recites “the initiation point” on lines 4-5. The claim also recites “an initiation point for the RL model” on lines 3-4, and the parent claim recites “an initiation point for the RL model” on line 6. It is unclear which of the claim elements applicant is attempting to refer to, or if the previously recited claim elements are intended to be the same claim element. For examination purposes, both recitations of “an initiation point for the RL model” are considered to be the same claim element, to which “the initiation point” on lines 4-5 refers.

Regarding claim 4, the claim recites “the initiation point” on lines 2-3. Claim 3 recites “an initiation point for the RL model” on lines 3-4, and claim 1 recites “an initiation point for the RL model” on line 6. It is unclear which of the claim elements applicant is attempting to refer to, or if the previously recited claim elements are intended to be the same claim element. For examination purposes, all the recitations of “an initiation point for the RL model” are considered to be the same claim element, to which “the initiation point” refers.

Regarding claim 5, the claim recites “the initiation point” on lines 2-3. Claim 3 recites “an initiation point for the RL model” on lines 3-4, and claim 1 recites “an initiation point for the RL model” on line 6. It is unclear which of the claim elements applicant is attempting to refer to, or if the previously recited claim elements are intended to be the same claim element. For examination purposes, all the recitations of “an initiation point for the RL model” are considered to be the same claim element, to which “the initiation point” refers.

Regarding claim 6, the claim recites “the logarithmic loss function” on line 8. The claim also recites “a logarithmic loss function for the RL model” on lines 4-5, and the parent claim recites “a logarithmic loss function for the RL model” on lines 5-6. It is unclear which of the claim elements applicant is attempting to refer to, or if the previously recited claim elements are intended to be the same claim element. For examination purposes, both recitations of “a logarithmic loss function for the RL model” are considered to be the same claim element, to which “the logarithmic loss function” on line 8 refers. Further, the claim recites “the initiation point” on lines 7-8. The claim also recites “an initiation point for the RL model” on line 5, and the parent claim recites “an initiation point for the RL model” on line 6. It is unclear which of the claim elements applicant is attempting to refer to, or if the previously recited claim elements are intended to be the same claim element. For examination purposes, all the recitations of “an initiation point for the RL model” are considered to be the same claim element, to which “the initiation point” refers.

Regarding claim 7, the claim recites “the initiation point” on line 2. Claim 6 recites “an initiation point for the RL model” on line 5, and claim 1 recites “an initiation point for the RL model” on line 6. It is unclear which of the claim elements applicant is attempting to refer to, or if the previously recited claim elements are intended to be the same claim element.
For examination purposes, all the recitations of “an initiation point for the RL model” are considered to be the same claim element, to which “the initiation point” refers.

Regarding claim 8, the claim recites “the RL model” on line 5. There is insufficient antecedent basis for this limitation in the claim. The claim also recites “trained reinforcement learning systems” in line 1. It is unclear if applicant is attempting to refer to the previously recited claim element or if applicant is attempting to recite a new claim element. For examination purposes, this limitation has been interpreted to mean “a RL model”, reciting a new claim element. Examiner notes that “the RL model” is repeated throughout the remainder of the claims and, if the first instance is amended as indicated by Examiner, the remainder would have antecedent basis.

Regarding claim 9, the claim recites “the logarithmic loss function” on line 4. The claim also recites “a logarithmic loss function for the RL model” on line 3, and the parent claim recites “a logarithmic loss function for the RL model” on line 6. It is unclear which of the claim elements applicant is attempting to refer to, or if the previously recited claim elements are intended to be the same claim element. For examination purposes, both recitations of “a logarithmic loss function for the RL model” are considered to be the same claim element, to which “the logarithmic loss function” on line 4 refers.

Regarding claim 10, the claim recites “the initiation point” on line 5. The claim also recites “an initiation point for the RL model” on line 3, and the parent claim recites “an initiation point for the RL model” on lines 6-7. It is unclear which of the claim elements applicant is attempting to refer to, or if the previously recited claim elements are intended to be the same claim element. For examination purposes, both recitations of “an initiation point for the RL model” are considered to be the same claim element, to which “the initiation point” on line 5 refers.

Regarding claim 11, the claim recites “the initiation point” on lines 1-2. Claim 10 recites “an initiation point for the RL model” on line 3, and claim 8 recites “an initiation point for the RL model” on lines 6-7. It is unclear which of the claim elements applicant is attempting to refer to, or if the previously recited claim elements are intended to be the same claim element. For examination purposes, all the recitations of “an initiation point for the RL model” are considered to be the same claim element, to which “the initiation point” refers.

Regarding claim 12, the claim recites “the initiation point” on lines 1-2. Claim 10 recites “an initiation point for the RL model” on line 3, and claim 8 recites “an initiation point for the RL model” on lines 6-7. It is unclear which of the claim elements applicant is attempting to refer to, or if the previously recited claim elements are intended to be the same claim element. For examination purposes, all the recitations of “an initiation point for the RL model” are considered to be the same claim element, to which “the initiation point” refers.

Regarding claim 13, the claim recites “the logarithmic loss function” on lines 6-7. The claim also recites “a logarithmic loss function for the RL model” on line 4, and the parent claim recites “a logarithmic loss function for the RL model” on line 6. It is unclear which of the claim elements applicant is attempting to refer to, or if the previously recited claim elements are intended to be the same claim element.
For examination purposes, both recitations of “a logarithmic loss function for the RL model” are considered to be the same claim element, to which “the logarithmic loss function” on lines 6-7 refers. Further, the claim recites “the initiation point” on line 5. The claim also recites “an initiation point for the RL model” on lines 4-5, and the parent claim recites “an initiation point for the RL model” on lines 6-7. It is unclear which of the claim elements applicant is attempting to refer to, or if the previously recited claim elements are intended to be the same claim element. For examination purposes, all the recitations of “an initiation point for the RL model” are considered to be the same claim element, to which “the initiation point” refers.

Regarding claim 14, the claim recites “the initiation point” on lines 1-2. Claim 13 recites “an initiation point for the RL model” on lines 4-5, and claim 8 recites “an initiation point for the RL model” on lines 6-7. It is unclear which of the claim elements applicant is attempting to refer to, or if the previously recited claim elements are intended to be the same claim element. For examination purposes, all the recitations of “an initiation point for the RL model” are considered to be the same claim element, to which “the initiation point” refers.

Regarding claim 15, the claim recites “the RL model” on line 8. There is insufficient antecedent basis for this limitation in the claim. The claim also recites “a trained reinforcement learning system” in line 1. It is unclear if applicant is attempting to refer to the previously recited claim element or if applicant is attempting to recite a new claim element. For examination purposes, this limitation has been interpreted to mean “a RL model”, reciting a new claim element. Examiner notes that “the RL model” is repeated throughout the remainder of the claims and, if the first instance is amended as indicated by Examiner, the remainder would have antecedent basis.

Regarding claim 16, the claim recites “the logarithmic loss function” on line 4. The claim also recites “a logarithmic loss function for the RL model” on line 3, and the parent claim recites “a logarithmic loss function for the RL model” on line 9. It is unclear which of the claim elements applicant is attempting to refer to, or if the previously recited claim elements are intended to be the same claim element. For examination purposes, both recitations of “a logarithmic loss function for the RL model” are considered to be the same claim element, to which “the logarithmic loss function” on line 4 refers.

Regarding claim 17, the claim recites “the initiation point” on line 5. The claim also recites “an initiation point for the RL model” on line 3, and the parent claim recites “an initiation point for the RL model” on lines 9-10. It is unclear which of the claim elements applicant is attempting to refer to, or if the previously recited claim elements are intended to be the same claim element. For examination purposes, both recitations of “an initiation point for the RL model” are considered to be the same claim element, to which “the initiation point” on line 5 refers.

Regarding claim 18, the claim recites “the initiation point” on line 1. Claim 17 recites “an initiation point for the RL model” on line 3, and claim 15 recites “an initiation point for the RL model” on lines 9-10. It is unclear which of the claim elements applicant is attempting to refer to, or if the previously recited claim elements are intended to be the same claim element.
For examination purposes, all the recitations of “an initiation point for the RL model” are considered to be the same claim element, to which “the initiation point” refers.

Regarding claim 19, the claim recites “the initiation point” on line 1. Claim 17 recites “an initiation point for the RL model” on line 3, and claim 15 recites “an initiation point for the RL model” on lines 9-10. It is unclear which of the claim elements applicant is attempting to refer to, or if the previously recited claim elements are intended to be the same claim element. For examination purposes, all the recitations of “an initiation point for the RL model” are considered to be the same claim element, to which “the initiation point” refers.

Regarding claim 20, the claim recites “the logarithmic loss function” on lines 6-7. The claim also recites “a logarithmic loss function for the RL model” on line 4, and the parent claim recites “a logarithmic loss function for the RL model” on line 9. It is unclear which of the claim elements applicant is attempting to refer to, or if the previously recited claim elements are intended to be the same claim element. For examination purposes, both recitations of “a logarithmic loss function for the RL model” are considered to be the same claim element, to which “the logarithmic loss function” on lines 6-7 refers. Further, the claim recites “the initiation point” on line 6. The claim also recites “an initiation point for the RL model” on line 4, and the parent claim recites “an initiation point for the RL model” on lines 9-10. It is unclear which of the claim elements applicant is attempting to refer to, or if the previously recited claim elements are intended to be the same claim element. For examination purposes, all the recitations of “an initiation point for the RL model” are considered to be the same claim element, to which “the initiation point” refers.

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows:

Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.

Regarding claim 1:

Step 1 - Statutory Category: Claim 1 is directed to a method, which falls within one of the four statutory categories.

Step 2A Prong 1 - Judicial Exception: Claim 1 recites, in part, “formulating, …, a decision process problem for the RL model”. This limitation, under the broadest reasonable interpretation, covers the recitation of a mathematical concept, see MPEP §2106.04(a)(2)(I). Further, the claim recites: “defining, …, at least one of a logarithmic loss function for the RL model and defining an initiation point for the RL model according to an optimized spectral norm of the RL model”. This limitation, under the broadest reasonable interpretation, covers the recitation of a mathematical concept, see MPEP §2106.04(a)(2)(I).

Step 2A Prong 2 - Integration into a Practical Application: This judicial exception is not integrated into a practical application. In particular, the claim recites that the method is “computer-implemented” and that each step is performed “by [the] one or more computer processors”.
These limitations are additional elements that amount to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely use a computer in its ordinary capacity as a tool to perform an existing process. See MPEP §2106.05(f). Further, the claim recites “a reinforcement learning (RL) system” and “the RL model”. These limitations are additional elements that generally link the use of the judicial exception to a particular technological environment or field of use. See MPEP §2106.05(h). Further, the claim recites: “training, …, the system according to the logarithmic loss function or from the initiation point”. This limitation is an additional element that amounts to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely uses a computer in its ordinary capacity as a tool to perform an existing process. See MPEP §2106.05(f). Finally, the claim recites: “providing, …, the trained RL model”. This limitation is an additional element that amounts to a post-solution step and as such is considered insignificant extra-solution activity to the judicial exception. See MPEP §2106.05(g).

Step 2B - Significantly More: The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional elements that the method is “computer-implemented”, that each step is performed “by [the] one or more computer processors”, and “training, …, the system according to the logarithmic loss function or from the initiation point” amount to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely use a computer in its ordinary capacity as a tool to perform an existing process. Elements that merely amount to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely use a computer in its ordinary capacity as a tool to perform an existing process cannot provide an inventive concept. Further, the additional elements “a reinforcement learning (RL) system” and “the RL model” generally link the use of the judicial exception to a particular technological environment or field of use. Elements that merely generally link the use of the judicial exception to a particular technological environment or field of use cannot provide an inventive concept. Finally, the additional element “providing, …, the trained RL model” amounts to insignificant extra-solution activity to the judicial exception and is directed to receiving or transmitting data over a network, which courts have recognized as well-understood, routine, and conventional when claimed in a generic manner, see MPEP §2106.05(d)(II). The claim is not patent eligible.

Regarding claim 2, the rejection of claim 1 is incorporated, and further, the claim recites: “defining, …, a logarithmic loss function for the RL model”. This limitation recites mathematical concepts in addition to those identified in the rejection of the parent claim. Thus, the claim recites a judicial exception. Further, the claim recites: “training, …, the system according to the logarithmic loss function” and “by the one or more computer processors”.
These limitations are additional elements that amount to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely use a computer in its ordinary capacity as a tool to perform an existing process. See MPEP §2106.05(f). Elements that merely amount to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely use a computer in its ordinary capacity as a tool to perform an existing process cannot provide an inventive concept. Further, the claim recites: “providing, …, the trained RL model”. This limitation is an additional element that amounts to a post-solution step, is as such considered insignificant extra-solution activity to the judicial exception, and is directed to receiving or transmitting data over a network, which courts have recognized as well-understood, routine, and conventional when claimed in a generic manner, see MPEP §2106.05(d)(II). The claim is not patent eligible.

Regarding claim 3, the rejection of claim 1 is incorporated, and further, the claim recites: “defining, …, an initiation point for the RL model according to the optimized spectral norm”. This limitation recites mathematical concepts in addition to those identified in the rejection of the parent claim. Thus, the claim recites a judicial exception. Further, the claim recites: “training, …, the RL model from the initiation point” and “by the one or more computer processors”. These limitations are additional elements that amount to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely use a computer in its ordinary capacity as a tool to perform an existing process. See MPEP §2106.05(f). Elements that merely amount to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely use a computer in its ordinary capacity as a tool to perform an existing process cannot provide an inventive concept. The claim is not patent eligible.

Regarding claim 4, the rejection of claim 3 is incorporated, and further, the claim recites: “wherein defining the initiation point comprises regulating a system spectral radius”. This limitation is a continuation of the “defining, …, an initiation point for the RL model according to the optimized spectral norm” limitation identified as an abstract idea in the rejection of the parent claim. Thus, the claim recites a judicial exception. The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible.

Regarding claim 5, the rejection of claim 4 is incorporated, and further, the claim recites: “wherein defining the initiation point comprises defining an initiation point wherein a magnitude of an absolute value of the system spectral radius is less than 1”. This limitation is a continuation of the “defining, …, an initiation point for the RL model according to the optimized spectral norm” limitation identified as an abstract idea in the rejection of the parent claim. Thus, the claim recites a judicial exception.
The claim does not include any additional elements that amount to an integration of the judicial exception into a practical application, nor to significantly more than the judicial exception. The claim is not patent eligible.

Regarding claim 6, the rejection of claim 1 is incorporated, and further, the claim recites: “formulating, …, a decision process problem for the RL model”. This limitation, under the broadest reasonable interpretation, covers the recitation of a mathematical concept, see MPEP §2106.04(a)(2)(I). Further, the claim recites: “defining, …, a logarithmic loss function for the RL model and an initiation point for the RL model according to an optimized spectral norm of the RL model”. This limitation, under the broadest reasonable interpretation, covers the recitation of a mathematical concept, see MPEP §2106.04(a)(2)(I). Further, the claim recites that each step is performed “by [the] one or more computer processors”. These limitations are additional elements that amount to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely use a computer in its ordinary capacity as a tool to perform an existing process. See MPEP §2106.05(f). Elements that merely amount to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely use a computer in its ordinary capacity as a tool to perform an existing process cannot provide an inventive concept. Further, the claim recites: “training, …, the system from the initiation point and according to the logarithmic loss function”. This limitation is an additional element that amounts to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely uses a computer in its ordinary capacity as a tool to perform an existing process. See MPEP §2106.05(f). Elements that merely amount to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely use a computer in its ordinary capacity as a tool to perform an existing process cannot provide an inventive concept. Finally, the claim recites: “providing, …, the trained RL model”. This limitation is an additional element that amounts to a post-solution step and as such is considered insignificant extra-solution activity to the judicial exception. See MPEP §2106.05(g). Further, the limitation is directed to receiving or transmitting data over a network, which courts have recognized as well-understood, routine, and conventional when claimed in a generic manner, see MPEP §2106.05(d)(II). The claim is not patent eligible.

Regarding claim 7, the rejection of claim 6 is incorporated, and further, the claim recites: “defining, …, the initiation point according to a system spectral radius magnitude of less than 1”. This limitation recites mathematical concepts in addition to those identified in the rejection of the parent claim, and thus the claim recites a judicial exception. Further, the claim recites: “by the one or more computer processors”. This limitation is an additional element that amounts to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely uses a computer in its ordinary capacity as a tool to perform an existing process.
See MPEP §2106.05(f). Elements that merely amount to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely use a computer in its ordinary capacity as a tool to perform an existing process cannot provide an inventive concept. The claim is not patent eligible.

Regarding claim 8:

Step 1 - Statutory Category: Claim 8 is directed to an article of manufacture, which falls within one of the four statutory categories.

Step 2A Prong 1 - Judicial Exception: Claim 8 recites, in part, “formulate a decision process problem for the RL model”. This limitation, under the broadest reasonable interpretation, covers the recitation of a mathematical concept, see MPEP §2106.04(a)(2)(I). Further, the claim recites: “define at least one of a logarithmic loss function for the RL model and defining an initiation point for the RL model according to an optimized spectral norm of the RL model”. This limitation, under the broadest reasonable interpretation, covers the recitation of a mathematical concept, see MPEP §2106.04(a)(2)(I).

Step 2A Prong 2 - Integration into a Practical Application: This judicial exception is not integrated into a practical application. In particular, the claim recites “a computer program product”, “one or more computer readable storage media”, “collectively stored program instructions”, and “one or more computer systems”. These limitations are additional elements that amount to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely use a computer in its ordinary capacity as a tool to perform an existing process. See MPEP §2106.05(f). Further, the claim recites “reinforcement learning systems” and “the RL model”. These limitations are additional elements that generally link the use of the judicial exception to a particular technological environment or field of use. See MPEP §2106.05(h). Further, the claim recites: “train the system according to the logarithmic loss function or from the initiation point”. This limitation is an additional element that amounts to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely uses a computer in its ordinary capacity as a tool to perform an existing process. See MPEP §2106.05(f). Finally, the claim recites: “provide the trained RL model”. This limitation is an additional element that amounts to a post-solution step and as such is considered insignificant extra-solution activity to the judicial exception. See MPEP §2106.05(g).

Step 2B - Significantly More: The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional elements “a computer program product”, “one or more computer readable storage media”, “collectively stored program instructions”, “one or more computer systems”, and “train the system according to the logarithmic loss function or from the initiation point” amount to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely use a computer in its ordinary capacity as a tool to perform an existing process.
Elements that merely amount to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely use a computer in its ordinary capacity as a tool to perform an existing process cannot provide an inventive concept. Further, the additional elements “reinforcement learning systems” and “the RL model” generally link the use of the judicial exception to a particular technological environment or field of use. Elements that merely generally link the use of the judicial exception to a particular technological environment or field of use cannot provide an inventive concept. Finally, the additional element “provide the trained RL model” amounts to insignificant extra-solution activity to the judicial exception and is directed to receiving or transmitting data over a network, which courts have recognized as well-understood, routine, and conventional when claimed in a generic manner, see MPEP §2106.05(d)(II). The claim is not patent eligible.

Regarding claim 9, the rejection of claim 8 is incorporated, and further, claim 9 is substantially similar to claim 2 and is rejected in the same manner, with the same reasoning applying.

Regarding claim 10, the rejection of claim 8 is incorporated, and further, claim 10 is substantially similar to claim 3 and is rejected in the same manner, with the same reasoning applying.

Regarding claim 11, the rejection of claim 10 is incorporated, and further, claim 11 is substantially similar to claim 4 and is rejected in the same manner, with the same reasoning applying.

Regarding claim 12, the rejection of claim 11 is incorporated, and further, claim 12 is substantially similar to claim 5 and is rejected in the same manner, with the same reasoning applying.

Regarding claim 13, the rejection of claim 8 is incorporated, and further, claim 13 is substantially similar to claim 6 and is rejected in the same manner, with the same reasoning applying.

Regarding claim 14, the rejection of claim 13 is incorporated, and further, claim 14 is substantially similar to claim 7 and is rejected in the same manner, with the same reasoning applying.

Regarding claim 15:

Step 1 - Statutory Category: Claim 15 is directed to a system, which falls within one of the four statutory categories.

Step 2A Prong 1 - Judicial Exception: Claim 15 recites, in part, “formulate a decision process problem for the RL model”. This limitation, under the broadest reasonable interpretation, covers the recitation of a mathematical concept, see MPEP §2106.04(a)(2)(I). Further, the claim recites: “define at least one of a logarithmic loss function for the RL model and defining an initiation point for the RL model according to an optimized spectral norm of the RL model”. This limitation, under the broadest reasonable interpretation, covers the recitation of a mathematical concept, see MPEP §2106.04(a)(2)(I).

Step 2A Prong 2 - Integration into a Practical Application: This judicial exception is not integrated into a practical application. In particular, the claim recites “a computer system”, “one or more computer processors”, “one or more computer readable storage devices”, and “stored program instructions on the one or more computer readable storage devices”.
These limitations are additional elements that amount to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely use a computer in its ordinary capacity as a tool to perform an existing process. See MPEP §2106.05(f). Further, the claim recites “a trained reinforcement learning system” and “the RL model”. These limitations are additional elements that generally link the use of the judicial exception to a particular technological environment or field of use. See MPEP §2106.05(h). Further, the claim recites: “train the system according to the logarithmic loss function or from the initiation point”. This limitation is an additional element that amounts to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely uses a computer in its ordinary capacity as a tool to perform an existing process. See MPEP §2106.05(f). Finally, the claim recites: “provide the trained RL model”. This limitation is an additional element that amounts to a post-solution step and as such is considered insignificant extra-solution activity to the judicial exception. See MPEP §2106.05(g).

Step 2B - Significantly More: The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional elements “a computer system”, “one or more computer processors”, “one or more computer readable storage devices”, “stored program instructions on the one or more computer readable storage devices”, and “train the system according to the logarithmic loss function or from the initiation point” amount to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely use a computer in its ordinary capacity as a tool to perform an existing process. Elements that merely amount to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely use a computer in its ordinary capacity as a tool to perform an existing process cannot provide an inventive concept. Further, the additional elements “a trained reinforcement learning system” and “the RL model” generally link the use of the judicial exception to a particular technological environment or field of use. Elements that merely generally link the use of the judicial exception to a particular technological environment or field of use cannot provide an inventive concept. Finally, the additional element “provide the trained RL model” amounts to insignificant extra-solution activity to the judicial exception and is directed to receiving or transmitting data over a network, which courts have recognized as well-understood, routine, and conventional when claimed in a generic manner, see MPEP §2106.05(d)(II). The claim is not patent eligible.

Regarding claim 16, the rejection of claim 15 is incorporated, and further, claim 16 is substantially similar to claims 2 and 9 and is rejected in the same manner, with the same reasoning applying.
Regarding claim 17, the rejection of claim 15 is incorporated, and further, claim 17 is substantially similar to claims 3 and 10 and is rejected in the same manner, with the same reasoning applying.

Regarding claim 18, the rejection of claim 17 is incorporated, and further, claim 18 is substantially similar to claims 4 and 11 and is rejected in the same manner, with the same reasoning applying.

Regarding claim 19, the rejection of claim 18 is incorporated, and further, claim 19 is substantially similar to claims 5 and 12 and is rejected in the same manner, with the same reasoning applying.

Regarding claim 20, the rejection of claim 15 is incorporated, and further, claim 20 is substantially similar to claims 6 and 13 and is rejected in the same manner, with the same reasoning applying.

Claim Rejections - 35 USC § 102

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless – (a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1-2, 8-9, and 15-16 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Kim et al., Goal-Aware Cross-Entropy for Multi-Target Reinforcement Learning, 10/26/2021, https://arxiv.org/pdf/2110.12985, hereinafter referred to as “Kim”.

Regarding claim 1, Kim teaches A … method for training a reinforcement learning (RL) system (Kim, Page 1, Abstract, Lines 4-6, “we propose goal-aware cross-entropy (GACE) loss, that can be utilized in a self-supervised way using auto-labeled goal states alongside reinforcement learning”), the method comprising: formulating, …, a decision process problem for the RL model (Kim, Page 3, Section 3.1, Lines 1-4, “Reinforcement learning (RL) from Sutton and Barto [41] aims to maximize cumulative rewards by trial-and-error in a Markov Decision Process (MDP). An MDP is defined by a tuple (S, A, R, P, γ), where S is the set of states, A is the set of actions, R : S × A → R is the reward function, P : S × A × S → R is the transition probability distribution, and γ ∈ (0, 1] is the discount factor”); defining, …, a logarithmic loss function for the RL model (Kim, Page 4, Section 3.3, Lines 5-9, and Equations 3-5, “In our visual navigation experiments, we use asynchronous advantage actor-critic (A3C) [28] as the main algorithm, where the loss $L_{RL}$ is defined as the following: $L_p = \nabla \log \pi(a_t \mid s_t, I)\,(R_t - V(s_t, I)) + \beta \nabla H(\pi(a_t \mid s_t, I))$ (3), $L_v = (R_t - V(s_t, I))^2$ (4), $L_{RL} := L_{A3C} = L_p + 0.5 \cdot L_v$ (5), where $L_p$ and $L_v$ respectively denote policy and value loss, $R_t$ denotes the sum of decayed rewards from time steps $t$ to $T$, and $H$ and $\beta$ denote the entropy term and its coefficient respectively”; Kim, Page 4, Section 3.3, Lines 15-17, “we propose Goal-Aware Cross-Entropy (GACE) loss as our contribution, which trains the goal-discriminator that facilitates semantic understanding of goals alongside the policy in Figure 1a”; Kim, Page 4, Section 3.3, Equation 8, “$L_{GACE} = -\sum_{i=0}^{M-1} \mathrm{onehot}(z_i) \cdot \log g_{\mathrm{goal},i}$”); training, …, the system according to the logarithmic loss function (Kim, Page 4, Section 3.3, Lines 24-25, “We complete the training procedure by optimizing the overall loss $L_{total}$ as the weighted sum of the two losses in Eq. 9”; Kim, Page 5, Equation 9, “$L_{total} = L_{RL} + \eta L_{GACE}$”); and providing, …, the trained RL model (Kim, Page 10, Lines 1-6, “To ascertain that an agent trained with GACE and GACE&GDAN indeed becomes goal-aware, we use saliency maps [15] to visualize the operation of three agents within the V2 unseen task, as shown in Figure 5”; Kim, Page 10, Lines 11-14, “The three agents are trained with A3C, GACE, and GACE&GDAN, respectively, for 4M updates”; see also Kim, Page 10, Figure 5; In order for the model to be used in operation and develop saliency maps, it must have been provided). Kim also teaches that the method is computer-implemented and that each of the steps is performed by [the] one or more computer processors (Kim, Page 5, Section 4, Paragraph 1, Lines 3-4, “We develop and conduct experiments on (1) visual navigation tasks based on ViZDoom [23, 18], and (2) robot arm manipulation tasks based on MuJoCo [42]”; Kim, Page 9, Table 3 and Figure 4; A person of ordinary skill in the art would recognize that “ViZDoom” and “MuJoCo”, as well as the results displayed in Table 3 and Figure 4, would require the use of a computer, which also provides evidence for a computer processor). It is noted that applicant uses alternative language and Kim teaches at least one of the alternatives.

Regarding claim 2, the rejection of claim 1 is incorporated, and further, Kim teaches defining, by the one or more computer processors, a logarithmic loss function for the RL model (Kim, Page 4, Section 3.3, Lines 5-9, and Equations 3-5, “In our visual navigation experiments, we use asynchronous advantage actor-critic (A3C) [28] as the main algorithm, where the loss $L_{RL}$ is defined as the following: $L_p = \nabla \log \pi(a_t \mid s_t, I)\,(R_t - V(s_t, I)) + \beta \nabla H(\pi(a_t \mid s_t, I))$ (3), $L_v = (R_t - V(s_t, I))^2$ (4), $L_{RL} := L_{A3C} = L_p + 0.5 \cdot L_v$ (5), where $L_p$ and $L_v$ respectively denote policy and value loss, $R_t$ denotes the sum of decayed rewards from time steps $t$ to $T$, and $H$ and $\beta$ denote the entropy term and its coefficient respectively”; Kim, Page 4, Section 3.3, Lines 15-17, “we propose Goal-Aware Cross-Entropy (GACE) loss as our contribution, which trains the goal-discriminator that facilitates semantic understanding of goals alongside the policy in Figure 1a”; Kim, Page 4, Section 3.3, Equation 8, “$L_{GACE} = -\sum_{i=0}^{M-1} \mathrm{onehot}(z_i) \cdot \log g_{\mathrm{goal},i}$”); training, by the one or more computer processors, the system according to the logarithmic loss function (Kim, Page 4, Section 3.3, Lines 24-25, “We complete the training procedure by optimizing the overall loss $L_{total}$ as the weighted sum of the two losses in Eq. 9”; Kim, Page 5, Equation 9, “$L_{total} = L_{RL} + \eta L_{GACE}$”); and providing, by the one or more computer processors, the trained RL model (Kim, Page 10, Lines 1-6, “To ascertain that an agent trained with GACE and GACE&GDAN indeed becomes goal-aware, we use saliency maps [15] to visualize the operation of three agents within the V2 unseen task, as shown in Figure 5”; Kim, Page 10, Lines 11-14, “The three agents are trained with A3C, GACE, and GACE&GDAN, respectively, for 4M updates”; see also Kim, Page 10, Figure 5; In order for the model to be used in operation and develop saliency maps, it must have been provided).
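To make the quoted loss equations concrete, the following is a minimal PyTorch-style sketch of the quantities the examiner maps to the claimed “logarithmic loss function”. It is illustrative only: the tensor shapes, the mean reductions, and the beta and eta defaults are assumptions for exposition, not Kim's released code.

import torch
import torch.nn.functional as F

def a3c_loss(log_prob, value, returns, entropy, beta=0.01):
    # Eqs. (3)-(5) as quoted from Kim: policy loss L_p, value loss L_v,
    # and L_RL := L_A3C = L_p + 0.5 * L_v. The mean reduction and the
    # beta default are illustrative assumptions.
    advantage = (returns - value).detach()
    policy_loss = -(log_prob * advantage).mean() - beta * entropy.mean()
    value_loss = (returns - value).pow(2).mean()
    return policy_loss + 0.5 * value_loss

def gace_loss(goal_logits, goal_labels):
    # Eq. (8): L_GACE = -sum_i onehot(z_i) . log g_goal,i
    # F.cross_entropy applies log-softmax internally, so the logarithm
    # that makes this a "logarithmic loss" lives inside this call.
    return F.cross_entropy(goal_logits, goal_labels)

def total_loss(rl_loss, goal_logits, goal_labels, eta=1.0):
    # Eq. (9): L_total = L_RL + eta * L_GACE (eta default illustrative).
    return rl_loss + eta * gace_loss(goal_logits, goal_labels)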
Regarding claim 8, Kim teaches A computer program product for providing trained reinforcement learning systems, the computer program product comprising one or more computer readable storage media and collectively stored program instructions on the one or more computer readable storage media, the stored program instructions which, when executed (Kim, Page 1, Abstract, Lines 4-6, “we propose goal-aware cross-entropy (GACE) loss, that can be utilized in a self-supervised way using auto-labeled goal states alongside reinforcement learning”; Kim, Page 5, Section 4, Paragraph 1, Lines 3-4, “We develop and conduct experiments on (1) visual navigation tasks based on ViZDoom [23, 18], and (2) robot arm manipulation tasks based on MuJoCo [42]”; Kim, Page 9, Table 3 and Figure 4; A person of ordinary skill in the art would recognize that “ViZDoom” and “MuJoCo”, as well as the results displayed in Table 3 and Figure 4, would require the use of a computer, which also provides evidence for a computer program product, computer readable storage media, and instructions), cause one or more computer systems to: formulate a decision process problem for the RL model (Kim, Page 3, Section 3.1, Lines 1-4, “Reinforcement learning (RL) from Sutton and Barto [41] aims to maximize cumulative rewards by trial-and-error in a Markov Decision Process (MDP). An MDP is defined by a tuple (S, A, R, P, γ), where S is the set of states, A is the set of actions, R : S × A → R is the reward function, P : S × A × S → R is the transition probability distribution, and γ ∈ (0, 1] is the discount factor”); define a logarithmic loss function for the RL model (Kim, Page 4, Section 3.3, Lines 5-9, and Equations 3-5, “In our visual navigation experiments, we use asynchronous advantage actor-critic (A3C) [28] as the main algorithm, where the loss $L_{RL}$ is defined as the following: $L_p = \nabla \log \pi(a_t \mid s_t, I)\,(R_t - V(s_t, I)) + \beta \nabla H(\pi(a_t \mid s_t, I))$ (3), $L_v = (R_t - V(s_t, I))^2$ (4), $L_{RL} := L_{A3C} = L_p + 0.5 \cdot L_v$ (5), where $L_p$ and $L_v$ respectively denote policy and value loss, $R_t$ denotes the sum of decayed rewards from time steps $t$ to $T$, and $H$ and $\beta$ denote the entropy term and its coefficient respectively”; Kim, Page 4, Section 3.3, Lines 15-17, “we propose Goal-Aware Cross-Entropy (GACE) loss as our contribution, which trains the goal-discriminator that facilitates semantic understanding of goals alongside the policy in Figure 1a”; Kim, Page 4, Section 3.3, Equation 8, “$L_{GACE} = -\sum_{i=0}^{M-1} \mathrm{onehot}(z_i) \cdot \log g_{\mathrm{goal},i}$”); train the system according to the logarithmic loss function or from the initiation point (Kim, Page 4, Section 3.3, Lines 24-25, “We complete the training procedure by optimizing the overall loss $L_{total}$ as the weighted sum of the two losses in Eq. 9”; Kim, Page 5, Equation 9, “$L_{total} = L_{RL} + \eta L_{GACE}$”); and provide the trained RL model (Kim, Page 10, Lines 1-6, “To ascertain that an agent trained with GACE and GACE&GDAN indeed becomes goal-aware, we use saliency maps [15] to visualize the operation of three agents within the V2 unseen task, as shown in Figure 5”; Kim, Page 10, Lines 11-14, “The three agents are trained with A3C, GACE, and GACE&GDAN, respectively, for 4M updates”; see also Kim, Page 10, Figure 5; In order for the model to be used in operation and develop saliency maps, it must have been provided). It is noted that applicant uses alternative language and Kim teaches at least one of the alternatives.
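The independent-claim mappings above all rest on Kim's textbook MDP formulation of the “decision process problem”. For readers outside RL, a minimal sketch of that tuple as a data structure follows; the field names, types, and the toy example are illustrative assumptions, not language from the application or from Kim.

from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass(frozen=True)
class MDP:
    # The (S, A, R, P, gamma) tuple quoted from Kim, Section 3.1.
    states: Sequence[int]                         # S: set of states
    actions: Sequence[int]                        # A: set of actions
    reward: Callable[[int, int], float]           # R: S x A -> reals
    transition: Callable[[int, int, int], float]  # P: S x A x S -> probability
    gamma: float                                  # discount factor in (0, 1]

# A trivial two-state chain, purely for illustration.
chain = MDP(
    states=[0, 1],
    actions=[0],
    reward=lambda s, a: float(s),
    transition=lambda s, a, s2: 1.0 if s2 == min(s + 1, 1) else 0.0,
    gamma=0.9,
)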
Regarding claim 9, the rejection of claim 8 is incorporated, and further, Kim teaches define a logarithmic loss function for the RL model (Kim, Page 4, Section 3.3, Lines 5-9, and Equations 3-5, “In our visual navigation experiments, we use asynchronous advantage actor-critic (A3C) [28] as the main algorithm, where the loss $L_{RL}$ is defined as the following: $L_p = \nabla \log \pi(a_t \mid s_t, I)\,(R_t - V(s_t, I)) + \beta \nabla H(\pi(a_t \mid s_t, I))$ (3), $L_v = (R_t - V(s_t, I))^2$ (4), $L_{RL} := L_{A3C} = L_p + 0.5 \cdot L_v$ (5), where $L_p$ and $L_v$ respectively denote policy and value loss, $R_t$ denotes the sum of decayed rewards from time steps $t$ to $T$, and $H$ and $\beta$ denote the entropy term and its coefficient respectively”; Kim, Page 4, Section 3.3, Lines 15-17, “we propose Goal-Aware Cross-Entropy (GACE) loss as our contribution, which trains the goal-discriminator that facilitates semantic understanding of goals alongside the policy in Figure 1a”; Kim, Page 4, Section 3.3, Equation 8, “$L_{GACE} = -\sum_{i=0}^{M-1} \mathrm{onehot}(z_i) \cdot \log g_{\mathrm{goal},i}$”); train the system according to the logarithmic loss function (Kim, Page 4, Section 3.3, Lines 24-25, “We complete the training procedure by optimizing the overall loss $L_{total}$ as the weighted sum of the two losses in Eq. 9”; Kim, Page 5, Equation 9, “$L_{total} = L_{RL} + \eta L_{GACE}$”); and provide the trained RL model (Kim, Page 10, Lines 1-6, “To ascertain that an agent trained with GACE and GACE&GDAN indeed becomes goal-aware, we use saliency maps [15] to visualize the operation of three agents within the V2 unseen task, as shown in Figure 5”; Kim, Page 10, Lines 11-14, “The three agents are trained with A3C, GACE, and GACE&GDAN, respectively, for 4M updates”; see also Kim, Page 10, Figure 5; In order for the model to be used in operation and develop saliency maps, it must have been provided).

Regarding claim 15, Kim teaches A computer system for providing a trained reinforcement learning system, the computer system comprising: one or more computer processors; one or more computer readable storage devices; and stored program instructions on the one or more computer readable storage devices for execution by the one or more computer processors, the stored program instructions which, when executed (Kim, Page 1, Abstract, Lines 4-6, “we propose goal-aware cross-entropy (GACE) loss, that can be utilized in a self-supervised way using auto-labeled goal states alongside reinforcement learning”; Kim, Page 5, Section 4, Paragraph 1, Lines 3-4, “We develop and conduct experiments on (1) visual navigation tasks based on ViZDoom [23, 18], and (2) robot arm manipulation tasks based on MuJoCo [42]”; Kim, Page 9, Table 3 and Figure 4; A person of ordinary skill in the art would recognize that “ViZDoom” and “MuJoCo”, as well as the results displayed in Table 3 and Figure 4, would require the use of a computer, which also provides evidence for a computer processor, computer readable storage devices, and instructions), cause the one or more computer processors to: formulate a decision process problem for the RL model (Kim, Page 3, Section 3.1, Lines 1-4, “Reinforcement learning (RL) from Sutton and Barto [41] aims to maximize cumulative rewards by trial-and-error in a Markov Decision Process (MDP).
An MDP is defined by a tuple (S, A, R, P, γ), where S is the set of states, A is the set of actions, R : S × A → R is the reward function, P : S × A × S → R is the transition probability distribution, and γ ∈ (0, 1] is the discount factor”); define a logarithmic loss function for the RL model (Kim, Page 4, Section 3.3, Lines 5-9, and Equations 3-5, “In our visual navigation experiments, we use asynchronous advantage actor-critic (A3C) [28] as the main algorithm, where the loss $L_{RL}$ is defined as the following: $L_p = \nabla \log \pi(a_t \mid s_t, I)\,(R_t - V(s_t, I)) + \beta \nabla H(\pi(a_t \mid s_t, I))$ (3), $L_v = (R_t - V(s_t, I))^2$ (4), $L_{RL} := L_{A3C} = L_p + 0.5 \cdot L_v$ (5), where $L_p$ and $L_v$ respectively denote policy and value loss, $R_t$ denotes the sum of decayed rewards from time steps $t$ to $T$, and $H$ and $\beta$ denote the entropy term and its coefficient respectively”; Kim, Page 4, Section 3.3, Lines 15-17, “we propose Goal-Aware Cross-Entropy (GACE) loss as our contribution, which trains the goal-discriminator that facilitates semantic understanding of goals alongside the policy in Figure 1a”; Kim, Page 4, Section 3.3, Equation 8, “$L_{GACE} = -\sum_{i=0}^{M-1} \mathrm{onehot}(z_i) \cdot \log g_{\mathrm{goal},i}$”); train the system according to the logarithmic loss function or from the initiation point (Kim, Page 4, Section 3.3, Lines 24-25, “We complete the training procedure by optimizing the overall loss $L_{total}$ as the weighted sum of the two losses in Eq. 9”; Kim, Page 5, Equation 9, “$L_{total} = L_{RL} + \eta L_{GACE}$”); and provide the trained RL model (Kim, Page 10, Lines 1-6, “To ascertain that an agent trained with GACE and GACE&GDAN indeed becomes goal-aware, we use saliency maps [15] to visualize the operation of three agents within the V2 unseen task, as shown in Figure 5”; Kim, Page 10, Lines 11-14, “The three agents are trained with A3C, GACE, and GACE&GDAN, respectively, for 4M updates”; see also Kim, Page 10, Figure 5; In order for the model to be used in operation and develop saliency maps, it must have been provided).

Regarding claim 16, the rejection of claim 15 is incorporated, and further, Kim teaches define a logarithmic loss function for the RL model (Kim, Page 4, Section 3.3, Lines 5-9, and Equations 3-5, “In our visual navigation experiments, we use asynchronous advantage actor-critic (A3C) [28] as the main algorithm, where the loss $L_{RL}$ is defined as the following: $L_p = \nabla \log \pi(a_t \mid s_t, I)\,(R_t - V(s_t, I)) + \beta \nabla H(\pi(a_t \mid s_t, I))$ (3), $L_v = (R_t - V(s_t, I))^2$ (4), $L_{RL} := L_{A3C} = L_p + 0.5 \cdot L_v$ (5), where $L_p$ and $L_v$ respectively denote policy and value loss, $R_t$ denotes the sum of decayed rewards from time steps $t$ to $T$, and $H$ and $\beta$ denote the entropy term and its coefficient respectively”; Kim, Page 4, Section 3.3, Lines 15-17, “we propose Goal-Aware Cross-Entropy (GACE) loss as our contribution, which trains the goal-discriminator that facilitates semantic understanding of goals alongside the policy in Figure 1a”; Kim, Page 4, Section 3.3, Equation 8, “$L_{GACE} = -\sum_{i=0}^{M-1} \mathrm{onehot}(z_i) \cdot \log g_{\mathrm{goal},i}$”); train the system according to the logarithmic loss function (Kim, Page 4, Section 3.3, Lines 24-25, “We complete the training procedure by optimizing the overall loss $L_{total}$ as the weighted sum of the two losses in Eq. 9”; Kim, Page 5, Equation 9, “$L_{total} = L_{RL} + \eta L_{GACE}$”);
and provide the trained RL model (Kim, Page 10, Lines 1-6, “To ascertain that an agent trained with GACE and GACE&GDAN indeed becomes goal-aware, we use saliency maps [15] to visualize the operation of three agents within the V2 unseen task, as shown in Figure 5”; Kim, Page 10, Lines 11-14, “The three agents are trained with A3C, GACE, and GACE&GDAN, respectively, for 4M updates”; see also Kim, Page 10, Figure 5; In order for the model to be used in operation and develop saliency maps, it must have been provided).

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 3, 6, 10, 13, 17, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Kim in view of Bjorck et al., Towards Deeper Deep Reinforcement Learning with Spectral Normalization, 01/03/2022, https://arxiv.org/pdf/2106.01151v4, hereinafter referred to as “Bjorck”.

Regarding claim 3, the rejection of claim 1 is incorporated, and further, Kim teaches performing the steps of the method by the one or more computer processors (Kim, Page 5, Section 4, Paragraph 1, Lines 3-4, “We develop and conduct experiments on (1) visual navigation tasks based on ViZDoom [23, 18], and (2) robot arm manipulation tasks based on MuJoCo [42]”; Kim, Page 9, Table 3 and Figure 4; A person of ordinary skill in the art would recognize that “ViZDoom” and “MuJoCo”, as well as the results displayed in Table 3 and Figure 4, would require the use of a computer, which also provides evidence for a computer processor). Kim does not explicitly teach defining, …, an initiation point for the RL model according to the optimized spectral norm; and training, …, the RL model from the initiation point. Bjorck teaches defining, …, an initiation point for the RL model according to the optimized spectral norm (Bjorck, Page 6, Lines 1-5, “Equation (5) suggest that the critic could be made smooth if the spectral norms of all layers are bounded. Fortunately, there is a method from the GAN literature which achieves this: spectral normalization [47]. Spectral normalization divides the weight W for each layer by its largest singular value $\sigma_{max}$ which ensures that all layers have operator norm 1”; Bjorck, Page 6, Lines 10-12, “By repeating this procedure for all layers, we ensure that the spectral norms of all layers are no larger than one. If that is the case, eq. (5) suggests that the critic should be stable in the forward pass. This would then bound the gradients being propagated into the actor as per eq. (3)”; the state of the model after spectral normalization, with bounded gradients, is considered to be the “initiation point for the RL model”); and training, …, the RL model from the initiation point (Bjorck, Page 6, Section 5.1, Lines 3-9, “Specifically, for both the actor and the critic, we apply spectral normalization to each linear layer except the first and last.
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention, to have modified the reinforcement learning method of Kim to include defining an initiation point according to an optimized spectral norm as taught by Bjorck. The motivation to do so would have been that spectral normalization enables stable training with large modern architectures, which results in performance improvements of the model (Bjorck, Page 1, Abstract, Lines 13-17, "We demonstrate that spectral normalization (SN) can mitigate this issue and enable stable training with large modern architectures. After smoothing with SN, larger models yield significant performance improvements—suggesting that more "easy" gains may be had by focusing on model architectures in addition to algorithmic innovations"). Regarding claim 6, the rejection of claim 1 is incorporated, and further, Kim teaches formulating, …, a decision process problem for the RL model (Kim, Page 3, Section 3.1, Lines 1-4, "Reinforcement learning (RL) from Sutton and Barto [41] aims to maximize cumulative rewards by trial-and-error in a Markov Decision Process (MDP). An MDP is defined by a tuple (S, A, R, P, γ), where S is the set of states, A is the set of actions, R : S × A → ℝ is the reward function, P : S × A × S → ℝ is the transition probability distribution, and γ ∈ (0,1] is the discount factor"); defining, …, a logarithmic loss function for the RL model … (Kim, Page 4, Section 3.3, Lines 5-9, and Equations 3-5, "In our visual navigation experiments, we use asynchronous advantage actor-critic (A3C) [28] as the main algorithm, where the loss L_RL is defined as the following: L_p = ∇ log π(a_t | s_t, I)(R_t − V(s_t, I)) + β ∇ H(π(a_t | s_t, I)) (3); L_v = (R_t − V(s_t, I))² (4); L_RL ≔ L_A3C = L_p + 0.5·L_v (5), where L_p and L_v respectively denote policy and value loss, R_t denotes the sum of decayed rewards from time steps t to T, and H and β denote the entropy term and its coefficient respectively"; Kim, Page 4, Section 3.3, Lines 15-17, "we propose Goal-Aware Cross-Entropy (GACE) loss as our contribution, which trains the goal-discriminator that facilitates semantic understanding of goals alongside the policy in Figure 1a"; Kim, Page 4, Section 3.3, Equation 8, "L_GACE = −Σ_{i=0}^{M−1} onehot(z_i)·log(g_{goal,i}) (8)"); training, …, the system … according to the logarithmic loss function (Kim, Page 4, Section 3.3, Lines 24-25, "We complete the training procedure by optimizing the overall loss L_total as the weighted sum of the two losses in Eq. 9"; Kim, Page 5, Equation 9, "L_total = L_RL + η L_GACE (9)");
and providing, …, the trained RL model (Kim, Page 10, Lines 1-6, "To ascertain that an agent trained with GACE and GACE&GDAN indeed becomes goal-aware, we use saliency maps [15] to visualize the operation of three agents within the V2 unseen task, as shown in Figure 5"; Kim, Page 10, Lines 11-14, "The three agents are trained with A3C, GACE, and GACE&GDAN, respectively, for 4M updates"; see also Kim, Page 10, Figure 5; in order for the model to be used in operation and to develop saliency maps, it must have been provided). Kim also teaches performing the steps of the method by the one or more computer processors (Kim, Page 5, Section 4, Paragraph 1, Lines 3-4, "We develop and conduct experiments on (1) visual navigation tasks based on ViZDoom [23, 18], and (2) robot arm manipulation tasks based on MuJoCo [42]"; Kim, Page 9, Table 3 and Figure 4; A person of ordinary skill in the art would recognize that "ViZDoom" and "MuJoCo" as well as the results displayed in Table 3 and Figure 4 would require the use of a computer, which also provides evidence for a computer processor). Kim does not explicitly teach defining… an initiation point for the RL model according to an optimized spectral norm of the RL model nor training the system from the initiation point…. Bjorck teaches defining… an initiation point for the RL model according to an optimized spectral norm of the RL model (Bjorck, Page 6, Lines 1-5, "Equation (5) suggests that the critic could be made smooth if the spectral norms of all layers are bounded. Fortunately, there is a method from the GAN literature which achieves this: spectral normalization [47]. Spectral normalization divides the weight W for each layer by its largest singular value σ_max which ensures that all layers have operator norm 1"; Bjorck, Page 6, Lines 10-12, "By repeating this procedure for all layers, we ensure that the spectral norms of all layers are no larger than one. If that is the case, eq. (5) suggests that the critic should be stable in the forward pass. This would then bound the gradients being propagated into the actor as per eq. (3)"; The state of the model after spectral normalization, with bounded gradients, is considered to be the "initiation point for the RL model") and training the system from the initiation point… (Bjorck, Page 6, Section 5.1, Lines 3-9, "Specifically, for both the actor and the critic, we apply spectral normalization to each linear layer except the first and last. Otherwise, the setup follows Section 3.1. As before, when learning crashes, we simply use the performance recorded before crashes for future time steps. Learning curves for individual environments are given in Figure 4, again over 10 seeds. We see that after smoothing with spectral normalization, performance is relatively stable across tasks, even when using a deep network with normalization and skip connections. On the other hand, without smoothing, learning is slow and sometimes fails"). It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention, to have modified the reinforcement learning method of Kim to include defining an initiation point according to an optimized spectral norm as taught by Bjorck.
The motivation to do so would have been that spectral normalization enables stable training with large modern architectures, which results in performance improvements of the model (Bjorck, Page 1, Abstract, Lines 13-17, "We demonstrate that spectral normalization (SN) can mitigate this issue and enable stable training with large modern architectures. After smoothing with SN, larger models yield significant performance improvements—suggesting that more "easy" gains may be had by focusing on model architectures in addition to algorithmic innovations"). Regarding claim 10, the rejection of claim 8 is incorporated. Kim does not explicitly teach define an initiation point for the RL model according to the optimized spectral norm; and train the RL model from the initiation point. Bjorck teaches define an initiation point for the RL model according to the optimized spectral norm (Bjorck, Page 6, Lines 1-5, "Equation (5) suggests that the critic could be made smooth if the spectral norms of all layers are bounded. Fortunately, there is a method from the GAN literature which achieves this: spectral normalization [47]. Spectral normalization divides the weight W for each layer by its largest singular value σ_max which ensures that all layers have operator norm 1"; Bjorck, Page 6, Lines 10-12, "By repeating this procedure for all layers, we ensure that the spectral norms of all layers are no larger than one. If that is the case, eq. (5) suggests that the critic should be stable in the forward pass. This would then bound the gradients being propagated into the actor as per eq. (3)"; The state of the model after spectral normalization, with bounded gradients, is considered to be the "initiation point for the RL model"); and train the RL model from the initiation point (Bjorck, Page 6, Section 5.1, Lines 3-9, "Specifically, for both the actor and the critic, we apply spectral normalization to each linear layer except the first and last. Otherwise, the setup follows Section 3.1. As before, when learning crashes, we simply use the performance recorded before crashes for future time steps. Learning curves for individual environments are given in Figure 4, again over 10 seeds. We see that after smoothing with spectral normalization, performance is relatively stable across tasks, even when using a deep network with normalization and skip connections. On the other hand, without smoothing, learning is slow and sometimes fails"). It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention, to have modified the reinforcement learning method of Kim to include defining an initiation point according to an optimized spectral norm as taught by Bjorck. The motivation to do so would have been that spectral normalization enables stable training with large modern architectures, which results in performance improvements of the model (Bjorck, Page 1, Abstract, Lines 13-17, "We demonstrate that spectral normalization (SN) can mitigate this issue and enable stable training with large modern architectures. After smoothing with SN, larger models yield significant performance improvements—suggesting that more "easy" gains may be had by focusing on model architectures in addition to algorithmic innovations").
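To make the repeatedly cited Bjorck step concrete, the following is a short NumPy sketch of spectral normalization as the quoted passage describes it: estimate each weight matrix's largest singular value σ_max by power iteration, then divide the weights by it so every layer has operator norm 1. The helper names, iteration count, and example shapes are illustrative assumptions, not Bjorck's implementation.

import numpy as np

def sigma_max(W, n_iters=30, seed=0):
    # Power iteration: alternately apply W and W.T to converge on the top
    # singular pair (u, v); u @ W @ v then approximates sigma_max.
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(W.shape[1])
    for _ in range(n_iters):
        u = W @ v
        u = u / np.linalg.norm(u)
        v = W.T @ u
        v = v / np.linalg.norm(v)
    return float(u @ W @ v)

def spectrally_normalize(layer_weights):
    # Divide each layer's weights by its largest singular value so the
    # layer's operator norm is 1, as in the passage quoted above.
    return [W / sigma_max(W) for W in layer_weights]

layers = [np.random.default_rng(i).standard_normal((64, 32)) for i in range(3)]
for W in spectrally_normalize(layers):
    print(np.linalg.svd(W, compute_uv=False)[0])  # each prints approximately 1.0

In an actual training loop the normalization is reapplied on every forward pass (deep learning frameworks ship utilities for this, such as the spectral-norm parametrization in PyTorch); the one-shot version above only illustrates the operator-norm bound that the rejection equates with the claimed initiation point.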
Regarding claim 13, the rejection of claim 8 is incorporated, and further, Kim teaches formulate a decision process problem for the RL model (Kim, Page 3, Section 3.1, Lines 1-4, "Reinforcement learning (RL) from Sutton and Barto [41] aims to maximize cumulative rewards by trial-and-error in a Markov Decision Process (MDP). An MDP is defined by a tuple (S, A, R, P, γ), where S is the set of states, A is the set of actions, R : S × A → ℝ is the reward function, P : S × A × S → ℝ is the transition probability distribution, and γ ∈ (0,1] is the discount factor"); define a logarithmic loss function for the RL model … (Kim, Page 4, Section 3.3, Lines 5-9, and Equations 3-5, "In our visual navigation experiments, we use asynchronous advantage actor-critic (A3C) [28] as the main algorithm, where the loss L_RL is defined as the following: L_p = ∇ log π(a_t | s_t, I)(R_t − V(s_t, I)) + β ∇ H(π(a_t | s_t, I)) (3); L_v = (R_t − V(s_t, I))² (4); L_RL ≔ L_A3C = L_p + 0.5·L_v (5), where L_p and L_v respectively denote policy and value loss, R_t denotes the sum of decayed rewards from time steps t to T, and H and β denote the entropy term and its coefficient respectively"; Kim, Page 4, Section 3.3, Lines 15-17, "we propose Goal-Aware Cross-Entropy (GACE) loss as our contribution, which trains the goal-discriminator that facilitates semantic understanding of goals alongside the policy in Figure 1a"; Kim, Page 4, Section 3.3, Equation 8, "L_GACE = −Σ_{i=0}^{M−1} onehot(z_i)·log(g_{goal,i}) (8)"); train the system … according to the logarithmic loss function (Kim, Page 4, Section 3.3, Lines 24-25, "We complete the training procedure by optimizing the overall loss L_total as the weighted sum of the two losses in Eq. 9"; Kim, Page 5, Equation 9, "L_total = L_RL + η L_GACE (9)"); and provide the trained RL model (Kim, Page 10, Lines 1-6, "To ascertain that an agent trained with GACE and GACE&GDAN indeed becomes goal-aware, we use saliency maps [15] to visualize the operation of three agents within the V2 unseen task, as shown in Figure 5"; Kim, Page 10, Lines 11-14, "The three agents are trained with A3C, GACE, and GACE&GDAN, respectively, for 4M updates"; see also Kim, Page 10, Figure 5; in order for the model to be used in operation and to develop saliency maps, it must have been provided). Kim does not explicitly teach defining… an initiation point for the RL model according to an optimized spectral norm of the RL model nor training the system from the initiation point…. Bjorck teaches defining… an initiation point for the RL model according to an optimized spectral norm of the RL model (Bjorck, Page 6, Lines 1-5, "Equation (5) suggests that the critic could be made smooth if the spectral norms of all layers are bounded. Fortunately, there is a method from the GAN literature which achieves this: spectral normalization [47]. Spectral normalization divides the weight W for each layer by its largest singular value σ_max which ensures that all layers have operator norm 1"; Bjorck, Page 6, Lines 10-12, "By repeating this procedure for all layers, we ensure that the spectral norms of all layers are no larger than one. If that is the case, eq. (5) suggests that the critic should be stable in the forward pass. This would then bound the gradients being propagated into the actor as per eq. (3)";
The state of the model after spectral normalization, with bounded gradients, is considered to be the "initiation point for the RL model") and training the system from the initiation point… (Bjorck, Page 6, Section 5.1, Lines 3-9, "Specifically, for both the actor and the critic, we apply spectral normalization to each linear layer except the first and last. Otherwise, the setup follows Section 3.1. As before, when learning crashes, we simply use the performance recorded before crashes for future time steps. Learning curves for individual environments are given in Figure 4, again over 10 seeds. We see that after smoothing with spectral normalization, performance is relatively stable across tasks, even when using a deep network with normalization and skip connections. On the other hand, without smoothing, learning is slow and sometimes fails"). It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention, to have modified the reinforcement learning method of Kim to include defining an initiation point according to an optimized spectral norm as taught by Bjorck. The motivation to do so would have been that spectral normalization enables stable training with large modern architectures, which results in performance improvements of the model (Bjorck, Page 1, Abstract, Lines 13-17, "We demonstrate that spectral normalization (SN) can mitigate this issue and enable stable training with large modern architectures. After smoothing with SN, larger models yield significant performance improvements—suggesting that more "easy" gains may be had by focusing on model architectures in addition to algorithmic innovations"). Regarding claim 17, the rejection of claim 15 is incorporated. Kim does not explicitly teach define an initiation point for the RL model according to the optimized spectral norm; and train the RL model from the initiation point. Bjorck teaches define an initiation point for the RL model according to the optimized spectral norm (Bjorck, Page 6, Lines 1-5, "Equation (5) suggests that the critic could be made smooth if the spectral norms of all layers are bounded. Fortunately, there is a method from the GAN literature which achieves this: spectral normalization [47]. Spectral normalization divides the weight W for each layer by its largest singular value σ_max which ensures that all layers have operator norm 1"; Bjorck, Page 6, Lines 10-12, "By repeating this procedure for all layers, we ensure that the spectral norms of all layers are no larger than one. If that is the case, eq. (5) suggests that the critic should be stable in the forward pass. This would then bound the gradients being propagated into the actor as per eq. (3)"; The state of the model after spectral normalization, with bounded gradients, is considered to be the "initiation point for the RL model"); and train the RL model from the initiation point (Bjorck, Page 6, Section 5.1, Lines 3-9, "Specifically, for both the actor and the critic, we apply spectral normalization to each linear layer except the first and last. Otherwise, the setup follows Section 3.1. As before, when learning crashes, we simply use the performance recorded before crashes for future time steps. Learning curves for individual environments are given in Figure 4, again over 10 seeds. We see that after smoothing with spectral normalization, performance is relatively stable across tasks, even when using a deep network with normalization and skip connections.
On the other hand, without smoothing, learning is slow and sometimes fails"). It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention, to have modified the reinforcement learning method of Kim to include defining an initiation point according to an optimized spectral norm as taught by Bjorck. The motivation to do so would have been that spectral normalization enables stable training with large modern architectures, which results in performance improvements of the model (Bjorck, Page 1, Abstract, Lines 13-17, "We demonstrate that spectral normalization (SN) can mitigate this issue and enable stable training with large modern architectures. After smoothing with SN, larger models yield significant performance improvements—suggesting that more "easy" gains may be had by focusing on model architectures in addition to algorithmic innovations"). Regarding claim 20, the rejection of claim 15 is incorporated, and further, Kim teaches formulate a decision process problem for the RL model (Kim, Page 3, Section 3.1, Lines 1-4, "Reinforcement learning (RL) from Sutton and Barto [41] aims to maximize cumulative rewards by trial-and-error in a Markov Decision Process (MDP). An MDP is defined by a tuple (S, A, R, P, γ), where S is the set of states, A is the set of actions, R : S × A → ℝ is the reward function, P : S × A × S → ℝ is the transition probability distribution, and γ ∈ (0,1] is the discount factor"); define a logarithmic loss function for the RL model … (Kim, Page 4, Section 3.3, Lines 5-9, and Equations 3-5, "In our visual navigation experiments, we use asynchronous advantage actor-critic (A3C) [28] as the main algorithm, where the loss L_RL is defined as the following: L_p = ∇ log π(a_t | s_t, I)(R_t − V(s_t, I)) + β ∇ H(π(a_t | s_t, I)) (3); L_v = (R_t − V(s_t, I))² (4); L_RL ≔ L_A3C = L_p + 0.5·L_v (5), where L_p and L_v respectively denote policy and value loss, R_t denotes the sum of decayed rewards from time steps t to T, and H and β denote the entropy term and its coefficient respectively"; Kim, Page 4, Section 3.3, Lines 15-17, "we propose Goal-Aware Cross-Entropy (GACE) loss as our contribution, which trains the goal-discriminator that facilitates semantic understanding of goals alongside the policy in Figure 1a"; Kim, Page 4, Section 3.3, Equation 8, "L_GACE = −Σ_{i=0}^{M−1} onehot(z_i)·log(g_{goal,i}) (8)"); train the system … according to the logarithmic loss function (Kim, Page 4, Section 3.3, Lines 24-25, "We complete the training procedure by optimizing the overall loss L_total as the weighted sum of the two losses in Eq. 9"; Kim, Page 5, Equation 9, "L_total = L_RL + η L_GACE (9)"); and provide the trained RL model (Kim, Page 10, Lines 1-6, "To ascertain that an agent trained with GACE and GACE&GDAN indeed becomes goal-aware, we use saliency maps [15] to visualize the operation of three agents within the V2 unseen task, as shown in Figure 5"; Kim, Page 10, Lines 11-14, "The three agents are trained with A3C, GACE, and GACE&GDAN, respectively, for 4M updates"; see also Kim, Page 10, Figure 5; in order for the model to be used in operation and to develop saliency maps, it must have been provided). Kim does not explicitly teach defining… an initiation point for the RL model according to an optimized spectral norm of the RL model nor training the system from the initiation point….
Bjorck teaches defining… an initiation point for the RL model according to an optimized spectral norm of the RL model (Bjorck, Page 6, Lines 1-5, "Equation (5) suggests that the critic could be made smooth if the spectral norms of all layers are bounded. Fortunately, there is a method from the GAN literature which achieves this: spectral normalization [47]. Spectral normalization divides the weight W for each layer by its largest singular value σ_max which ensures that all layers have operator norm 1"; Bjorck, Page 6, Lines 10-12, "By repeating this procedure for all layers, we ensure that the spectral norms of all layers are no larger than one. If that is the case, eq. (5) suggests that the critic should be stable in the forward pass. This would then bound the gradients being propagated into the actor as per eq. (3)"; The state of the model after spectral normalization, with bounded gradients, is considered to be the "initiation point for the RL model") and training the system from the initiation point… (Bjorck, Page 6, Section 5.1, Lines 3-9, "Specifically, for both the actor and the critic, we apply spectral normalization to each linear layer except the first and last. Otherwise, the setup follows Section 3.1. As before, when learning crashes, we simply use the performance recorded before crashes for future time steps. Learning curves for individual environments are given in Figure 4, again over 10 seeds. We see that after smoothing with spectral normalization, performance is relatively stable across tasks, even when using a deep network with normalization and skip connections. On the other hand, without smoothing, learning is slow and sometimes fails"). It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention, to have modified the reinforcement learning method of Kim to include defining an initiation point according to an optimized spectral norm as taught by Bjorck. The motivation to do so would have been that spectral normalization enables stable training with large modern architectures, which results in performance improvements of the model (Bjorck, Page 1, Abstract, Lines 13-17, "We demonstrate that spectral normalization (SN) can mitigate this issue and enable stable training with large modern architectures. After smoothing with SN, larger models yield significant performance improvements—suggesting that more "easy" gains may be had by focusing on model architectures in addition to algorithmic innovations"). Claims 4-5, 7, 11-12, 14, and 18-19 are rejected under 35 U.S.C. 103 as being unpatentable over Kim in view of Bjorck in further view of Hambly et al., "Policy gradient methods for the noisy linear quadratic regulator over a finite horizon", SIAM Journal on Control and Optimization, 59(5): 3359–3391, June 25, 2021, 49 pp., hereinafter referred to as "Hambly". Regarding claim 4, the rejection for claim 3 is incorporated. The proposed combination does not explicitly teach wherein defining the initiation point comprises regulating a system spectral radius. Hambly teaches wherein defining the initiation point comprises regulating a system spectral radius (Hambly, Page 11, Remark 3.11, Lines 4-5, "Note that for the infinite horizon problem, the spectral radius of A - BK needs to be smaller than 1 to guarantee the stability of the system (see [23])").
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention, to have modified the method of defining an initiation point as taught by the proposed combination to include regulating a system spectral radius as taught by Hambly. The motivation to do so would have been to guarantee the stability of the system (Hambly, Page 11, Remark 3.11, Lines 4-5). Regarding claim 5, the rejection of claim 4 is incorporated, and further, the proposed combination teaches wherein defining the initiation point comprises defining an initiation point wherein a magnitude of an absolute value of the system spectral radius is less than 1 (Hambly, Page 11, Remark 3.11, Lines 4-5, “Note that for the infinite horizon problem, the spectral radius of A - BK needs to be smaller than 1 to guarantee the stability of the system (see [23])”). Regarding claim 7, the rejection of claim 6 is incorporated, and further, the proposed combination teaches performing the step of the method by the one or more computer processors (Kim, Page 5, Section 4, Paragraph 1, Lines 3-4, “We develop and conduct experiments on (1) visual navigation tasks based on ViZDoom [23, 18], and (2) robot arm manipulation tasks based on MuJoCo [42]”; Kim, Page 9, Table 3 and Figure 4; A person of ordinary skill in the art would recognize that “ViZDoom” and “MuJoCo” as well as the results displayed in Table 3 and Figure 4 would require the use of a computer, which also provides evidence for a computer processor). The proposed combination does not explicitly teach defining, …, the initiation point according to a system spectral magnitude of less than 1. Hambly teaches defining, …, the initiation point according to a system spectral magnitude of less than 1 (Hambly, Page 11, Remark 3.11, Lines 4-5, “Note that for the infinite horizon problem, the spectral radius of A - BK needs to be smaller than 1 to guarantee the stability of the system (see [23])”). It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention, to have modified the method of defining an initiation point as taught by the proposed combination to include regulating a system spectral radius as taught by Hambly. The motivation to do so would have been to guarantee the stability of the system (Hambly, Page 11, Remark 3.11, Lines 4-5). Regarding claim 11, the rejection of claim 10 is incorporated. The proposed combination does not explicitly teach wherein defining the initiation point comprises regulating a system spectral radius. Hambly teaches wherein defining the initiation point comprises regulating a system spectral radius (Hambly, Page 11, Remark 3.11, Lines 4-5, “Note that for the infinite horizon problem, the spectral radius of A - BK needs to be smaller than 1 to guarantee the stability of the system (see [23])”). It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention, to have modified the method of defining an initiation point as taught by the proposed combination to include regulating a system spectral radius as taught by Hambly. The motivation to do so would have been to guarantee the stability of the system (Hambly, Page 11, Remark 3.11, Lines 4-5). 
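As a numeric illustration of the Hambly condition cited for claims 4-5, 7, and 11 (and again below for claims 12, 14, and 18-19), the short NumPy check below verifies that the spectral radius of A - BK is below 1 for an invented closed-loop system; the matrices are made up for illustration and do not come from Hambly.

import numpy as np

# Invented 2x2 system: the open-loop dynamics A are unstable (spectral
# radius 1.1), but the feedback gain K stabilizes the closed loop A - BK.
A = np.array([[1.1, 0.2],
              [0.0, 0.9]])
B = np.array([[1.0],
              [0.5]])
K = np.array([[0.4, 0.1]])

closed_loop = A - B @ K
rho = max(abs(np.linalg.eigvals(closed_loop)))
print(f"spectral radius of A - BK = {rho:.3f}; stable: {rho < 1}")
# Prints a spectral radius of roughly 0.78, i.e. a magnitude of less than 1,
# which is the condition claims 5, 12, and 19 recite for the initiation point.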
Regarding claim 12, the rejection of claim 11 is incorporated, and further, the proposed combination teaches wherein defining the initiation point comprises defining an initiation point wherein a magnitude of an absolute value of the system spectral radius is less than 1 (Hambly, Page 11, Remark 3.11, Lines 4-5, "Note that for the infinite horizon problem, the spectral radius of A - BK needs to be smaller than 1 to guarantee the stability of the system (see [23])"). Regarding claim 14, the rejection of claim 13 is incorporated. The proposed combination does not explicitly teach defining the initiation point according to a system spectral magnitude of less than 1. Hambly teaches defining the initiation point according to a system spectral magnitude of less than 1 (Hambly, Page 11, Remark 3.11, Lines 4-5, "Note that for the infinite horizon problem, the spectral radius of A - BK needs to be smaller than 1 to guarantee the stability of the system (see [23])"). It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention, to have modified the method of defining an initiation point as taught by the proposed combination to include regulating a system spectral radius as taught by Hambly. The motivation to do so would have been to guarantee the stability of the system (Hambly, Page 11, Remark 3.11, Lines 4-5). Regarding claim 18, the rejection of claim 17 is incorporated. The proposed combination does not explicitly teach wherein defining the initiation point comprises regulating a system spectral radius. Hambly teaches wherein defining the initiation point comprises regulating a system spectral radius (Hambly, Page 11, Remark 3.11, Lines 4-5, "Note that for the infinite horizon problem, the spectral radius of A - BK needs to be smaller than 1 to guarantee the stability of the system (see [23])"). It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention, to have modified the method of defining an initiation point as taught by the proposed combination to include regulating a system spectral radius as taught by Hambly. The motivation to do so would have been to guarantee the stability of the system (Hambly, Page 11, Remark 3.11, Lines 4-5). Regarding claim 19, the rejection of claim 18 is incorporated, and further, the proposed combination teaches wherein defining the initiation point comprises defining an initiation point wherein a magnitude of an absolute value of the system spectral radius is less than 1 (Hambly, Page 11, Remark 3.11, Lines 4-5, "Note that for the infinite horizon problem, the spectral radius of A - BK needs to be smaller than 1 to guarantee the stability of the system (see [23])").
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Gogianu et al., Spectral Normalisation for Deep Reinforcement Learning: An Optimisation Perspective, Proceedings of the 38th International Conference on Machine Learning, PMLR 139:3734-3744, 2021, https://proceedings.mlr.press/v139/gogianu21a.html discloses constraining the Lipschitz constant of a single layer using spectral normalization to elevate performance of a reinforcement learning agent. Any inquiry concerning this communication or earlier communications from the examiner should be directed to MOLLY CLARKE SIPPEL whose telephone number is (571)272-3270. The examiner can normally be reached Monday - Friday, 7:30 a.m. - 4:30 p.m. ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Kakali Chaki, can be reached at (571)272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/M.C.S./ Examiner, Art Unit 2122
/KAKALI CHAKI/ Supervisory Patent Examiner, Art Unit 2122

Prosecution Timeline

Dec 12, 2022
Application Filed
Nov 07, 2023
Response after Non-Final Action
Mar 03, 2026
Non-Final Rejection — §101, §102, §103, §112 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602592
NOISE COMMUNICATION FOR FEDERATED LEARNING
Granted Apr 14, 2026 (2y 5m to grant)
Patent 12596916
CONSTRAINED MASKING FOR SPARSIFICATION IN MACHINE LEARNING
Granted Apr 07, 2026 (2y 5m to grant)
Based on this examiner's 2 most recent grants.

Prosecution Projections

1-2
Expected OA Rounds
50%
Grant Probability
99%
With Interview (+58.3%)
3y 7m
Median Time to Grant
Low
PTA Risk
Based on 14 resolved cases by this examiner. Grant probability derived from career allow rate.
