DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This action is responsive to the claims filed 8/3/2023.
Claims 1-20 are presented for examination.
Drawings
The drawings are objected to as failing to comply with 37 CFR 1.84(p)(5) because they include the following reference character(s) not mentioned in the description: 1000 (Fig. 10). Corrected drawing sheets in compliance with 37 CFR 1.121(d), or amendment to the specification to add the reference character(s) in the description in compliance with 37 CFR 1.121(b) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either "Replacement Sheet" or "New Sheet" pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.
Specification
The use of the term TENSORFLOW, which is a trade name or a mark used in commerce, has been noted in this application. The term should be accompanied by the generic terminology; furthermore the term should be capitalized wherever it appears or, where appropriate, include a proper symbol indicating use in commerce such as ™, SM, or ® following the term.
Although the use of trade names and marks used in commerce (i.e., trademarks, service marks, certification marks, and collective marks) is permissible in patent applications, the proprietary nature of the marks should be respected and every effort made to prevent their use in any manner which might adversely affect their validity as commercial marks.
The disclosure is objected to because of the following informalities:
Para. [0011]: "generates by" should read "generated by" (grammatical error).
Para. [0041]: "intermediate outputs 130, 132, and 136" should read "intermediate outputs 130, 132, and 134" (incorrect reference number; 136 is a state value estimate, not an intermediate output).
Para. [0044]: "Ntime" should read "N time" (missing space; cf. Para. [0069] which correctly uses "N time").
Appropriate correction is required.
Claim Objections
Claim 7 is objected to because of the following informalities:
Claim 7 recites in part "auxiliary input to an auxiliary prediction neural network", which should read --auxiliary input to the auxiliary prediction neural network--, because claim 7 depends on claim 6, which already establishes the specific neural network being updated. Using the indefinite article "an" in claim 7 improperly introduces a new element rather than referring back to the established element. Amending "an" to "the" provides proper antecedent basis.
Appropriate correction is required.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 1-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Independent claim 1, lines 8-9 in the claim limitations, recites in part "a proper subset of the set of features representing the current observation". There is insufficient antecedent basis for “the current observation” in the claim. The claim introduces only "an observation" (singular) in the earlier limitation "receiving a set of features representing an observation." It is unclear what the scope of “the current observation” is and whether or how it relates to “an observation”. Thus, claim 1 is indefinite. For the purposes of examination said limitations are interpreted as “a proper subset of the set of features representing a current observation”.
Independent claim 1, line 15 in the claim limitations, recites in part "the observations for the sequence of time steps" in the limitation describing processing the auxiliary input using the auxiliary prediction neural network. There is insufficient antecedent basis for this limitation in the claim. The claim introduces only "an observation" (singular) in the earlier limitation "receiving a set of features representing an observation." The plural form "the observations" has not been previously recited. It is unclear what the scope of “the observations” is. Thus, claim 1 is indefinite. For the purposes of examination said limitations are interpreted as “observations for the sequence of time steps”.
Independent claim 1, lines 16-17 in the claim limitations, recites in part "a respective intermediate output generated by each auxiliary neural network" in the limitation describing processing an input using the action selection neural network. There is insufficient antecedent basis for this limitation in the claim. The claim introduces "one or more auxiliary prediction neural networks" but subsequently refers to "each auxiliary neural network," omitting the word "prediction." It is unclear whether "each auxiliary neural network" refers to the previously recited "auxiliary prediction neural networks" or to a different component. Thus, claim 1 is indefinite. For the purposes of examination said limitations are interpreted as “a respective intermediate output generated by each auxiliary prediction neural network”.
Dependent claims 2-18 do not cure the deficiencies of base claim 1 and thus claims 2-18 are also rejected under 35 U.S.C. 112(b) for at least being dependent on the rejected base claim 1.
Dependent claim 3, lines 2-3 in the claim limitations, recites in part "a time-discounted sum of the corresponding auxiliary rewards." There is insufficient antecedent basis for this limitation in the claim. Claim 1, from which claim 3 ultimately depends, introduces only "a corresponding auxiliary reward" (singular). The plural form "the corresponding auxiliary rewards" has not been previously recited in claim 3 or in any claim from which it depends. Thus, claim 3 is additionally indefinite. For the purposes of examination said limitations are interpreted as “a time-discounted sum of corresponding auxiliary rewards.”
Dependent claim 5, lines 5-6 in the claim limitations, recites in part "training each auxiliary prediction neural network based on the corresponding auxiliary rewards using reinforcement learning." There is insufficient antecedent basis for this limitation in the claim. Claim 1, from which claim 5 depends, introduces only "a corresponding auxiliary reward" (singular). The plural form "the corresponding auxiliary rewards" has not been previously recited. Thus, claim 5 is additionally indefinite. For the purposes of examination said limitations are interpreted as “training each auxiliary prediction neural network based on corresponding auxiliary rewards using reinforcement learning.”
Independent claim 19 is a system claim and claim 20 is a one or more non-transitory computer storage media claim, each corresponding to the method of independent claim 1. Claims 19-20 recite substantially the same limitations as claim 1 that lack antecedent basis and thus are rejected under 35 U.S.C. 112(b) for the same reasons set forth above with respect to claim 1.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The analysis of the claims will follow the 2019 Revised Patent Subject Matter Eligibility Guidance, 84 Fed. Reg. 50 (“2019 PEG”).
Claim 1
Step 1: This claim recites “A method…the method comprising”; therefore, it is directed to the statutory category of a process.
Step 2A Prong 1: This claim recites, inter alia:
A method performed for selecting actions to be performed by an agent to interact with an environment to perform a main task: These limitations recite a mentally performable process of using judgement with aid of pen and paper to select actions designated to be performed by an agent to interact with an environment to perform a main task.
determining an auxiliary input to the auxiliary prediction neural network, wherein the auxiliary input comprises a proper subset of the set of features representing the current (interpreted as a current per the 35 U.S.C. 112(b) rejection set forth above) observation: These limitations recite a mentally performable process of using judgement with aid of pen and paper to determine an auxiliary input intended for the auxiliary prediction neural network, wherein the auxiliary input comprises a subset of features representing a current observation.
processing the auxiliary input using the auxiliary prediction neural network, wherein: the auxiliary prediction neural network is configured to generate a state value estimate for the current state of the environment relative to a corresponding auxiliary reward that measures values of a corresponding target feature from the set of features representing the observations (interpreted as representing observations per the 35 U.S.C. 112(b) rejection set forth above) for the sequence of time steps: These limitations recite a mathematical relationship of organizing information and manipulating information, e.g. processing the auxiliary input using the auxiliary prediction neural network, through mathematical correlations, e.g. wherein: the auxiliary prediction neural network is configured to generate a state value estimate for the current state of the environment relative to a corresponding auxiliary reward that measures values of a corresponding target feature from the set of features representing the observations for the sequence of time steps.
and selecting the action to be performed by the agent at the time step: These limitations recite a mentally performable process of using judgement to select the action to be performed by the agent at the observed time step.
Thus, this claim recites a judicial exception.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The additional elements of this claim are as follows:
by one or more computers...for each of one or more auxiliary prediction neural networks: These additional elements are recited at a high level of generality and merely amount to invoking computers or other machinery as a tool to apply the underlying judicial exception corresponding to selecting actions and determining an auxiliary input. See MPEP 2106.05(f).
the method comprising, for each time step in a sequence of time steps…wherein the observation characterizes a current state of the environment at the time step: These additional elements merely represent generally linking the underlying judicial exception to a field of use or technological environment. See MPEP 2106.05(h).
receiving a set of features representing an observation: These additional elements merely recite insignificant extra-solution activity of mere data gathering, e.g. receiving a set of features representing an observation. See MPEP 2106.05(g).
processing an input comprising a respective intermediate output generated by each auxiliary neural network (interpreted as by each auxiliary prediction neural network per the 35 U.S.C. 112(b) rejection set forth above) at the time step using an action selection neural network to generate an action selection output…using the action selection output: These additional elements are recited at a high level of generality, reciting the results of processing an input comprising a respective intermediate output generated by each auxiliary prediction neural network at the time step using an action selection neural network to generate an action selection output, and of using the action selection output, but fail to provide any inventive particulars or details as to how processing the input to generate an action selection output occurs using an action selection neural network, e.g. no details as to how this differs from a generic feed-forward execution of neural networks. Thus these limitations merely amount to “apply it” or equivalent instructions to the abstract idea of selecting the action. See MPEP 2106.05(f).
Thus, the way in which the additional elements use or interact with the judicial exception does not integrate the judicial exception into a practical application when this claim is considered as a whole.
Step 2B: The additional elements from Step 2A Prong 2 include invoking computer machinery to apply the underlying judicial exception, generally linking the abstract idea to a field of use or technological environment, and insignificant extra-solution activity of data gathering recited by "receiving a set of features representing an observation", which is a well-understood, routine, and conventional activity similar to presenting offers and gathering statistics. See MPEP 2106.05(d)(II). Additionally, the additional elements include mere instructions to implement an abstract idea on a computer with “apply it” or equivalent instructions. Thus, the additional elements, viewed individually or in combination, do not provide an inventive concept or otherwise amount to significantly more than the abstract idea itself. See MPEP 2106.05.
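For illustration only, the data flow recited in claim 1 (feature subset, auxiliary prediction neural network, intermediate output, action selection) can be sketched as below. This is a minimal sketch under stated assumptions and is not a characterization of the Applicant's disclosed implementation: the names `auxiliary_network` and `select_action`, the single tanh layer, the dictionary layout of `aux_specs`, and the argmax action choice are all hypothetical stand-ins that the claim does not specify.

```python
import math

def auxiliary_network(aux_input, hidden_weights, value_weights):
    """Hypothetical one-layer auxiliary prediction network.

    Returns (intermediate output, state value estimate).
    """
    # Intermediate output: tanh of a weighted sum for each hidden unit.
    intermediate = [
        math.tanh(sum(w * x for w, x in zip(row, aux_input)))
        for row in hidden_weights
    ]
    # State value estimate: linear read-out of the intermediate output.
    state_value = sum(w * h for w, h in zip(value_weights, intermediate))
    return intermediate, state_value

def select_action(features, aux_specs, action_weights):
    """Pick an action index from the combined intermediate outputs."""
    combined = []
    for spec in aux_specs:
        # Auxiliary input: a proper subset of the observation's features.
        aux_input = [features[i] for i in spec["subset"]]
        intermediate, _ = auxiliary_network(
            aux_input, spec["hidden"], spec["value"]
        )
        combined.extend(intermediate)
    # Assumed action selection: one linear score per action, then argmax.
    scores = [
        sum(w * h for w, h in zip(row, combined)) for row in action_weights
    ]
    return max(range(len(scores)), key=lambda a: scores[a])
```

In this sketch the state value estimate is computed but not passed to the action selection step; only the intermediate output is, which mirrors the distinction the claim language draws between the two.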
Claim 2
Step 1: a process, as in claim 1.
Step 2A Prong 1: This claim recites, inter alia:
wherein for each auxiliary prediction neural network, the state value estimate for the current state of the environment relative to the corresponding auxiliary reward defines an estimate of a cumulative measure of the corresponding auxiliary reward to be received over future time steps: These limitations recite further mathematical relationships relating at least the state value estimate for the current state of the environment relative to the corresponding auxiliary reward defines an estimate of a cumulative measure.
Thus, this claim furthers the judicial exception.
Step 2A Prong 2 & Step 2B: There are no additional elements recited so this claim does not provide a practical application and is not considered to be significantly more. As such, the claim is patent ineligible.
Claim 3
Step 1: a process, as in claim 2.
Step 2A Prong 1: The claim recites, inter alia:
wherein for each auxiliary prediction neural network, the cumulative measure of the corresponding auxiliary reward comprises a time-discounted sum of the corresponding (interpreted as sum of corresponding per the 35 U.S.C. 112(b) rejection set forth above) auxiliary rewards: These limitations recite further mathematical relationships relating the cumulative measure of the corresponding auxiliary reward with a time-discounted sum of corresponding auxiliary rewards.
Thus, this claim furthers the judicial exception.
Step 2A Prong 2 & Step 2B: There are no additional elements recited so this claim does not provide a practical application and is not considered to be significantly more. As such, the claim is patent ineligible.
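For illustration only, the mathematical relationship identified in claim 3 (a time-discounted sum of auxiliary rewards) can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `discounted_return` and the discount factor `gamma` are hypothetical, and the claim does not specify a particular discount value or horizon.

```python
def discounted_return(rewards, gamma=0.9):
    """Time-discounted sum: r_0 + gamma * r_1 + gamma**2 * r_2 + ...

    `gamma` is an assumed discount factor in [0, 1].
    """
    total = 0.0
    # Work backward so each step needs only one multiplication by gamma.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total
```

For example, `discounted_return([1.0, 1.0, 1.0], gamma=0.5)` evaluates 1 + 0.5 + 0.25 = 1.75.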
Claim 4
Step 1: a process, as in claim 1.
Step 2A Prong 1: This claim recites the same judicial exception as in claim 1.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The additional elements of this claim are as follows:
further comprising: receiving a respective main task reward for each time step in the sequence of time steps: These additional elements merely recite insignificant extra-solution activity of mere data gathering. See MPEP 2106.05(g).
and training the action selection neural network based on the main task rewards using reinforcement learning: These additional elements are recited at a high level of generality, reciting the results of training the action selection neural network based on the main task rewards using reinforcement learning, but fail to provide any inventive particulars or details as to how training using reinforcement learning was accomplished and no particulars of the action selection neural network. Thus these limitations merely amount to “apply it” or equivalent instructions to the abstract idea of selecting the action. See MPEP 2106.05(f).
Thus, the way in which the additional elements use or interact with the judicial exception does not integrate the judicial exception into a practical application when this claim is considered as a whole.
Step 2B: The additional elements from Step 2A Prong 2 include insignificant extra-solution activity of data gathering recited by "further comprising: receiving a respective main task reward for each time step in the sequence of time steps", which is a well-understood, routine, and conventional activity similar to presenting offers and gathering statistics. See MPEP 2106.05(d)(II). Additionally, the additional elements include mere instructions to implement an abstract idea on a computer with “apply it” or equivalent instructions. Thus, the additional elements, viewed individually or in combination, do not provide an inventive concept or otherwise amount to significantly more than the abstract idea itself. See MPEP 2106.05.
Claim 5
Step 1: a process, as in claim 1.
Step 2A Prong 1: The claim recites, inter alia:
further comprising: for each time step in the sequence of time steps: determining, for each auxiliary prediction neural network, the auxiliary reward for the time step based on the value of the corresponding (interpreted as value of corresponding per the 35 U.S.C. 112(b) rejection set forth above) target feature at the time step: These limitations recite further mentally performable processes with the aid of pen and paper of using judgement to determine, for each auxiliary prediction neural network, the auxiliary reward for the time step based on observing the value of corresponding target feature at each time step in the sequence of time steps.
Thus, this claim furthers the judicial exception.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The additional elements of this claim are as follows:
and training each auxiliary prediction neural network based on the corresponding auxiliary rewards using reinforcement learning: These additional elements are recited at a high level of generality, reciting the results of training each auxiliary prediction neural network based on the corresponding auxiliary rewards using reinforcement learning, but fail to provide any inventive particulars or details as to how training using reinforcement learning was accomplished for each auxiliary prediction neural network based on the corresponding auxiliary rewards and no particulars of the auxiliary prediction neural network architecture/mechanism. Thus these limitations merely amount to “apply it” or equivalent instructions to the abstract idea of selecting the action. See MPEP 2106.05(f).
Thus, the way in which the additional elements use or interact with the judicial exception does not integrate the judicial exception into a practical application when this claim is considered as a whole.
Step 2B: The additional elements from Step 2A Prong 2 include mere instructions to implement an abstract idea on a computer and with “apply it” or equivalent instructions. Thus, the additional elements, viewed individually or in combination, do not provide an inventive concept or otherwise amount to significantly more than the abstract idea itself. See MPEP 2106.05.
Claim 6
Step 1: a process, as in claim 1.
Step 2A Prong 1: This claim recites, inter alia:
further comprising, at each of one or more time steps in the sequence of time steps: updating, for one or more of the auxiliary prediction neural networks, data that defines the proper subset of the set of features that are designated to be included in the auxiliary input to the auxiliary prediction neural network: These limitations recite a mentally performable process with aid of pen and paper of observing at each of one or more time steps in the sequence of time steps and using judgement to update data, designated for one or more of the auxiliary prediction neural networks, defining the proper subset of the set of features that are designated to be included in the auxiliary input to the auxiliary prediction neural network.
Thus, this claim furthers the judicial exception.
Step 2A Prong 2 & Step 2B: There are no additional elements recited so this claim does not provide a practical application and is not considered to be significantly more. As such, the claim is patent ineligible.
Claim 7
Step 1: a process, as in claim 6.
Step 2A Prong 1: This claim recites, inter alia:
wherein updating the data that defines the proper subset of the set of features that are designated to be included in the auxiliary input to an auxiliary prediction neural network comprises: determining, for each feature in the set of features, a respective first importance score characterizing an importance of the feature to predicting state values relative to the auxiliary reward: These limitations further the recited mentally performable process in claim 6 with mathematical relationships of organizing information and manipulating information, e.g. determining, for each feature in the set of features, through mathematical correlations, e.g. a respective first importance score characterizing an importance of the feature to predicting state values relative to the auxiliary reward.
and updating the data defining the proper subset of the set of features that are designated to be included in the auxiliary input to the auxiliary prediction neural network based on the first importance scores: These limitations further recite a mathematical relationship of organizing information and manipulating information, e.g. updating the data defining the proper subset of the set of features that are designated to be included in the auxiliary input to the auxiliary prediction neural network, through mathematical correlations, e.g. defining the proper subset of the set of features based on the first importance scores.
Thus, this claim furthers the judicial exception.
Step 2A Prong 2 & Step 2B: There are no additional elements recited so this claim does not provide a practical application and is not considered to be significantly more. As such, the claim is patent ineligible.
Claim 8
Step 1: a process, as in claim 7.
Step 2A Prong 1: This claim recites, inter alia:
wherein determining the respective first importance score for each feature in the set of features comprises: obtaining a state value function that is configured to process the set of features to generate a state value estimate for the current state of the environment relative to the corresponding auxiliary reward: These limitations recite mathematical calculations of generating a state value estimate for the current state of the environment relative to the corresponding auxiliary reward when obtaining a state value function included in the mathematical relationship of determining the respective first importance score for each feature in the set of features.
and determining the first importance score for each feature in the set of features using the state value function: These limitations further recite a mathematical relationship of organizing information and manipulating information, e.g. determining the first importance score for each feature in the set of features, through mathematical correlations, e.g. using the state value function.
Thus, this claim furthers the judicial exception.
Step 2A Prong 2 & Step 2B: There are no additional elements recited so this claim does not provide a practical application and is not considered to be significantly more. As such, the claim is patent ineligible.
Claim 9
Step 1: a process, as in claim 8.
Step 2A Prong 1: This claim recites, inter alia:
wherein the state value function is a linear function that comprises a respective parameter corresponding to each feature in the set of features, and wherein determining the first importance score for each feature in the set of features using the state value function comprises: determining the first importance score for each feature based on a value of the corresponding parameter of the state value function: These limitations recite a mathematical relationship of organizing information and manipulating information, e.g. determining the first importance score for each feature in the set of features using the state value function, through mathematical correlations, e.g. determining the first importance score for each feature based on a value of the corresponding parameter of the state value function, wherein the state value function is a linear function that comprises a respective parameter corresponding to each feature in the set of features.
Thus, this claim furthers the judicial exception.
Step 2A Prong 2 & Step 2B: There are no additional elements recited so this claim does not provide a practical application and is not considered to be significantly more. As such, the claim is patent ineligible.
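For illustration only, the relationship identified in claim 9 (a linear state value function with one parameter per feature, and an importance score based on each parameter's value) can be sketched as follows. This is a minimal sketch under stated assumptions: the claim recites only that the score is "based on a value of the corresponding parameter", so using the absolute value of the weight as the score, and keeping the `k` highest-scoring features as the proper subset, are hypothetical choices, as are all function names.

```python
def linear_state_value(weights, features):
    """Linear state value function: one weight per feature."""
    return sum(w * x for w, x in zip(weights, features))

def importance_scores(weights):
    """Assumed scoring rule: magnitude of each feature's weight."""
    return [abs(w) for w in weights]

def select_proper_subset(weights, k):
    """Return the indices of the k highest-scoring features.

    Requires 0 < k < len(weights) so the result is a proper subset.
    """
    if not 0 < k < len(weights):
        raise ValueError("k must give a non-empty proper subset")
    scores = importance_scores(weights)
    ranked = sorted(range(len(weights)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

For example, with weights `[0.1, -2.0, 0.5, 0.0]` and `k=2`, the sketch keeps feature indices 1 and 2, since the sign of a weight does not affect its assumed importance.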
Claim 10
Step 1: a process, as in claim 8.
Step 2A Prong 1: This claim recites the same judicial exception as in claim 8.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The additional elements of this claim are as follows:
wherein for each time step in the sequence of time steps, the state value function is trained based on the auxiliary reward for the time step using reinforcement learning: These additional elements are recited at a high level of generality, reciting the results of training the state value function for each time step in the sequence of time steps using reinforcement learning based on the auxiliary reward for the time step, but fail to provide any inventive particulars or details as to how training using reinforcement learning was accomplished for each time step in the sequence of time steps based on the auxiliary reward for the time step, e.g. whether the reward is binary for training or particularly weighted/scaled. Thus these limitations merely amount to “apply it” or equivalent instructions to the abstract idea of the mathematical calculations defined by the state value function. See MPEP 2106.05(f).
Thus, the way in which the additional elements use or interact with the judicial exception does not integrate the judicial exception into a practical application when this claim is considered as a whole.
Step 2B: The additional elements from Step 2A Prong 2 include mere instructions to implement an abstract idea on a computer and with “apply it” or equivalent instructions. Thus, the additional elements, viewed individually or in combination, do not provide an inventive concept or otherwise amount to significantly more than the abstract idea itself. See MPEP 2106.05.
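For illustration only, one generic way a linear state value function could be trained per time step with reinforcement learning is a one-step temporal-difference, TD(0), update, sketched below. This is a minimal sketch under stated assumptions: claim 10 does not recite TD(0) or any particular algorithm, and the function name `td0_update`, the step size `alpha`, and the discount factor `gamma` are hypothetical.

```python
def td0_update(weights, features, reward, next_features, alpha=0.1, gamma=0.9):
    """One TD(0) step for a linear value function (assumed algorithm).

    Returns a new weight list; the input list is left unchanged.
    """
    value = sum(w * x for w, x in zip(weights, features))
    next_value = sum(w * x for w, x in zip(weights, next_features))
    # TD error: reward plus discounted next estimate, minus current estimate.
    td_error = reward + gamma * next_value - value
    # For a linear function the gradient w.r.t. each weight is its feature.
    return [w + alpha * td_error * x for w, x in zip(weights, features)]
```

The sketch shows the level of detail (update rule, step size, bootstrapped target) that the claim language leaves unspecified.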
Claim 11
Step 1: a process, as in claim 1.
Step 2A Prong 1: This claim recites, inter alia:
further comprising, at each of one or more time steps: updating data that defines the respective target features that specify the auxiliary rewards for the auxiliary prediction neural networks: These limitations recite a mentally performable process with aid of pen and paper of observing at each of one or more time steps and using judgement to update data designated to define the respective target features that specify the auxiliary rewards for the auxiliary prediction neural networks.
Thus, this claim furthers the judicial exception.
Step 2A Prong 2 & Step 2B: There are no additional elements recited so this claim does not provide a practical application and is not considered to be significantly more. As such, the claim is patent ineligible.
Claim 12
Step 1: a process, as in claim 11.
Step 2A Prong 1: This claim recites, inter alia:
wherein updating the data that defines the respective target features that specify the auxiliary rewards for the auxiliary prediction neural networks comprises: determining, for each feature in the set of features, a respective second importance score characterizing an importance of the feature to predicting main task rewards: These limitations further the recited mentally performable process in claim 11 with mathematical relationships of organizing information and manipulating information, e.g. determining, for each feature in the set of features, through mathematical correlations, e.g. a respective second importance score characterizing an importance of the feature to predicting main task rewards.
and updating data that defines the respective target features that specify the auxiliary rewards for the auxiliary prediction neural networks based on the second importance scores: These limitations further recite a mathematical relationship of organizing information and manipulating information, e.g. updating data that defines the respective target features that specify the auxiliary rewards for the auxiliary prediction neural networks, through mathematical correlations, e.g. defining the respective target features based on the second importance scores.
Thus, this claim furthers the judicial exception.
Step 2A Prong 2 & Step 2B: There are no additional elements recited so this claim does not provide a practical application and is not considered to be significantly more. As such, the claim is patent ineligible.
Claim 13
Step 1: a process, as in claim 12.
Step 2A Prong 1: This claim recites, inter alia:
wherein determining the respective second importance score for each feature in the set of features comprises: obtaining a main task reward estimation function that is configured to process the set of features representing an observation for a time step to generate a prediction for a main task reward received at a next time step: These limitations recite mathematical calculations of determining the respective second importance score for each feature in the set of features with a mathematical relationship in which a main task reward estimation function uses the set of features representing an observation for a time step to generate a prediction for a main task reward received at a next time step.
and determining the second importance score for each feature in the set of features using the main task reward estimation function: These limitations further recite a mathematical relationship of organizing information and manipulating information, e.g. determining the second importance score for each feature in the set of features, through mathematical correlations, e.g. using the main task reward estimation function.
Thus, this claim furthers the judicial exception.
Step 2A Prong 2 & Step 2B: There are no additional elements recited so this claim does not provide a practical application and is not considered to be significantly more. As such, the claim is patent ineligible.
Claim 14
Step 1: a process, as in claim 13.
Step 2A Prong 1: This claim recites, inter alia:
wherein the main task reward estimation function is a linear function that comprises a respective parameter corresponding to each feature in the set of features, and wherein determining the second importance score for each feature in the set of features using the main task reward estimation function comprises: determining the second importance score for each feature based on a value of the corresponding parameter of the main task reward estimation function: These limitations recite a mathematical relationship of organizing information and manipulating information, e.g. determining the second importance score for each feature in the set of features using the main task reward estimation function, through mathematical correlations, e.g. determining the second importance score for each feature based on a value of the corresponding parameter of the main task reward estimation function, wherein the main task reward estimation function is a linear function that comprises a respective parameter corresponding to each feature in the set of features.
Thus, this claim furthers the judicial exception.
Step 2A Prong 2 & Step 2B: There are no additional elements recited so this claim does not provide a practical application and is not considered to be significantly more. As such, the claim is patent ineligible.
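For illustration only, the linear relationship recited in claim 14 can be sketched as follows. This is a minimal sketch under stated assumptions, not the Applicant's or any reference's implementation: the function names are hypothetical, and the use of the absolute value of a parameter as the importance score is an assumption, since the claim only requires that the score be "based on a value of the corresponding parameter".

```python
from typing import Dict, List


def linear_reward_estimate(weights: Dict[str, float],
                           features: Dict[str, float],
                           bias: float = 0.0) -> float:
    """Predict the next-step main task reward as a weighted sum of feature values."""
    return bias + sum(weights[name] * features[name] for name in weights)


def importance_scores(weights: Dict[str, float]) -> Dict[str, float]:
    """Assumption: a feature's importance is the magnitude of its parameter."""
    return {name: abs(weight) for name, weight in weights.items()}


def top_k_features(weights: Dict[str, float], k: int) -> List[str]:
    """Return the k feature names with the largest importance scores."""
    scores = importance_scores(weights)
    return sorted(scores, key=lambda name: scores[name], reverse=True)[:k]
```

Under this sketch, each score is read directly from one parameter of the linear function, which is consistent with characterizing the limitation as a mathematical relationship.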
Claim 15
Step 1: a process, as in claim 13.
Step 2A Prong 1: This claim recites the same judicial exception as in claim 13.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The additional elements of this claim are as follows:
wherein for each time step in the sequence of time steps, the main task reward estimation function is trained based on the main task reward for the time step using supervised learning: These additional elements are recited at a high level of generality. They recite the result of training the main task reward estimation function for each time step in the sequence of time steps using supervised learning based on the main task reward for the time step, but fail to provide any inventive particulars or details of how that training is accomplished in a manner that reflects the technological solution described in the Applicant’s specification. Thus, these limitations merely amount to “apply it” or equivalent instructions applied to the abstract idea of the mathematical calculations defined by the main task reward estimation function. See MPEP 2106.05(f).
Thus, the way in which the additional elements use or interact with the judicial exception does not integrate the judicial exception into a practical application when this claim is considered as a whole.
Step 2B: The additional elements from Step 2A Prong 2 amount to mere instructions to implement an abstract idea on a computer, i.e., “apply it” or equivalent instructions. Thus, the additional elements, viewed individually or in combination, do not provide an inventive concept or otherwise amount to significantly more than the abstract idea itself. See MPEP 2106.05.
Claim 16
Step 1: a process, as in claim 1.
Step 2A Prong 1: This claim recites, inter alia:
wherein: each auxiliary prediction neural network generates a respective state value estimate for the current state of the environment relative to the corresponding auxiliary reward: These limitations recite a mathematical relationship of organizing information and manipulating information, e.g. wherein each auxiliary prediction neural network generates a respective state value estimate for the current state of the environment, through mathematical correlations, e.g. a respective state value estimate for the current state of the environment relative to the corresponding auxiliary reward.
and the input to the action selection neural network further comprises the respective state value estimate generated by each auxiliary prediction neural network: These limitations recite mathematical calculations describing generating by each auxiliary prediction neural network the respective state value estimate comprising the input to the action selection neural network.
Thus, this claim furthers the judicial exception.
Step 2A Prong 2 & Step 2B: There are no additional elements recited so this claim does not provide a practical application and is not considered to be significantly more. As such, the claim is patent ineligible.
Claim 17
Step 1: a process, as in claim 1.
Step 2A Prong 1: This claim recites, inter alia:
further comprising, prior to the first time step in the sequence of time steps and for each auxiliary prediction neural network: selecting a proper subset of the set of features to be included in the auxiliary input to the auxiliary prediction neural network: These limitations recite a mathematical relationship of organizing information and manipulating information, e.g. prior to the first time step in the sequence of time steps and for each auxiliary prediction neural network, through mathematical correlations, e.g. selecting a proper subset of the set of features to be included in the auxiliary input to the auxiliary prediction neural network.
comprising: randomly sampling a proper subset of the set of features; and designating the randomly sampled proper subset of the set of features for inclusion in the auxiliary input to the auxiliary prediction neural network: These limitations recite a mathematical relationship of organizing information and manipulating information, e.g. designating the randomly sampled proper subset of the set of features for inclusion in the auxiliary input to the auxiliary prediction neural network, through mathematical correlations, e.g. randomly sampling a proper subset of the set of features.
Thus, this claim furthers the judicial exception.
Step 2A Prong 2 & Step 2B: There are no additional elements recited so this claim does not provide a practical application and is not considered to be significantly more. As such, the claim is patent ineligible.
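For illustration only, the random sampling and designation recited in claim 17 can be sketched as follows. This is a hedged sketch, not the Applicant's implementation: the function name is hypothetical, and requiring the proper subset to be non-empty and drawing its size uniformly are assumptions, since the claim does not specify a sampling distribution.

```python
import random
from typing import List, Optional, Sequence


def sample_proper_subset(features: Sequence[str],
                         rng: Optional[random.Random] = None) -> List[str]:
    """Randomly sample a non-empty proper subset of the feature names."""
    if len(features) < 2:
        raise ValueError("need at least two features to form a non-empty proper subset")
    rng = rng or random.Random()
    # Assumption: the subset size is drawn uniformly from 1 .. len(features) - 1,
    # so at least one feature is always left out.
    size = rng.randint(1, len(features) - 1)
    # random.Random.sample draws without replacement, so no feature repeats.
    return rng.sample(list(features), size)
```

The returned list is the designated auxiliary input for one auxiliary prediction neural network; calling the function once per network before the first time step yields one subset per network.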
Claim 18
Step 1: a process, as in claim 1.
Step 2A Prong 1: This claim recites, inter alia:
further comprising, prior to the first time step in the sequence of time steps and for each auxiliary prediction neural network: randomly sampling a feature from the set of features; and designating the randomly sampled feature as the target feature corresponding to the auxiliary prediction neural network: These limitations recite a mathematical relationship of organizing information and manipulating information, e.g. prior to the first time step in the sequence of time steps and for each auxiliary prediction neural network and designating the randomly sampled feature as the target feature corresponding to the auxiliary prediction neural network, through mathematical correlations, e.g. randomly sampling a feature from the set of features.
Thus, this claim furthers the judicial exception.
Step 2A Prong 2 & Step 2B: There are no additional elements recited so this claim does not provide a practical application and is not considered to be significantly more. As such, the claim is patent ineligible.
Claim 19
Step 1: This claim is directed to “A system comprising”; therefore, this claim is directed to the statutory category of machines.
Step 2A Prong 1: This claim recites the same judicial exception as in claim 1.
Step 2A Prong 2: The judicial exception recited in this claim is not integrated into a practical application. The only substantive difference between claim 19 and claim 1 is that claim 19 is directed to “A system comprising: one or more computers; and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform operations”. However, mere recitation that a judicial exception is to be performed using generic computer components in their ordinary capacity, e.g. one or more computers; and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform operations, cannot meaningfully integrate the judicial exception into a practical application. See MPEP 2106.05(f). With that exception, the analysis at this step for claim 19 mirrors that of claim 1.
Step 2B: The additional elements from Step 2A Prong 2 do not contain significantly more than the judicial exception for this claim. The only substantive difference between claim 19 and claim 1 is that claim 19 is directed to “A system comprising: one or more computers; and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform operations”. However, mere recitation that a judicial exception is to be performed using generic computer components in their ordinary capacity, e.g. one or more computers; and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform operations, cannot amount to significantly more than the judicial exception. See MPEP 2106.05(f). With that exception, the analysis at this step for claim 19 mirrors that of claim 1.
Claim 20
Step 1: This claim is directed to “one or more non-transitory computer storage media”; therefore, this claim is directed to the statutory category of manufactures.
Step 2A Prong 1: This claim recites the same judicial exception as in claim 1.
Step 2A Prong 2: The judicial exception recited in this claim is not integrated into a practical application. The only substantive difference between claim 20 and claim 1 is that claim 20 is directed to “one or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations”. However, mere recitation that a judicial exception is to be performed using generic computer components in their ordinary capacity, e.g. one or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations, cannot meaningfully integrate the judicial exception into a practical application. See MPEP 2106.05(f). With that exception, the analysis at this step for claim 20 mirrors that of claim 1.
Step 2B: The additional elements from Step 2A Prong 2 do not contain significantly more than the judicial exception for this claim. The only substantive difference between claim 20 and claim 1 is that claim 20 is directed to “one or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations”. However, mere recitation that a judicial exception is to be performed using generic computer components in their ordinary capacity, e.g. one or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations, cannot amount to significantly more than the judicial exception. See MPEP 2106.05(f). With that exception, the analysis at this step for claim 20 mirrors that of claim 1.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-15 and 17-20 are rejected under 35 U.S.C. 103 as being unpatentable over Veeriah et al. (hereinafter Veeriah) “Discovery of Useful Questions as Auxiliary Tasks” (2019) in view of Jaderberg et al. (hereinafter Jaderberg) “Reinforcement Learning with Unsupervised Auxiliary Tasks” (2016).
Jaderberg was disclosed in an IDS dated 8/9/2024.
Regarding independent claim 1, Veeriah teaches a method performed by one or more computers for selecting actions to be performed by an agent to interact with an environment to perform a main task (ABSTRACT presents a method for a reinforcement learning (RL) agent to discover questions formulated as general value functions (GVFs)...as an auxiliary task, induces useful representations for the main task (to interact with an environment to perform a main task) faced by the RL agent (to be performed by an agent) which is necessarily performed by one or more computers (a method performed by one or more computers), Section 2.4 explains that a main-task network, given a state, estimates a policy to determine the actions (for selecting actions)), the method comprising, for each time step in a sequence of time steps (Section 2.2 describes an update procedure that modifies parameters "on each step t", establishing a sequence of time steps);
receiving a set of features representing an observation, wherein the observation characterizes a current state of the environment at the time step; (Section 2.1 teaches that the first network takes the last i observations, associated with time steps including step t, as inputs to characterize the current state of the environment);
for each of one or more auxiliary prediction neural networks: processing the auxiliary input using the auxiliary prediction neural network (Section 2.4 describes an "answer network" that approximates the GVF answers (the method comprising, for each of one or more auxiliary prediction neural networks: processing the auxiliary input using the auxiliary prediction neural network)), wherein: the auxiliary prediction neural network is configured to generate a state value estimate for the current state of the environment relative to a corresponding auxiliary reward that measures values of a corresponding target feature from the set of features representing the observations (interpreted as representing observations per the 35 U.S.C. 112(b) rejection set forth above) for the sequence of time steps (Section 2.1 explains that "A GVF-question is specified by a cumulant function, a discount function and a policy" and the answer network "parameterises... GVF-predictions for a number of discovered cumulants and discounts", where the cumulant functions act as auxiliary rewards to generate state value estimates for the current state (wherein the auxiliary prediction neural network is configured to generate a state value estimate for the current state of the environment relative to a corresponding auxiliary reward), and further teaches that "each cumulant is a function of future observations", which corresponds to measuring values of a corresponding target feature from the set of features representing the observations for the sequence of time steps (that measures values of a corresponding target feature from the set of features representing observations for the sequence of time steps));
processing an input comprising a respective intermediate output generated by each auxiliary neural network (interpreted as by each auxiliary prediction neural network per the 35 U.S.C. 112(b) rejection set forth above) at the time step using an action selection neural network to generate an action selection output; and (Section 2.1 and Section 2.4 teach an architecture that "parameterises (directly or indirectly) a policy…for the main reinforcement learning task, together with GVF-predictions for a number of discovered cumulants and discounts" (processing an input comprising … generated by each auxiliary prediction neural network), where an encoder network outputs a state representation x_t (a respective intermediate output at the time step) that is processed by a main task network (using an action selection neural network) to estimate "both the policy…and a state value function v" (to generate an action selection output));
selecting the action to be performed by the agent at the time step using the action selection output. (Section 2.4 teaches that the main task network estimates a "policy", which defines the agent's behavior and is used to determine and select the actions to be performed by the agent at the time step (selecting the action to be performed by the agent at the time step using the action selection output)).
Veeriah does not expressly teach determining an auxiliary input to the auxiliary prediction neural network, wherein the auxiliary input comprises a proper subset of the set of features representing the current observation (interpreted as a current observation per the 35 U.S.C. 112(b) rejection set forth above).
However, Jaderberg teaches determining an auxiliary input to the auxiliary prediction neural network, wherein the auxiliary input comprises a proper subset of the set of features representing a current observation (Section 3.1 teaches defining auxiliary tasks whose objectives are based on specific subsets of the observation feature space rather than the full observation, such as "pixel control" where the agent learns a separate policy for "maximally changing the pixels in each cell of an n x n non-overlapping grid placed over the input image" and "feature control" where the agent learns to maximize "activations of units in a specific hidden layer", which corresponds to restricting the input to a specific portion or subset of the overall features).
Because Veeriah and Jaderberg both address the issue of improving reinforcement learning agent performance and representation learning through auxiliary prediction and control objectives, accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Veeriah by incorporating the teachings of partitioning the observation feature space among auxiliary tasks as suggested by Jaderberg, with a reasonable expectation of success, to restrict each auxiliary prediction neural network's input in Veeriah to the subset of features corresponding to its specific auxiliary prediction objective (e.g., specific grid cells or hidden units) to teach determining an auxiliary input to the auxiliary prediction neural network, wherein the auxiliary input comprises a proper subset of the set of features representing a current observation. This modification would have been motivated by the desire to align the auxiliary network's input directly with its specific objective, reduce the dimensionality of the input to decrease computational burden, and focus the learning process on the features most relevant to each auxiliary prediction by reducing noise from irrelevant features (Jaderberg Section 3.1).
Regarding dependent claim 2, Veeriah, in view of Jaderberg, teach the method of claim 1, wherein for each auxiliary prediction neural network, the state value estimate for the current state of the environment relative to the corresponding auxiliary reward defines an estimate of a cumulative measure of the corresponding auxiliary reward to be received over future time steps (see Veeriah Section 2.4 which explains that the answer network estimates "an expected cumulative discounted sum of cumulants", where the cumulants act as the auxiliary rewards, which corresponds to defining an estimate of a cumulative measure of the corresponding auxiliary reward to be received over future time steps).
Regarding dependent claim 3, Veeriah, in view of Jaderberg, teach the method of claim 2, wherein for each auxiliary prediction neural network, the cumulative measure of the corresponding auxiliary reward comprises a time-discounted sum of the corresponding auxiliary rewards (interpreted as a time-discounted sum of corresponding auxiliary rewards per the 35 U.S.C. 112(b) rejection set forth above; see Veeriah Section 2.4, which explicitly states that the answers estimate "an expected cumulative discounted sum of cumulants", which corresponds to a time-discounted sum of corresponding auxiliary rewards).
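For illustration only, a time-discounted sum of auxiliary rewards (cumulants) of the kind discussed for claims 2 and 3 can be sketched as follows. This is a minimal sketch assuming a single constant discount factor and a finite reward sequence; the function name is hypothetical, and the sketch is not a reproduction of the computation in Veeriah, which estimates an expectation of this quantity.

```python
from typing import Sequence


def discounted_sum(rewards: Sequence[float], discount: float) -> float:
    """Return r_0 + discount * r_1 + discount**2 * r_2 + ... for a finite sequence."""
    total = 0.0
    # Accumulate from the last reward backwards so each step applies one discount.
    for reward in reversed(rewards):
        total = reward + discount * total
    return total
```

For example, three rewards of 1.0 with a discount of 0.5 give 1.0 + 0.5 + 0.25 = 1.75.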
Regarding dependent claim 4, Veeriah, in view of Jaderberg, teach the method of claim 1, further comprising: receiving a respective main task reward for each time step in the sequence of time steps (see Veeriah Section 2.4 describing that the main-task network parameters are updated using an RL component that relies on a multi-step truncated return, which includes the main task rewards received at each time step; see also Section 3.1 describing environments where the agent receives rewards at time steps, which corresponds to receiving a respective main task reward for each time step); and training the action selection neural network based on the main task rewards using reinforcement learning (see Veeriah Section 2.4 describing updating the main-task network parameters using an RL component based on the multi-step truncated return, which corresponds to training the action selection neural network based on the main task rewards using reinforcement learning).
Regarding dependent claim 5, Veeriah, in view of Jaderberg, teach the method of claim 1, further comprising: for each time step in the sequence of time steps: determining, for each auxiliary prediction neural network, the auxiliary reward for the time step based on the value of the corresponding target feature at the time step (see Veeriah Section 2.4 describing updating the answer network parameters using a generalized temporal difference learning algorithm where the target is the discounted sum of cumulants from time t onwards, which corresponds to determining the auxiliary reward (cumulant) based on the target feature); and training each auxiliary prediction neural network based on the corresponding auxiliary rewards (interpreted as based on corresponding auxiliary rewards per the 35 U.S.C. 112(b) rejection set forth above) using reinforcement learning (see Veeriah Section 2.4 describing updating the answer network parameters using a generalized temporal difference learning algorithm, which corresponds to training the auxiliary prediction neural network based on these auxiliary rewards using reinforcement learning).
Regarding dependent claim 6, Veeriah, in view of Jaderberg, teach the method of claim 1, further comprising, at each of one or more time steps in the sequence of time steps: updating, for one or more of the auxiliary prediction neural networks, data that defines the proper subset of the set of features that are designated to be included in the auxiliary input to the auxiliary prediction neural network (see Veeriah Section 2.3 teaching applying an update to the meta-parameters of the question network that parameterizes the cumulant and discount functions defining the GVFs. Because the question network defines the target features/auxiliary rewards for the answer network per Section 2.1, and because Jaderberg Section 3.1 teaches restricting the auxiliary input to a proper subset of features corresponding to the specific auxiliary prediction objective, updating the GVF question parameters correspondingly updates the data defining which features are included in the auxiliary input for the auxiliary prediction neural network).
Regarding dependent claim 7, Veeriah, in view of Jaderberg, teach the method of claim 6, wherein updating the data that defines the proper subset of the set of features that are designated to be included in the auxiliary input to the auxiliary prediction neural network comprises: determining, for each feature in the set of features, a respective first importance score characterizing an importance of the feature to predicting state values relative to the auxiliary reward (see Veeriah Section 2.2, Section 2.3 teaching computing a meta-gradient that evaluates the sensitivity of the meta-loss with respect to the meta-parameters through the updates to the answers, which effectively computes gradient-based signals that characterize the importance of each feature to predicting the state values relative to the auxiliary reward (cumulant)); and updating the data defining the proper subset of the set of features that are designated to be included in the auxiliary input to the auxiliary prediction neural network based on the first importance scores (see Veeriah Section 2.3 teaching updating the question network parameters based on these computed gradients).
Regarding dependent claim 8, Veeriah, in view of Jaderberg, teach the method of claim 7, wherein determining the respective first importance score for each feature in the set of features comprises: obtaining a state value function that is configured to process the set of features to generate a state value estimate for the current state of the environment relative to the corresponding auxiliary reward (see Veeriah Section 2.4 teaching that the answer network approximates the GVF answers y, which act as state value functions relative to the auxiliary rewards (cumulants)); and determining the first importance score for each feature in the set of features using the state value function (see Veeriah Section 2.4 teaching the meta-gradient computation evaluates the sensitivity of these GVF value function outputs to changes in the input features to determine the gradients).
Regarding dependent claim 9, Veeriah, in view of Jaderberg, teach the method of claim 8, wherein the state value function is a linear function that comprises a respective parameter corresponding to each feature in the set of features (see Veeriah Section 2.4 explicitly stating that "functions π, v and y will be linear functions of state xt" and that "each GVF prediction…is separately parameterised…", where the parameters (weights) of the linear function directly correspond to the features), and wherein determining the first importance score for each feature in the set of features using the state value function comprises: determining the first importance score for each feature based on a value of the corresponding parameter of the state value function (see Veeriah Section 2.4 teaching the values of the parameters are used in the gradient computation to determine the importance of each feature).
Regarding dependent claim 10, Veeriah, in view of Jaderberg, teach the method of claim 8, wherein for each time step in the sequence of time steps, the state value function is trained based on the auxiliary reward for the time step using reinforcement learning (see Veeriah Section 2.4 teaching that the answer network parameters ϴy (the state value functions) are updated using a generalized temporal difference learning algorithm based on the cumulants (auxiliary rewards), which is a reinforcement learning technique).
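For illustration only, training a linear state value function from an auxiliary reward by temporal difference learning can be sketched as follows. This is a simplified one-step TD(0) update under stated assumptions, not the generalized temporal difference algorithm of Veeriah Section 2.4: the function name, the fixed discount, and the fixed step size are all hypothetical choices.

```python
from typing import List, Sequence


def td0_update(weights: Sequence[float],
               features: Sequence[float],
               next_features: Sequence[float],
               cumulant: float,
               discount: float,
               step_size: float) -> List[float]:
    """Apply one TD(0) update to the parameters of a linear value function."""
    # Linear value estimates for the current and next states.
    value = sum(w * x for w, x in zip(weights, features))
    next_value = sum(w * x for w, x in zip(weights, next_features))
    # TD error: auxiliary reward plus discounted bootstrap minus current estimate.
    td_error = cumulant + discount * next_value - value
    # For a linear function, the gradient with respect to each weight is its feature value.
    return [w + step_size * td_error * x for w, x in zip(weights, features)]
```

Because the update target bootstraps from the function's own next-state estimate, this kind of training is conventionally classed as reinforcement learning rather than supervised regression onto a fixed label.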
Regarding dependent claim 11, Veeriah, in view of Jaderberg, teach the method of claim 1, further comprising, at each of one or more time steps: updating data that defines the respective target features that specify the auxiliary rewards for the auxiliary prediction neural networks (see Veeriah Section 2.3 teaching applying an update to the meta-parameters of the question network, which parameterizes the cumulant functions that define the GVFs, corresponding to updating data that defines the target features specifying the auxiliary rewards).
Regarding dependent claim 12, Veeriah, in view of Jaderberg, teach the method of claim 11, wherein updating the data that defines the respective target features that specify the auxiliary rewards for the auxiliary prediction neural networks comprises: determining, for each feature in the set of features, a respective second importance score characterizing an importance of the feature to predicting main task rewards (see Veeriah Section 2.3 teaching computing the meta-gradient by evaluating how changes in the GVF question parameters affect the main task loss, which characterizes the importance of the features to predicting main task rewards); and updating data that defines the respective target features that specify the auxiliary rewards for the auxiliary prediction neural networks based on the second importance scores (see Veeriah Section 2.3 teaching updating the meta-parameters based on these computed gradients).
Regarding dependent claim 13, Veeriah, in view of Jaderberg, teach the method of claim 12, wherein determining the respective second importance score for each feature in the set of features comprises: obtaining a main task reward estimation function that is configured to process the set of features representing an observation for a time step to generate a prediction for a main task reward received at a next time step (see Veeriah Section 2.4 teaching that the main task network estimates a state value function v(xt) that predicts the expected future main task rewards); and determining the second importance score for each feature in the set of features using the main task reward estimation function (see Veeriah Section 2.4 teaching the meta-gradient computation evaluates how changes in the GVF questions affect the main task loss, which is defined in terms of this value function v, to determine the gradients).
Regarding dependent claim 14, Veeriah, in view of Jaderberg, teach the method of claim 13, wherein the main task reward estimation function is a linear function that comprises a respective parameter corresponding to each feature in the set of features (see Veeriah Section 2.4 explicitly stating that "functions π, v and y will be linear functions of state xt", where the parameters of the linear state value function v correspond to the features), and wherein determining the second importance score for each feature in the set of features using the main task reward estimation function comprises: determining the second importance score for each feature based on a value of the corresponding parameter of the main task reward estimation function (see Veeriah Section 2.4 teaching the values of the parameters are used in the gradient computation to determine the importance of each feature).
Regarding dependent claim 15, Veeriah, in view of Jaderberg, teach the method of claim 13, wherein for each time step in the sequence of time steps, the main task reward estimation function is trained based on the main task reward for the time step using supervised learning (see Veeriah Section 2.4 teaching that the value function v is trained by minimizing a loss between the predicted value v(xt) and the multi-step truncated return G^vt, which includes the main task rewards; this constitutes supervised learning using the observed returns as targets).
Regarding dependent claim 17, Veeriah, in view of Jaderberg, teach the method of claim 1, further comprising, prior to the first time step in the sequence of time steps and for each auxiliary prediction neural network (see Veeriah Section 3.3 teaching that the question network parameters are randomly initialized before training begins, which establishes actions occurring prior to the first time step for the auxiliary prediction networks):
selecting a proper subset of the set of features to be included in the auxiliary input to the auxiliary prediction neural network, comprising (see Veeriah Section 3.3 teaching randomly initializing the question network parameters, which define the target features/cumulants. As modified by the teachings of Jaderberg Section 3.1 to restrict the auxiliary input to a proper subset of features corresponding to the target feature, this initialization effectively selects the feature subsets):
randomly sampling a proper subset of the set of features (Because the question network parameters are randomly initialized per Veeriah Section 3.3, the resulting selection of the proper subset of features taught by Jaderberg Section 3.1 constitutes randomly sampling a proper subset of the set of features); and designating the randomly sampled proper subset of the set of features for inclusion in the auxiliary input to the auxiliary prediction neural network (The random initialization of the network parameters inherently designates this sampled subset for inclusion in the auxiliary input to the auxiliary prediction neural network for subsequent processing, as supported by the combined teachings of Veeriah Section 3.3 and Jaderberg Section 3.1).
Regarding dependent claim 18, Veeriah, in view of Jaderberg, teach the method of claim 1, further comprising, prior to the first time step in the sequence of time steps and for each auxiliary prediction neural network: randomly sampling a feature from the set of features; and designating the randomly sampled feature as the target feature corresponding to the auxiliary prediction neural network (see Veeriah Section 3.3 teaching that the question network parameters, which define the cumulants (target features), are randomly initialized before training begins, which corresponds to randomly sampling and designating the target features for the auxiliary prediction neural networks).
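For illustration only, the limitations of claims 17 and 18 as recited can be sketched as follows. This sketch follows the claim language literally and does not reproduce the implementation of Veeriah or Jaderberg: the function names, the feature count, the subset size, the number of auxiliary networks, and the seed are all assumptions chosen for the example.

```python
import random


def sample_proper_subset(num_features, subset_size, rng):
    """Randomly sample feature indices forming a non-empty proper subset."""
    if not 0 < subset_size < num_features:
        raise ValueError("a proper subset must omit at least one feature")
    return sorted(rng.sample(range(num_features), subset_size))


def sample_target_feature(num_features, rng):
    """Randomly sample one feature index to serve as the target feature."""
    return rng.randrange(num_features)


# Hypothetical setup performed once, before the first time step.
NUM_FEATURES = 8
NUM_AUX_NETWORKS = 3
rng = random.Random(0)

assignments = [
    {
        "input_features": sample_proper_subset(NUM_FEATURES, 4, rng),
        "target_feature": sample_target_feature(NUM_FEATURES, rng),
    }
    for _ in range(NUM_AUX_NETWORKS)
]
```

Each entry designates, for one auxiliary prediction neural network, the sampled feature subset for its auxiliary input and the sampled feature used as its target.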
Regarding independent claim 19, it is a system claim that is substantially the same as the method of claim 1. Thus, claim 19 is rejected for the same reason as claim 1. In addition, Veeriah teaches a system comprising: one or more computers; and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform operations for (see Veeriah Section 4 teaching that the method is implemented and evaluated on computing hardware with GPUs).
Regarding independent claim 20, it is a computer storage media claim that is substantially the same as the method of claim 1. Thus, claim 20 is rejected for the same reason as claim 1. In addition, Veeriah teaches one or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations for (see Veeriah Section 4 teaching that the method is implemented and evaluated on computing hardware with GPUs).
Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over Veeriah, in view of Jaderberg, and further in view of Schlegel et al. (hereinafter Schlegel), "General Value Function Networks" (2021).
Schlegel was disclosed in an IDS dated 8/9/2024.
Regarding dependent claim 16, Veeriah, in view of Jaderberg, teach the method of claim 1, wherein: each auxiliary prediction neural network generates a respective state value estimate for the current state of the environment relative to the corresponding auxiliary reward (see Veeriah Section 2.4 teaching that the answer network generates GVF answers, which are state value estimates relative to the cumulants, which corresponds to each auxiliary prediction neural network generating a respective state value estimate for the current state of the environment relative to the corresponding auxiliary reward).
Veeriah and Jaderberg do not explicitly teach that the input to the action selection neural network further comprises the respective state value estimate generated by each auxiliary prediction neural network.
However, Schlegel teaches that the input to an action selection neural network further comprises the respective state value estimate generated by each auxiliary prediction neural network (see Schlegel Section 4 teaching a General Value Function Network (GVFN) architecture where the internal state components are explicitly constrained to be GVF predictions, meaning each element corresponds to a state value estimate for a multi-step policy-contingent question; and Section 14 explicitly teaching applying these predictive networks to a control setting by "using the state of the GVFN as input to an actor-critic algorithm", where an actor-critic algorithm corresponds to the claimed action selection neural network).
Because Veeriah, in view of Jaderberg, and Schlegel both address the issue of improving reinforcement learning agent performance and representation learning through auxiliary prediction objectives, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Veeriah and Jaderberg by incorporating the teachings of Schlegel, with a reasonable expectation of success, to include Veeriah’s generated GVF answers (state value estimates) as additional inputs to the main task network (action selection neural network), such that the input to the action selection neural network further comprises the respective state value estimate generated by each auxiliary prediction neural network. This modification would have been motivated by the desire to provide the policy network with explicit, rich predictive features about the future state of the environment, as using these predictive features as inputs provides a useful summary of the observed sequence that facilitates more accurate decision-making and control in partially observable environments (Schlegel Section 1, Section 14).
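For illustration only, the proposed combination can be sketched as follows: the state value estimates produced by the auxiliary prediction networks are appended to the features supplied to the action selection network. This is a minimal sketch under stated assumptions and does not reproduce the architecture of Veeriah, Jaderberg, or Schlegel: the linear policy head, the softmax, the function names, and all numeric values are hypothetical.

```python
import math


def build_policy_input(observation_features, aux_value_estimates):
    """Append one state value estimate per auxiliary network to the observation features."""
    return list(observation_features) + list(aux_value_estimates)


def action_probabilities(policy_input, weights):
    """Hypothetical linear policy head: one weight row per action, then a softmax."""
    logits = [sum(w * x for w, x in zip(row, policy_input)) for row in weights]
    max_logit = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(logit - max_logit) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical values: three observation features and two auxiliary value estimates.
observation_features = [0.2, -1.0, 0.5]
aux_value_estimates = [1.5, 0.3]
policy_input = build_policy_input(observation_features, aux_value_estimates)

# Two actions, each with one weight per element of the combined input.
weights = [
    [0.1, 0.0, 0.2, 0.5, -0.1],
    [0.0, 0.3, -0.2, 0.1, 0.4],
]
probs = action_probabilities(policy_input, weights)
```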
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Lavin et al., US 2025/0068882 A1 (Feb. 27, 2025) (Abstract: Described are systems for determining domain observations of an environment. Systems may include: a domain engine module, an active sensing module, a fractal network module, and an execution agent module. Modules may be configured to perform methods for determining domain observations of the environment. Methods may include generating or receiving domain observations, generating or receiving sim actions, generating fractal networks associated with the domain observations or the sim actions, generating observation sequences from the fractal networks, and comparing the observation sequences to the domain observations).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to KUANG FU CHEN whose telephone number is (571)272-1393. The examiner can normally be reached M-F 9:00-5:30pm ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jennifer Welch, can be reached at (571) 272-7212. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/KC CHEN/Primary Patent Examiner, Art Unit 2143