Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
The Examiner notes the applicant’s assertion that: “A computer readable storage medium, as that term is used in the present disclosure, is not to be construed as storage in the form of transitory signals per se”, as stated in paragraph 36 of the filed specification.
Claim Objections
Claim 2 is objected to because of the following informalities:
In Claim 2, line 2, “current time step” should read “the current time step”.
Appropriate correction is required.
Response to Amendment
The previous 35 USC 112 rejections of the claims are withdrawn based on the amendments submitted. Based on the broadening of the claims, the previous indication of allowability of the claims is withdrawn.
Response to Arguments
Applicant's arguments have been fully considered but they are not persuasive. In response to applicant’s argument (Remarks, pp. 9-10) that the abstract idea is negated by the claims reciting a solution to a technical problem, the Examiner disagrees that the claims recite this solution; as written, the claims simply change an action based on a threshold (other elements are not recited).
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-18 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1:
All claims are directed towards either a method, a product, or a system and thus satisfy Step 1 as falling into one of the statutory categories.
Step 2A, Prong One:
Independent Claim 1 recites (the same analysis applies to similar independent Claims 7 and 13):
computing utilities for respective candidate actions at a current time step, using a return distribution predictor;
this limitation, under its broadest reasonable interpretation, covers concepts that can be performed in the human mind and therefore would fall under the “Mental Processes” grouping of abstract ideas. That is, the human mind is capable of computing, with pen and paper, objective values or utilities from a return distribution function or predictor.
computing a utility gap between a utility of a first action at the current time step and a utility of a reference action;
this limitation, under its broadest reasonable interpretation, covers concepts that can be performed in the human mind and therefore would fall under the “Mental Processes” grouping of abstract ideas. That is, the human mind is capable of computing, using evaluation, a difference or gap between two action functions.
computing a threshold at the current time step for the utility gap;
this limitation, under its broadest reasonable interpretation, covers concepts that can be performed in the human mind and therefore would fall under the “Mental Processes” grouping of abstract ideas. That is, the human mind is capable of computing, using evaluation, a threshold value pertaining to the above difference/gap.
determining whether the utility gap is greater than the threshold;
this limitation, under its broadest reasonable interpretation, covers concepts that can be performed in the human mind and therefore would fall under the “Mental Processes” grouping of abstract ideas. That is, the human mind is capable of determining, using evaluation, whether the difference/gap is greater than the threshold.
and changing an action to the first action at the current time step only when the utility gap is above the threshold.
this limitation, under its broadest reasonable interpretation, covers concepts that can be performed in the human mind and therefore would fall under the “Mental Processes” grouping of abstract ideas. That is, the human mind is capable of making a determination, using judgement, to change actions based on a difference/gap being above the threshold (a person driving a car in traffic, for instance, makes such decisions).
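For illustration only, the recited steps can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the applicant's implementation: the names `utility` and `select_action` are hypothetical, the return distribution predictor is stood in for by a mapping from each candidate action to sampled returns, the utility is assumed to be the sample mean, the first action is assumed to be the candidate with the highest utility, and the threshold is supplied as an already-computed value.

```python
from statistics import mean
from typing import Dict, Sequence


def utility(returns: Sequence[float]) -> float:
    """Hypothetical utility: the mean of the predicted returns (an assumption)."""
    return mean(returns)


def select_action(
    predicted_returns: Dict[str, Sequence[float]],
    reference_action: str,
    threshold: float,
) -> str:
    """Change to the first action only when the utility gap exceeds the threshold."""
    # Compute a utility for each candidate action at the current time step.
    utilities = {action: utility(r) for action, r in predicted_returns.items()}

    # Assumption: the "first action" is the candidate with the highest utility.
    first_action = max(utilities, key=utilities.get)

    # Utility gap between the first action and the reference action.
    gap = utilities[first_action] - utilities[reference_action]

    # Change actions only when the gap is above the threshold; otherwise
    # keep the reference action.
    return first_action if gap > threshold else reference_action
```

With `predicted = {"keep": [1.0, 1.0], "switch": [2.0, 2.0]}` and "keep" as the reference action, the gap is 1.0, so the sketch returns "switch" for a threshold of 0.5 and "keep" for a threshold of 1.5.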
Step 2A, Prong Two:
Claim 1 recites the additional elements of (the same analysis applies to similar independent Claims 7 and 13):
adaptively-repeated action selection in reinforcement learning,
this is considered as generally linking the use of the judicial exception to a particular technological environment or field of use – see MPEP 2106.05(h).
The further additional elements of “computer” and/or “processors” as recited in independent claims 7 and 13 are recited at a high level of generality such that they amount to no more than mere instructions to apply the exception using a generic computer component. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claims are therefore directed to an abstract idea.
Step 2B:
The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional elements are considered as generally linking the use of the judicial exception to a particular technological environment or field of use – see MPEP 2106.05(h). The further additional elements of “computer” and/or “processors” as recited in these independent claims amount to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. The claims are therefore not patent eligible.
Dependent Claims 2-6, and similar Claims 8-12, and 14-18 are considered as generally linking the use of the judicial exception to a particular technological environment or field of use – see MPEP 2106.05(h).
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
Claims 1-18 are rejected under 35 U.S.C. 102(a)(1) and 102(a)(2) as being anticipated by Ostrovski, US 2020/0364557 A1.
Regarding Claim 1, Ostrovski teaches:
A computer-implemented method for adaptively-repeated action selection in reinforcement learning, the method comprising (paragraph 5: “there is provided a method of selecting an action to be performed by a reinforcement learning agent interacting with an environment”):
computing utilities for respective candidate actions at a current time step, using a return distribution predictor (paragraph 59: “The system 100 can implement a “risk-sensitive” action selection policy by selecting actions that are based on more than just the measures of central tendency of the return distributions corresponding to each possible action”. Each possible action represented at its corresponding time step);
computing a utility gap between a utility of a first action at the current time step and a utility of a reference action (paragraph 28: “The system 100 selects actions 102 to be performed by an agent 104 interacting with an environment 106 at each of multiple time steps. At each time step, the system 100 receives data characterizing the current state of the environment”; and paragraph 29: “the state of the environment 106 at the time step (as characterized by the observation 108) depends on the state of the environment 106 at the previous time step and the action 102 performed by the agent 104 at the previous time step”);
computing a threshold at the current time step for the utility gap (paragraph 5: “receiving a current observation characterizing a current state of the environment; for each action of a plurality of actions that can be performed by the agent to interact with the environment: randomly sampling one or more probability values; for each probability value: processing the action, the current observation, and the probability value using a quantile function network having a plurality of network parameters, wherein the quantile function network is a neural network that is configured to process the action, the current observation, and the probability value in accordance with current values of the network parameters to generate a network output that indicates an estimated quantile value for the probability value with respect to a probability distribution over possible returns that would result from the agent performing the action in response to the current observation, wherein a quantile value for a probability value with respect to a probability distribution refers to a threshold value”);
determining whether the utility gap is greater than the threshold (paragraph 5: “the probability value in accordance with current values of the network parameters to generate a network output that indicates an estimated quantile value for the probability value with respect to a probability distribution over possible returns that would result from the agent performing the action in response to the current observation, wherein a quantile value for a probability value with respect to a probability distribution refers to a threshold value”);
and changing an action to the first action at the current time step only when the utility gap is above the threshold (paragraph 5: “determining a measure of central tendency of the one or more estimated quantile values generated by the quantile function network; and selecting an action from the plurality of possible actions to be performed by the agent in response to the current observation using the measures of central tendency for the actions”).
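The selection procedure quoted from paragraph 5 of Ostrovski (sample probability values, obtain an estimated quantile value for each, take a measure of central tendency per action, and select using those measures) can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `select_by_central_tendency` is hypothetical, plain Python callables stand in for the quantile function network (which is not reproduced here), the mean is assumed as the measure of central tendency, and the action with the highest mean is assumed to be the one selected.

```python
import random
from statistics import mean
from typing import Callable, Dict

# Hypothetical stand-in for the quantile function network, already conditioned
# on the action and the current observation: maps a probability value to an
# estimated quantile value of the return distribution.
QuantileFn = Callable[[float], float]


def select_by_central_tendency(
    quantile_fns: Dict[str, QuantileFn],
    num_samples: int,
    rng: random.Random,
) -> str:
    """Select an action using the mean of sampled estimated quantile values."""
    scores = {}
    for action, quantile_fn in quantile_fns.items():
        # Randomly sample one or more probability values in [0, 1).
        probabilities = [rng.random() for _ in range(num_samples)]
        # Measure of central tendency (assumed: mean) of the quantile values.
        scores[action] = mean(quantile_fn(p) for p in probabilities)
    # Assumption: the action with the highest measure is selected.
    return max(scores, key=scores.get)
```

Note that nothing in this sketch compares a gap between two utilities against a threshold; the quantile value is a property of each action's own return distribution.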
Regarding Claim 2, Ostrovski further teaches:
The computer-implemented method of claim 1, wherein the utility of the reference action is a utility at current time step of the action that has been taken at the previous time step (paragraph 29: “At each time step, the state of the environment 106 at the time step (as characterized by the observation 108) depends on the state of the environment 106 at the previous time step and the action 102 performed by the agent 104 at the previous time step”).
Regarding Claim 3, Ostrovski further teaches:
The computer-implemented method of claim 1, wherein a p-th percentile of utility gaps in last N time steps before the current time step is adopted as the threshold (paragraph 50: “The quantile value for a probability value with respect to a return distribution refers to a threshold return value below which random draws from the return distribution would fall with probability given by the probability value”. The quantile value representative of the p-th percentile).
Regarding Claim 4, Ostrovski further teaches:
The computer-implemented method of claim 3, wherein a value of p and a value of N are predetermined (paragraph 10: “the first estimated quantile value, and (iii) the second estimated quantile value”; and paragraph 28: “The system 100 selects actions 102 to be performed by an agent 104 interacting with an environment 106 at each of multiple time steps”, these multiple time steps can be predetermined).
Regarding Claim 5, Ostrovski further teaches:
The computer-implemented method of claim 1, further comprising: after the adaptively-repeated action selection, updating the return distribution predictor (paragraph 63: “the system 100 may periodically (e.g., after given numbers of training iterations) update the values of the target quantile function network parameters based on the current values of the quantile function network parameters”; and paragraph 65: “FIG. 3 illustrates a quantile function corresponding to the return distribution illustrated by FIG. 2. The quantile function associates a respective quantile value (represented by the vertical axis) with each of multiple possible probability values”).
Regarding Claim 6, Ostrovski further teaches:
The computer-implemented method of claim 1, wherein the first action has a greater impact than the reference action when the utility gap is greater than the threshold, wherein the first action has a similar impact as the reference action when the utility gap is not greater than the threshold (paragraph 5: “selecting an action to be performed by a reinforcement learning agent interacting with an environment, the method comprising: receiving a current observation characterizing a current state of the environment; for each action of a plurality of actions that can be performed by the agent to interact with the environment: randomly sampling one or more probability values; for each probability value: processing the action, the current observation, and the probability value using a quantile function network having a plurality of network parameters, wherein the quantile function network is a neural network that is configured to process the action, the current observation, and the probability value in accordance with current values of the network parameters to generate a network output that indicates an estimated quantile value for the probability value with respect to a probability distribution over possible returns that would result from the agent performing the action in response to the current observation, wherein a quantile value for a probability value with respect to a probability distribution refers to a threshold value below which random draws from the probability distribution would fall with probability given by the probability value; determining a measure of central tendency of the one or more estimated quantile values generated by the quantile function network; and selecting an action from the plurality of possible actions to be performed by the agent in response to the current observation using the measures of central tendency for the actions. A method of this aspect may be implemented by one or more computers”. 
And paragraph 30: “the system 100 may receive a reward 110 based on the current state of the environment 106 and the action 102 of the agent 104 at the time step. In general, the reward 110 is a numerical value. The reward 110 can be based on any event or aspect of the environment 106. For example, the reward 110 may indicate whether the agent 104 has accomplished a task (e.g., navigating to a target location in the environment 106) or the progress of the agent 104 towards accomplishing a task”).
Claims 7-12 and 13-18 are similar to Claims 1-6 and are rejected under the same rationale as stated above for those claims.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. See PTO-892 for the relevant prior art; for example, the NPL of Duan teaches a distribution function/predictor in reinforcement learning.
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DAVE MISIR whose telephone number is (571)272-5243. The examiner can normally be reached M-R 8-5 pm, F some hours.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Abdullah Al Kawsar, can be reached at 571-270-3169. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/DAVE MISIR/
Primary Examiner, Art Unit 2127