DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on March 10, 2026, has been entered.
Response to Amendment
This Office action is issued in response to Applicant’s communication filed on February 16, 2026, amending application S/N 17/972,291. Claims 1 and 3 to 21 are currently pending in the application.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1 and 3 to 21 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Claims 1, 8, and 15 recite producing a set of policies.
The limitation of producing a set of policies, which specifically recites “producing, based on the dataset information and the tradeoff information, a set of possibly optimal policies for the multiple objective environment, wherein the set of possibly optimal policies indicates actions for an intelligent agent operating in the multiple objective environment to take, wherein producing the set of possibly optimal policies is based on each of the set of possibly optimal policies satisfying the tradeoff information for at least one condition”, is a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components. That is, other than reciting “by a processor” (claims 8 and 15), nothing in the claim element precludes the steps from practically being performed in a human mind. For example, but for the “by a processor” language, “producing”, in the context of this claim, encompasses the user mentally, or with the aid of pen and paper, writing down a list of candidate best policies or tasks, based on a dataset and tradeoff information. If a claim limitation, under its broadest reasonable interpretation, covers mental processes but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claims recite an abstract idea.
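For illustration only, and not as a characterization of Applicant's disclosure: the kind of enumeration encompassed by the “producing” limitation can be transcribed directly, using hypothetical names and values, as the following sketch, in which a candidate policy is retained whenever no other candidate is at least as good on every objective (i.e., whenever it could be best under at least one condition). The tradeoff information would further prune this set, as illustrated for claim 6 below.

```python
# Illustrative sketch only; all names and values are hypothetical.
# Each candidate policy is summarized by its vector of per-objective values.
candidates = {
    "policy_a": (0.9, 0.2),  # (objective 1 value, objective 2 value)
    "policy_b": (0.5, 0.7),
    "policy_c": (0.4, 0.1),
}

def dominated(v, others):
    """True if some other vector is at least as good on every objective
    and strictly better on at least one."""
    return any(
        all(o >= x for o, x in zip(w, v)) and any(o > x for o, x in zip(w, v))
        for w in others if w != v
    )

# "Possibly optimal" here means: not dominated by any other candidate,
# i.e., potentially best under at least one weighting of the objectives.
possibly_optimal = {
    name: v for name, v in candidates.items()
    if not dominated(v, list(candidates.values()))
}
print(possibly_optimal)  # policy_c is dominated by policy_a and drops out
```

Each step above can be carried out by hand for a small set of candidates, consistent with the “Mental Processes” analysis.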
This judicial exception is not integrated into a practical application. In particular, the claims recite the additional elements – “receiving dataset information comprising a plurality of datasets, each dataset comprising state, action, and reward information including a reward value for each of a plurality of different objectives in a multiple objective environment”, “receiving tradeoff information, the tradeoff information comprising an indication that a first vector comprising first values of each of the plurality of different objectives in the multiple objective environment is preferred to a second vector comprising second values of each of the plurality of different objectives in the multiple objective environment”, a non-transitory computer-readable storage memory, and a processor. The “receiving” limitations amount to data-gathering steps, which are considered to be insignificant extra-solution activity (See MPEP 2106.05(g)). The non-transitory computer-readable storage memory and processor in these steps are recited at a high level of generality (i.e., as a generic processor performing a generic computer function), such that they amount to no more than mere instructions to apply the exception using a generic computer component. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claims are directed to an abstract idea.
The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. The insignificant extra-solution activity identified above, which includes the data-gathering steps, is recognized by the courts as well-understood, routine, and conventional activity when claimed in a merely generic manner (e.g., at a high level of generality) or as insignificant extra-solution activity (See MPEP 2106.05(d)(II)(i), receiving or transmitting data over a network, e.g., using the Internet to gather data, buySAFE, Inc. v. Google, Inc., 765 F.3d 1350, 1355, 112 USPQ2d 1093, 1096 (Fed. Cir. 2014) (computer receives and sends information over a network)). The claims are not patent eligible.
Claim 3 is dependent on claim 1 and includes all the limitations of claim 1. Therefore, claim 3 recites the same abstract idea as claim 1. The claim recites the additional limitation of “receiving a dataset from a simulated environment”, which amounts to a data-gathering step, considered to be insignificant extra-solution activity (See MPEP 2106.05(g)) and recognized by the courts as well-understood, routine, and conventional activity when claimed in a merely generic manner (e.g., at a high level of generality) or as insignificant extra-solution activity (See MPEP 2106.05(d)(II)(i), receiving or transmitting data over a network, e.g., using the Internet to gather data, buySAFE, Inc. v. Google, Inc., 765 F.3d 1350, 1355, 112 USPQ2d 1093, 1096 (Fed. Cir. 2014) (computer receives and sends information over a network)). The claim does not amount to significantly more than the abstract idea. The same rationale applies to claim 4, since it recites limitations that similarly amount to data-gathering steps.
Claim 5 is dependent on claim 1 and includes all the limitations of claim 1. Therefore, claim 5 recites the same abstract idea as claim 1. The claim recites the additional limitations of “iteratively receiving additional tradeoff information from user input and calculating a refined set of possibly optimal policies based on the additional tradeoff information”, where the calculating limitation, as presented, can be performed in the human mind with the aid of pen and paper and therefore further elaborates on the abstract idea, and where the receiving limitation amounts to a data-gathering step, considered to be insignificant extra-solution activity (See MPEP 2106.05(g)) and recognized by the courts as well-understood, routine, and conventional activity when claimed in a merely generic manner (e.g., at a high level of generality) or as insignificant extra-solution activity (See MPEP 2106.05(d)(II)(i), receiving or transmitting data over a network, e.g., using the Internet to gather data, buySAFE, Inc. v. Google, Inc., 765 F.3d 1350, 1355, 112 USPQ2d 1093, 1096 (Fed. Cir. 2014) (computer receives and sends information over a network)). The claim does not amount to significantly more than the abstract idea.
Claim 6 is dependent on claim 1 and includes all the limitations of claim 1. Therefore, claim 6 recites the same abstract idea as claim 1. The claim recites the additional limitations of “comparing, using sets of weighting values with each set corresponding to a different condition and each value within each set corresponding to an objective of the plurality of different objectives, the first vector and the second vector by calculating a first sum of products of the sets of weighting values and first objective values of the first vector and a second sum of products of the sets of weighting values and second objective values of the second vector; and adding a first possibly optimal policy corresponding to the first vector to the set of possibly optimal policies when the first sum is greater than the second sum for at least one set of weighting values from the sets of weighting values”, where the comparing, calculating, and adding limitations, as presently presented, can be performed in the human mind with the aid of pen and paper and therefore further elaborate on the abstract idea. The claim does not amount to significantly more.
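For illustration only: the comparing, calculating, and adding limitations of claim 6 reduce to the following arithmetic, shown here with hypothetical weight sets and objective values; each sum of products can be computed by hand.

```python
# Illustrative transcription of the claim 6 limitations; values are hypothetical.
first_vector = (0.9, 0.2)   # first objective values of the first vector
second_vector = (0.5, 0.7)  # second objective values of the second vector

# Each set of weighting values corresponds to a different condition;
# each value within a set corresponds to one objective.
weight_sets = [(0.8, 0.2), (0.3, 0.7), (0.5, 0.5)]

set_of_possibly_optimal_policies = []
for weights in weight_sets:
    first_sum = sum(w * v for w, v in zip(weights, first_vector))
    second_sum = sum(w * v for w, v in zip(weights, second_vector))
    if first_sum > second_sum:
        # The first sum exceeds the second for at least one set of
        # weighting values, so the corresponding policy is added.
        set_of_possibly_optimal_policies.append("first_policy")
        break

print(set_of_possibly_optimal_policies)  # ['first_policy']
```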
Claim 7 is dependent on claim 1 and includes all the limitations of claim 1. Therefore, claim 7 recites the same abstract idea as claim 1. The claim recites the additional limitations of “visually presenting tradeoff options to a user; receiving a selection of one of the tradeoff options; and refining, based on the selection, the set of possibly optimal policies”, where the refining limitation can be performed in the human mind and therefore further elaborates on the abstract idea, and where the presenting limitation amounts to a data-presentation step and the receiving limitation amounts to a data-gathering step, which are considered to be insignificant extra-solution activity (See MPEP 2106.05(g)) and recognized by the courts as well-understood, routine, and conventional activities when claimed in a merely generic manner (e.g., at a high level of generality) or as insignificant extra-solution activity (See MPEP 2106.05(d)(II)(i), receiving or transmitting data over a network, e.g., using the Internet to gather data, buySAFE, Inc. v. Google, Inc., 765 F.3d 1350, 1355, 112 USPQ2d 1093, 1096 (Fed. Cir. 2014) (computer receives and sends information over a network); (v), presenting offers and gathering statistics, OIP Techs., 788 F.3d at 1362-63, 115 USPQ2d at 1092-93). Therefore, the limitations do not amount to significantly more than the abstract idea.
Claim 21 is dependent on claim 1 and includes all the limitations of claim 1. Therefore, claim 21 recites the same abstract idea as claim 1. The claim recites the additional limitation of “producing the set of possibly optimal policies further comprises executing a multi-objective fitted Q-iteration algorithm”; thus, the claim also recites a mathematical concept. If a claim limitation, under its broadest reasonable interpretation in light of the specification, encompasses a mathematical calculation, then it falls within the “Mathematical Concepts” grouping of abstract ideas. Accordingly, the claim recites an abstract idea and therefore does not amount to significantly more.
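For reference, fitted Q-iteration is a batch reinforcement learning method that repeatedly refits a Q-function to bootstrapped targets computed from a fixed dataset; a multi-objective variant handles vector-valued rewards, e.g., by scalarizing them under a chosen weighting. The following minimal tabular sketch assumes a small discrete problem with hypothetical data and is illustrative of the mathematical character of such an algorithm, not of Applicant's claimed method.

```python
import numpy as np

# Minimal tabular sketch of a multi-objective fitted Q-iteration;
# the states, actions, rewards, and batch below are all hypothetical.
rng = np.random.default_rng(0)
n_states, n_actions, n_objectives, gamma = 3, 2, 2, 0.9

# Offline batch of (state, action, reward_vector, next_state) tuples.
batch = [
    (int(rng.integers(n_states)), int(rng.integers(n_actions)),
     rng.random(n_objectives), int(rng.integers(n_states)))
    for _ in range(200)
]

weights = np.array([0.6, 0.4])  # one scalarization condition over objectives
Q = np.zeros((n_states, n_actions))

for _ in range(50):  # repeatedly refit Q against bootstrapped targets
    targets = np.zeros_like(Q)
    counts = np.zeros_like(Q)
    for s, a, r_vec, s_next in batch:
        y = weights @ r_vec + gamma * Q[s_next].max()
        targets[s, a] += y
        counts[s, a] += 1
    Q = np.where(counts > 0, targets / np.maximum(counts, 1), Q)

print(Q.argmax(axis=1))  # greedy policy under this weighting
```

Repeating the fit under different weightings yields one candidate policy per condition; each iteration consists of multiplications, additions, and maximizations, i.e., mathematical calculations.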
Additionally, the claims do not include a requirement of anything other than conventional, generic computer technology for executing the abstract idea, and therefore, do not amount to significantly more than the abstract idea.
The same rationale applies to claims 9 to 14 and 16 to 20, since they recite similar limitations.
Claims 1 and 3 to 21 are therefore not drawn to eligible subject matter, as they are directed to an abstract idea without significantly more.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 2, 3, 6, 8 to 10, 13, 15 to 17, 20, and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Anthony et al. (U.S. Publication No. 2022/0261635), hereinafter Anthony, and further in view of Huang et al. (U.S. Publication No. 2023/0368037), hereinafter Huang.
As to claim 1:
Anthony discloses:
A method, comprising:
receiving dataset information comprising a plurality of datasets, each dataset comprising state, action, and reward information including a reward value for objectives in a multiple objective environment [Paragraph 0019 teaches receiving data characterizing the current state of the environment; Paragraph 0054 teaches training data can include a selected action, a state value (cumulative reward) of the next state of the environment in response to the selected action, and the given input state of the environment; Paragraph 0068 teaches training data includes multiple input state representations and, for each input state representation, an action label that identifies an action that was selected with the improved policy by the best response computation in response to the input state representation; Paragraph 0069 teaches the action label may identify, for each input state representation of the environment, an action selected by the candidate policy that can generate a highest estimated reward computed by the candidate value neural network, when the environment was in the state characterized by the input state representation];
receiving tradeoff information [Paragraph 0043 teaches generating embedding representations of the inputs, therefore, vectors; Paragraph 0055 teaches receiving real actions performed by the other agents and the current state of the environment; Paragraph 0071 teaches training the policy neural network using a supervised learning algorithm, therefore, receiving tradeoff information; Further clarifying, Paragraph 0008 teaches a plurality of candidate actions where each candidate action is associated with a value estimate, where an action to be performed is selected from the candidate actions, based on the highest overall value estimate, in other words, receiving the candidate actions which are values of the objectives, and tradeoff information as represented by the value estimates, and including the highest overall value estimate]; and
producing, based on the dataset information and the tradeoff information, a set of possibly optimal policies for the multiple objective environment, wherein the set of possibly optimal policies indicates actions for an intelligent agent operating in the multiple objective environment to take [Paragraph 0018 teaches selects actions to be performed by a reinforcement learning agent interacting with an environment, while other agents simultaneously interact with the same environment, in other words, a multiple objective environment; Paragraph 0019 teaches selecting actions to be performed by the agent in response to the received data; Paragraph 0062 teaches generating a candidate policy by performing best response computation; Paragraph 0084 teaches generating a candidate policy; Paragraph 0085 teaches defining a probability distribution over a set of candidate actions that can be performed by the agent].
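For orientation only: best response computation of the kind cited from Anthony can be pictured with a toy fictitious-play loop in a two-action matrix game, in which each iteration best-responds to a time-averaged opponent policy. The game and numbers below are hypothetical stand-ins and are not Anthony's implementation.

```python
import numpy as np

# Toy best-response iteration (fictitious play) in a coordination game;
# hypothetical illustration, not Anthony's implementation.
payoff = np.array([[1.0, 0.0],
                   [0.0, 1.0]])  # row player's payoff matrix

opponent_counts = np.ones(2)    # empirical action counts of the other agent
policy = np.array([0.5, 0.5])

for _ in range(100):
    # Time-averaged opponent policy over the preceding iterations.
    opponent_avg = opponent_counts / opponent_counts.sum()
    # Best response: the action with highest expected payoff against it.
    best = int((payoff @ opponent_avg).argmax())
    policy = np.eye(2)[best]
    # Assume the symmetric opponent plays its own best response.
    opponent_counts[best] += 1

print(policy)  # settles on a coordinated pure strategy
```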
Anthony does not appear to expressly disclose reward information including a reward value for each of a plurality of different objectives in a multiple objective environment; receiving tradeoff information, the tradeoff information comprising an indication that a first vector comprising first values of each of the plurality of different objectives in the multiple objective environment is preferred to a second vector comprising second values of each of the plurality of different objectives in the multiple objective environment; wherein producing the set of possibly optimal policies is based on each of the set of possibly optimal policies satisfying the tradeoff information for at least one condition.
Huang discloses:
reward information including a reward value for each of a plurality of different objectives in a multiple objective environment [Paragraph 0025 teaches the action selection system can receive a reward and cost based on the current state of the environment and the action of the agent at the time step];
receiving tradeoff information, the tradeoff information comprising an indication that a first vector comprising first values of each of the plurality of different objectives in the multiple objective environment is preferred to a second vector comprising second values of each of the plurality of different objectives in the multiple objective environment [Paragraph 0027 teaches a preference vector input comprising a vector of preference settings, each preference setting defining a relative priority of one of the rewards or costs with respect to the others, i.e., a relative priority of a corresponding set of target objectives; Paragraph 0032 teaches fixing the preference neural network parameters and adjusting the action selection policy neural network parameters, in particular to jointly optimize the set of target objectives according to the preferences (priorities) defined by the vector of preference settings];
wherein producing the set of possibly optimal policies is based on each of the set of possibly optimal policies satisfying the tradeoff information for at least one condition [Paragraph 0005 teaches jointly optimizing a set of objectives relating to rewards and costs received during the task whilst also learning preferences, i.e. trade-offs between the rewards and costs, that are most likely to produce policies that satisfy the constraints].
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of the cited references and modify the invention as taught by Anthony by incorporating reward information including a reward value for each of a plurality of different objectives in a multiple objective environment; receiving tradeoff information, the tradeoff information comprising an indication that a first vector comprising first values of each of the plurality of different objectives in the multiple objective environment is preferred to a second vector comprising second values of each of the plurality of different objectives in the multiple objective environment; and wherein producing the set of possibly optimal policies is based on each of the set of possibly optimal policies satisfying the tradeoff information for at least one condition, as taught by Huang [Paragraphs 0005, 0025, 0027, 0032], because both references are directed to reinforcement learning, and because receiving tradeoff information indicating a preferred objective improves training accuracy while improving the action selection policy (See Huang Para [0033]).
As to claim 2:
Anthony discloses:
receiving the dataset from an offline dataset source [Paragraph 0050 teaches using labeled data that is available].
As to claim 3:
Anthony discloses:
receiving the dataset from a simulated environment [Paragraph 0027 teaches the environment is a simulated environment].
As to claim 6:
The combination of Anthony and Huang discloses:
comparing, using sets of weighting values with each set corresponding to a different condition and each value within each set corresponding to an objective of the plurality of different objectives, the first vector and the second vector by calculating a first sum of products of the sets of weighting values and first objective values of the first vector and a second sum of products of the sets of weighting values and second objective values of the second vector [Anthony - Paragraph 0063 teaches approximating a time-averaged version of the best responses over the one or more preceding iterations; Paragraph 0072 teaches that the state value for a given state of the environment represents a time discounted sum of rewards to complete a particular task resulting from the environment being in the state, where the system trains the value neural network by adjusting the values of the parameters of the value neural network; Huang – Paragraph 0067 teaches using the Q-value output of each Q-value neural network to optimize the respective target objective, in particular by computing an improved action distribution q.sub.k(a|s, ϵ) for each target objective, i.e., an action distribution that improves the Q-values for the target objective; Paragraph 0068 teaches that computing the improved action distribution q.sub.k(a|s, ϵ) for a target objective may involve scaling the output of the current action selection policy neural network π.sub.old(a|s, ϵ) by a policy improvement factor for the target objective dependent upon the Q-value output from the Q-value neural network for the target objective, Q.sub.k.sup.old(s, a, ϵ); Paragraph 0069 teaches that the policy improvement factor may be viewed as a weight on the action probabilities]; and
adding a first possibly optimal policy corresponding to the first vector to the set of possibly optimal policies when the first sum is greater than the second sum for at least one set of weighting values from the sets of weighting values [Anthony - Paragraph 0058 teaches that at each iteration, the system generates training data that includes actions chosen by an improved policy, where the improved policy is generated using a previous policy and a previous value, and the improved policy is a policy that performs better than the previous policy].
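For illustration of the “policy improvement factor” mechanism cited from Huang, one standard form scales the current action distribution by exponentiated Q-values and renormalizes; the exponential form and all numbers below are assumptions made for this sketch and may differ from Huang's actual formulation.

```python
import numpy as np

# Sketch: scaling an action distribution by a Q-dependent policy
# improvement factor (assumed exponential form; values hypothetical).
pi_old = np.array([0.5, 0.3, 0.2])    # current policy pi_old(a|s, eps)
q_values = np.array([1.0, 2.0, 0.5])  # Q_k(s, a, eps) for one objective
temperature = 1.0                      # controls how greedy the update is

improvement_factor = np.exp(q_values / temperature)
q_improved = pi_old * improvement_factor
q_improved /= q_improved.sum()        # normalized improved distribution

print(q_improved)  # probability mass shifts toward high-Q actions
```

Under this assumed form, the improvement factor acts as a weight on the action probabilities, consistent with the characterization in Paragraph 0069.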
As to claim 21:
Anthony as modified by Huang discloses:
executing a multi-objective fitted Q-iteration algorithm [Huang - Paragraph 0042 teaches maintaining a set of Q-value neural networks, one for each target objective, i.e., one for each of the one or more rewards and one for each of the one or more costs, where each Q-value neural network has a plurality of Q-value neural network parameters and is configured to process an input comprising an action, an observation, and a vector of preference settings to generate a Q-value output for defining an action-value score for the action].
The same rationale applies to claims 8 to 10, 13, 15 to 17, and 20, since they recite similar limitations, and they are therefore similarly rejected.
Claims 4, 5, 7, 11, 12, 14, 18, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Anthony et al. (U.S. Publication No. 2022/0261635) hereinafter Anthony, in view of Huang et al. (U.S. Publication No. 2023/0368037) hereinafter Huang, and further in view of Gomez (U.S. Publication No. 2024/0342901).
As to claim 4:
Anthony does not appear to expressly disclose receiving the tradeoff information from a user.
Gomez discloses:
receiving the tradeoff information from a user [Paragraph 0097 teaches that feedback of the remote operator in response to four recommended routine actions is returned as a reward selected by a user].
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of the cited references and modify the invention as taught by Anthony by receiving the tradeoff information from a user, as taught by Gomez [Paragraph 0097], because both references are directed to reinforcement learning; receiving tradeoff information from the user enhances the flexibility of the system, while improving training accuracy and enhancing efficiency (See Gomez Para [0138]).
As to claim 5:
Anthony discloses:
iteratively receiving additional tradeoff information and calculating a refined set of possibly optimal policies based on the additional tradeoff information [Paragraph 0047 teaches training the policy neural network and the value neural network in an iterative scheme over multiple iterations by iteratively updating the values of the parameters of the policy neural network and the values of the parameters of the value neural network; Paragraph 0053 teaches that iteratively using an improved policy to generate training data is the “best response policy iteration” algorithm].
Anthony does not appear to expressly disclose receiving additional tradeoff information from a user.
Anthony as modified by Gomez further discloses:
receiving additional tradeoff information from a user [Gomez - Paragraph 0008 teaches detecting a selection state of the remote operator for the presented plurality of action sets, and determining a reward in learning on the basis of the detected selection state of the remote operator].
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of the cited references and modify the invention as taught by Anthony by receiving additional tradeoff information from a user, as taught by Gomez [Paragraph 0008], because both references are directed to reinforcement learning; receiving tradeoff information from the user enhances the flexibility of the system, while improving training accuracy and enhancing efficiency (See Gomez Para [0138]).
As to claim 7:
Anthony does not appear to expressly disclose visually presenting tradeoff options to a user; receiving a selection of one of the tradeoff options; and refining, based on the selection, the set of possibly optimal policies.
Gomez discloses:
visually presenting tradeoff options to a user [Gomez - Paragraph 0016 teaches presenting the selected plurality of action sets to a remote operator];
receiving a selection of one of the tradeoff options [Gomez - Paragraph 0016 teaches detecting a selection state of the remote operator for the presented plurality of action sets]; and
refining, based on the selection, the set of possibly optimal policies [Gomez - Paragraph 0016 teaches determining a reward in learning on the basis of the detected selection state of the remote operator].
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of the cited references and modify the invention as taught by Anthony by visually presenting tradeoff options to a user; receiving a selection of one of the tradeoff options; and refining, based on the selection, the set of possibly optimal policies, as taught by Gomez [Paragraph 0016], because both references are directed to reinforcement learning; receiving tradeoff information from the user enhances the flexibility of the system, while improving training accuracy and enhancing efficiency (See Gomez Para [0138]).
The same rationale applies to claims 11, 12, 14, 18, and 19, since they recite similar limitations, and they are therefore similarly rejected.
Response to Arguments
The following is in response to arguments filed on February 16, 2026. Arguments have been carefully and respectfully considered.
Claim Rejections - 35 USC § 101
Applicant’s arguments have been fully and respectfully considered, but are not persuasive.
In regards to claim 1, Applicant argues that “the claims are directed to an improvement in the technical field of machine learning for controlling the actions of intelligent agents. In particular, the claims describe an improvement to producing optimal policies for actions of an intelligent agent in a multiple objective environment using imprecise tradeoff information”.
In response to the preceding argument, Examiner respectfully disagrees and respectfully submits that it is not clear, from Applicant’s argument, what specific improvement in the functioning of a computer, or improvement to another technology or technical field, is achieved by the claimed invention. Based on the preceding argument, the improvement appears to relate to “producing optimal policies for actions of an intelligent agent in a multiple objective environment using imprecise tradeoff information”; however, it is not clear how “producing a set of possibly optimal policies for the multiple objective environment” constitutes an improvement in the functioning of a computer or another technology. That is, it is not clear what, specifically, the technical improvement is. Moreover, the claims as presently presented are silent regarding “imprecise tradeoff information”; therefore, it is also not clear how the described improvement correlates with the claims. The claims therefore remain directed to an abstract idea without significantly more, under the “Mental Processes” grouping of abstract ideas, as detailed in the rejections above. The rejections under 35 U.S.C. 101 are hereby maintained.
Claim Rejections - 35 USC § 103
Applicant’s arguments have been carefully and respectfully considered, but are moot in view of the new grounds of rejection necessitated by the amendments.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to RAQUEL PEREZ-ARROYO whose telephone number is (571) 272-8969. The examiner can normally be reached Monday - Friday, 8:00 am - 5:30 pm EST, alternate Fridays.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Sherief Badawi can be reached at 571-272-9782. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/RAQUEL PEREZ-ARROYO/Primary Examiner, Art Unit 2169