DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Status of Claims
Claims 1-17 and 19-21 are pending in this application.
Claim 18 is cancelled.
Claims 1 and 15 are amended.
Claims 1-17 and 19-21 are presented for examination.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 8 October 2025 is being considered by the examiner.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1-17 and 19-21 are rejected under 35 U.S.C. 103 as being unpatentable over Li et al. (US Publication 2019/0250568 A1) in view of Hu et al. (US Publication 2018/0266489 A1).
Regarding claim 1, Li teaches a method of providing a reinforcement learning, RL, agent for decision-making to be used in controlling actuators in an autonomous vehicle, the method comprising: a plurality of training sessions, in which the RL agent interacts with a simulated or real-world first environment including the autonomous vehicle (Li: Para. 21, 23, 28; the learning agent is initially trained to control the (real or virtual) object and/or system within the (real, augmented, or virtual) environment; each training iteration of the training of the learning agent, the learning agent generates a learning (or exploratory) signal and the supervisor agent generates a supervisor signal; reinforcement learning (RL) framework is employed for training; tasks, such as controlling a drone or other autonomous vehicle), such that in each training session the environment is identical but has a different initial value and each training session yields a state-action value function Qk(s, a) dependent on state and action (Li: Para. 60, 74, Fig. 6; learning actor network parameters (θ.sup.μ) may be iteratively updated, the environment is reset and an initial state in the state space is received); ……. ; additional training, in which the RL agent interacts with a simulated or real-world second environment including the autonomous vehicle (Li: Para. 7, 21, 23, 28; reinforcement learning; autonomous vehicle; pioneer agent is then trained; the subsequent training of the pioneer agent, under a reduced level of supervisor influence, includes the replay of training data accumulated during the real-time control of the object), wherein the second environment differs from the first environment by an increased exposure to a subset of state-action pairs for which the variability measure indicates a relatively higher uncertainty (Li: Para. 21, 48, 72, 174; when training the pioneer agent, the influence of the supervisor agent may be attenuated via an incremental reduction of the value of the supervisor coefficient; pioneer agent may include replaying at least portions or samples of the training data).
Li doesn’t explicitly teach an uncertainty evaluation including computing a variability measure for values of the plurality of state-action value functions evaluated for one state-action pair corresponding to a possible driving decision by the trained RL agent, wherein said driving decision is based on a common state-action value function Q(s,a) obtained by combining the plurality of state-action value functions Qk(s,a) from the training sessions.
However, Hu, in the same field of endeavor, teaches an uncertainty evaluation including computing a variability measure for values of the plurality of state-action value functions evaluated for one state-action pair corresponding to a possible driving decision by the trained RL agent (Hu: Para. 94-96; training the number of N number of agents based on the first policy gradient and a second policy gradient and a second critic trained based on the first loss function and a second loss function to learn cooperation between the N number of agents in a multi-agent environment using a Markov game to instantiate a second agent neural network; training the first agent in the single-agent environment may occur prior to training the N number of agents in the multi-agent environment, thereby enhancing the efficiency of the CM3 reinforcement learning), wherein said driving decision is based on a common state-action value function Q(s,a) obtained by combining the plurality of state-action value functions Qk(s,a) from the training sessions (Hu: Para. 174; the Q-masker may restrict or otherwise “mask off” any set of actions that the agent or simulated autonomous vehicle does not need to explore or learn from their outcomes; prior knowledge about the system to be incorporated directly into the learning process).
It would have been obvious to one having ordinary skill in the art to modify the training of learning and pioneer agents with state-action pairs in Li (Li: Para. 23, 33) with the training of the first agent in the single-agent environment before training multiple agents in a multi-agent environment (Hu: Para. 95), with a reasonable expectation of success, because instantiating each agent with the first agent neural network in a pre-trained fashion before training a number of agents in the multi-agent environment enhances the efficiency of the reinforcement learning (Hu: Para. 95).
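For illustration of the claimed uncertainty evaluation only (this is not a disclosure of Li or Hu; it is a minimal sketch assuming numpy, an ensemble of K trained Q-networks from K training sessions, and hypothetical names throughout), the plurality of state-action value functions Qk(s,a) can be combined into a common Q(s,a) by their mean, with the variability measured as their variance:

    import numpy as np

    def ensemble_q_stats(q_functions, state, action):
        """Evaluate each trained Q_k(s, a) for one state-action pair.

        q_functions: list of K callables, one per training session.
        Returns the combined Q(s, a) (here, the ensemble mean) and a
        variability measure (here, the variance across the ensemble).
        """
        values = np.array([q_k(state, action) for q_k in q_functions])
        combined = values.mean()    # common Q(s, a) backing the driving decision
        variability = values.var()  # uncertainty of the ensemble at (s, a)
        return combined, variability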
Regarding claim 2, Li teaches the method of claim 1, further comprising: ……. , wherein the uncertainty evaluation relates to the recorded state-action pairs (Li: Para. 33; training of the learning agent, the state-action pairs; supervisor coefficient value of 0.55; accumulated training data is stored, recorded).
Li doesn’t explicitly teach traffic sampling, in which state-action pairs encountered by the autonomous vehicle are recorded on the basis of at least one physical sensor signal.
However, Hu, in the same field of endeavor, teaches traffic sampling, in which state-action pairs encountered by the autonomous vehicle are recorded on the basis of at least one physical sensor signal (Hu: Para. 185; the Q-masker may enable implementation of training of autonomous vehicles on actual autonomous vehicles; the traffic simulator may instead be replaced with a sensor that detects one or more other vehicles).
It would have been obvious to one having ordinary skill in the art to modify the training of learning and pioneer agents with state-action pairs in Li (Li: Para. 23, 33) with the training of the first agent in the single-agent environment before training multiple agents in a multi-agent environment (Hu: Para. 95), with a reasonable expectation of success, because instantiating each agent with the first agent neural network in a pre-trained fashion before training a number of agents in the multi-agent environment enhances the efficiency of the reinforcement learning (Hu: Para. 95).
Regarding claim 3, Li teaches the method of claim 1, wherein the first and/or the second environment is a simulated environment (Li: Para. 21; the learning agent is initially trained to control the (real or virtual object) object and/or system within the (real, augmented, or virtual) environment).
Regarding claim 4, Li teaches the method of claim 3, wherein the second environment is generated from the subset of state-action pairs (Li: Para. 35; during the training of the pioneer agent, at least portions of the training data observed and accumulated during the training of the learning agent may be replayed).
Regarding claim 5, Li teaches the method of claim 1, wherein the state-action pairs in the subset have a variability measure exceeding a predefined threshold (Li: Para. 31, 36; the learning agent was trained to adequately respond under a supervision influence of 50%, but not trained to respond under a supervisor influence of 45%; initially be set to 1.0, and decayed to approximately 0.0).
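As a sketch of how such a predefined threshold could select the subset for additional training (a hypothetical helper, reusing ensemble_q_stats from the sketch above; not drawn from Li or Hu):

    def uncertain_subset(q_functions, state_action_pairs, threshold):
        """Keep only the (s, a) pairs whose ensemble variability exceeds
        the predefined threshold; these would receive increased exposure
        during the additional training."""
        return [(s, a) for (s, a) in state_action_pairs
                if ensemble_q_stats(q_functions, s, a)[1] > threshold]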
Regarding claim 6, Li teaches the method of claim 1, wherein the additional training includes modifying said plurality of state-action value functions in respective training sessions (Li: Para. 56; pioneer agent is then trained, under the supervision of the supervisor agent; when training the pioneer agent, the influence of the supervisor agent may be attenuated via an incremental reduction of the value of the supervisor coefficient (k)).
Regarding claim 7, Li teaches the method of claim 1, wherein the additional training includes modifying a combined state-action value function representing a central tendency of said plurality of state-action value functions (Li: Para. 56; pioneer agent is then trained, under the supervision of the supervisor agent; when training the pioneer agent, the influence of the supervisor agent may be attenuated via an incremental reduction of the value of the supervisor coefficient (k)).
Regarding claim 8, Li doesn’t explicitly teach wherein the RL agent is configured for tactical decision-making.
However, Hu, in the same field of endeavor, teaches wherein the RL agent is configured for tactical decision-making (Hu: Para. 192; autonomous vehicle policy generation system uses deep reinforcement learning to obtain a high-level policy for tactical decision making).
It would have been obvious to one having ordinary skill in the art to modify the training of learning and pioneer agents with state-action pairs in Li (Li: Para. 23, 33) with the training of the first agent in the single-agent environment before training multiple agents in a multi-agent environment (Hu: Para. 95), with a reasonable expectation of success, because instantiating each agent with the first agent neural network in a pre-trained fashion before training a number of agents in the multi-agent environment enhances the efficiency of the reinforcement learning (Hu: Para. 95).
Regarding claim 9, Li teaches the method of claim 1, wherein the RL agent includes at least one neural network (Li: Para. 38; the learning agent (and/or learning policy), as well as the pioneer agent (and/or pioneer policy), is implemented via deep neural networks).
Regarding claim 10, Li teaches the method of claim 9, wherein the RL agent is obtained by a policy gradient algorithm (Li: Para. 38; actor-critic framework may be adopted in the training of each of the learning and pioneering networks).
Regarding claim 11, Li teaches the method of claim 9, wherein the RL agent is a Q-learning agent (Li: Para. 38; the neural networks are deep Q (referring to the Q-function) networks (DQN)).
Regarding claim 12, Li teaches the method of claim 9, wherein the training sessions use an equal number of neural networks (Li: Para. 54; S may include the same number of dimensions that characterize the state space).
Regarding claim 13, Li doesn’t explicitly teach wherein the initial value corresponds to a randomized prior function, RPF.
However, Hu, in the same field of endeavor, teaches wherein the initial value corresponds to a randomized prior function, RPF (Hu: Para. 121; in each episode, N number of agents are emitted at x.sub.i, on randomly selected initial lanes).
It would have been obvious to one having ordinary skill in the art to modify the training of learning and pioneer agents with state-action pairs in Li (Li: Para. 23, 33) with the training of the first agent in the single-agent environment before training multiple agents in a multi-agent environment (Hu: Para. 95), with a reasonable expectation of success, because instantiating each agent with the first agent neural network in a pre-trained fashion before training a number of agents in the multi-agent environment enhances the efficiency of the reinforcement learning (Hu: Para. 95).
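For background on the claimed randomized prior function (a technique known in the reinforcement learning literature, not disclosed by Li or Hu; a minimal linear sketch with hypothetical names), each ensemble member adds a fixed, randomly initialized prior to a trainable value estimate:

    import numpy as np

    class RPFMember:
        """One ensemble member with a randomized prior function:
        Q_k(s, a) = f_k(s, a) + beta * p_k(s, a), where the prior p_k
        is randomly initialized once and kept fixed during training."""
        def __init__(self, dim, beta=1.0, seed=None):
            rng = np.random.default_rng(seed)
            self.prior_w = rng.normal(size=dim)  # fixed random prior weights
            self.train_w = np.zeros(dim)         # trainable weights
            self.beta = beta

        def q_value(self, features):
            # features: joint (state, action) feature vector of length dim
            return features @ self.train_w + self.beta * (features @ self.prior_w)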
Regarding claim 14, Li teaches the method of claim 1, wherein the variability measure is one or more of: a variance, a range, a deviation, a variation coefficient, an entropy (Li: Para. 21; supervisor coefficient).
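Each listed variability measure can be computed from the same vector of ensemble values Q1(s,a), ..., QK(s,a) for a single state-action pair; a minimal sketch (hypothetical names; the entropy variant assumes the values are first binned into a histogram):

    import numpy as np

    def variability_measures(values, bins=5):
        """Variability of the K ensemble values for one state-action pair."""
        values = np.asarray(values, dtype=float)
        mean, std = values.mean(), values.std()
        hist, _ = np.histogram(values, bins=bins)
        p = hist / hist.sum()
        p = p[p > 0]  # drop empty bins before taking the log
        return {
            "variance": values.var(),
            "range": values.max() - values.min(),
            "deviation": std,
            "variation_coeff": std / mean if mean != 0 else float("nan"),
            "entropy": float(-(p * np.log(p)).sum()),
        }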
Regarding claim 15, Li teaches an arrangement for controlling an autonomous vehicle, comprising: processing circuitry and memory implementing a reinforcement learning, RL, agent configured to interact with a first environment including the autonomous vehicle in a plurality of training sessions (Li: Para. 21, 23, 28, 85; memory, one or more processors; the learning agent is initially trained to control the (real or virtual) object and/or system within the (real, augmented, or virtual) environment; each training iteration of the training of the learning agent, the learning agent generates a learning (or exploratory) signal and the supervisor agent generates a supervisor signal; reinforcement learning (RL) framework is employed for training; tasks, such as controlling a drone or other autonomous vehicle), wherein in each training session the environment is identical but has a different initial value and each training session yields a state-action value function Qk(s, a) dependent on state and action (Li: Para. 60, 74, Fig. 6; learning actor network parameters (θ.sup.μ) may be iteratively updated, the environment is reset and an initial state in the state space is received), ………. , and - initiate additional training, in which the RL agent interacts with a simulated or real-world second environment including the autonomous vehicle (Li: Para. 7, 21, 23, 28; reinforcement learning; autonomous vehicle; pioneer agent is then trained; the subsequent training of the pioneer agent, under a reduced level of supervisor influence, includes the replay of training data accumulated during the real-time control of the object), wherein the second environment differs from the first environment by an increased exposure to a subset of state-action pairs for which the variability measure indicates a relatively higher uncertainty (Li: Para. 21, 48, 72, 174; when training the pioneer agent, the influence of the supervisor agent may be attenuated via an incremental reduction of the value of the supervisor coefficient; pioneer agent may include replaying at least portions or samples of the training data).
Li doesn’t explicitly teach the processing circuitry and memory further implementing a training manager configured to: - estimate an uncertainty evaluation including computing a variability measure for values of the plurality of state-action value functions evaluated for one state-action pair corresponding to a possible driving decision by the trained RL agent, wherein said driving decision is based on a common state-action value function Q(s,a) obtained by combining the plurality of state-action value functions Qk(s,a) from the training sessions.
However, Hu, in the same field of endeavor, teaches the processing circuitry and memory further implementing a training manager configured to: - estimate an uncertainty evaluation including computing a variability measure for values of the plurality of state-action value functions evaluated for one state-action pair corresponding to a possible driving decision by the trained RL agent (Hu: Para. 94-96; training the number of N number of agents based on the first policy gradient and a second policy gradient and a second critic trained based on the first loss function and a second loss function to learn cooperation between the N number of agents in a multi-agent environment using a Markov game to instantiate a second agent neural network; training the first agent in the single-agent environment may occur prior to training the N number of agents in the multi-agent environment, thereby enhancing the efficiency of the CM3 reinforcement learning), wherein said driving decision is based on a common state-action value function Q(s,a) obtained by combining the plurality of state-action value functions Qk(s,a) from the training sessions (Hu: Para. 174; the Q-masker may restrict or otherwise “mask off” any set of actions that the agent or simulated autonomous vehicle does not need to explore or learn from their outcomes; prior knowledge about the system to be incorporated directly into the learning process).
It would have been obvious to one having ordinary skill in the art to modify the training of learning and pioneer agents with state-action pairs in Li (Li: Para. 23, 33) with the training of the first agent in the single-agent environment before training multiple agents in a multi-agent environment (Hu: Para. 95), with a reasonable expectation of success, because instantiating each agent with the first agent neural network in a pre-trained fashion before training a number of agents in the multi-agent environment enhances the efficiency of the reinforcement learning (Hu: Para. 95).
Regarding claim 16, Li teaches the arrangement of claim 15, …… , wherein the training manager is configured to estimate the uncertainty for the recorded state-action pairs (Li: Para. 33; training of the learning agent, the state-action pairs; supervisor coefficient value of 0.55; accumulated training data is stored, recorded).
Li doesn’t explicitly teach further comprising a vehicle control interface configured to record state-action pairs encountered by the autonomous vehicle on the basis of at least one physical sensor in the autonomous vehicle.
However, Hu, in the same field of endeavor, teaches further comprising a vehicle control interface configured to record state-action pairs encountered by the autonomous vehicle on the basis of at least one physical sensor in the autonomous vehicle (Hu: Para. 185; the Q-masker may enable implementation of training of autonomous vehicles on actual autonomous vehicles; the traffic simulator may instead be replaced with a sensor that detects one or more other vehicles).
It would have been obvious to one having ordinary skill in the art to modify the training of learning and pioneer agents with state-action pairs in Li (Li: Para. 23, 33) with the training of the first agent in the single-agent environment before training multiple agents in a multi-agent environment (Hu: Para. 95), with a reasonable expectation of success, because instantiating each agent with the first agent neural network in a pre-trained fashion before training a number of agents in the multi-agent environment enhances the efficiency of the reinforcement learning (Hu: Para. 95).
Regarding claim 17, Li teaches a non-transitory computer readable medium storing a computer program comprising instructions to cause a processor to perform the method of claim 1 (Li: Para. 89; memory includes instructions; instructions, when executed by processor(s) are configured to cause the computing device to perform any of the operations described herein).
Regarding claim 19, Li teaches the method of claim 10, wherein the policy gradient algorithm is an actor-critic algorithm (Li: Para. 38; actor-critic framework may be adopted in the training of each of the learning and pioneering networks).
Regarding claim 20, Li teaches the method of claim 11, wherein the Q-learning agent is a deep Q network (Li: Para. 38; the neural networks are deep Q (referring to the Q-function) networks (DQN)).
Regarding claim 21, Li doesn’t explicitly teach wherein the plurality of training sessions are performed in parallel.
However, Hu, in the same field of endeavor, teaches wherein the plurality of training sessions are performed in parallel (Hu: Para. 75; by having the simulator and critic observe the number of N number of agents, learning for different scenarios may occur in parallel).
It would have been obvious to one having ordinary skill in the art to modify the training of learning and pioneer agents with state-action pairs in Li (Li: Para. 23, 33) with the training of the first agent in the single-agent environment before training multiple agents in a multi-agent environment (Hu: Para. 95), with a reasonable expectation of success, because instantiating each agent with the first agent neural network in a pre-trained fashion before training a number of agents in the multi-agent environment enhances the efficiency of the reinforcement learning (Hu: Para. 95).
Response to Amendments/Arguments
Applicant’s arguments, filed 15 October 2025, with respect to the rejection of claims 1-17 and 19-21 under 35 U.S.C. 103 have been considered, but are not persuasive.
The applicant’s attorney argues that Hu’s agents are not trained in an identical environment.
In response to the argument above, Hu teaches that the simulator 108 may simulate or perform simulation associated with one or more agents 122 taking one or more actions 124 within a simulation environment 126, where one or more critics 128 interpret or evaluate one or more of the actions 124 taken by one or more of the agents 122 to determine one or more rewards 132 and one or more states 134 resulting from the actions taken (Hu: Para. 55, Fig. 1). Hu’s system runs the first agent and the N number of agents through the same layout. The system does begin training in a multi-agent environment, but it only compares the first agent and the N number of agents that went through the same multi-agent environment (Hu: Para. 8).
For prior art Li, the environment is reset and an initial state in the state space is received (Li: Para. 74). This is supported by Figure 6, line 1, which defines Q(s, a|θ^Q). To compare the state-action pairs, the agents must be evaluated through the same Q function; Q also represents the environment, over the changing variables s and a. This means that the different agents are sent through an identical environment (Li: Fig. 6).
The applicant next argues that the action of Hu’s Q-masker 114 does not equate to the “uncertainty evaluation including computing a variability measure for values of the plurality of state-action value functions evaluated for one state-action pair” in claim 1.
In response to the argument above, Hu teaches training the number of agents based on gradients (Hu: Para. 95). Gradients, as rates of change, are a measure of variability for the agents. Hu teaches a loss function to learn cooperation between the number of agents in the environment (Hu: Para. 95). The loss function is an uncertainty evaluation tied to the gradients of the number of agents. Hu teaches a critic based on the loss function of the Markov decision process (Hu: Para. 94). The training of the agents creates gradients that show the variability, and the loss function tied to those gradients is the uncertainty evaluation. The Markov decision process is based on the loss function, in which the information created for the possible driving decisions (the agents) forms a cooperative multi-goal, multi-agent, multi-stage reinforcement learning framework (Hu: Para. 96). Hu’s system therefore does equate to the “uncertainty evaluation including computing a variability measure for values of the plurality of state-action value functions evaluated for one state-action pair.”
The applicant next argues that claim 1 selects which state-action pairs are to be included in “additional training”.
In response to the argument above, Li teaches that training the pioneer agent may include replaying at least portions or samples of the training data (Li: Para. 72, Fig. 6). Hu teaches that the Q-masker may restrict or otherwise “mask off” any set of actions that the agent or simulated autonomous vehicle does not need to explore or learn from their outcomes (Hu: Para. 174). A reward is calculated for each action in the training set. A driving action that does not create a good vehicle movement outcome has a lower reward and is restricted by the Q-masker, so that driving action is not put through the subsequent additional training. Not all of the possible driving options will move to the subsequent additional training.
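The masking discussed above can be illustrated by a minimal sketch of the general action-masking idea (hypothetical names; an assumption about the technique in general, not Hu’s actual implementation), in which masked actions are set to negative infinity before the greedy selection:

    import numpy as np

    def masked_greedy_action(q_values, allowed_mask):
        """Select the greedy action after masking off actions the agent
        should not explore (e.g., a left lane change from the left-most
        lane); a masked action can never be selected. Assumes at least
        one action is allowed."""
        masked = np.where(allowed_mask, q_values, -np.inf)
        return int(np.argmax(masked))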
The applicant next argues that there is no basis for computing “a variability measure for values of the plurality of state-action value functions evaluated for one state-action pair” like claim 1 requires.
In response to the argument above, Hu teaches training the number of agents based on gradients (Hu: Para. 95). Gradients, as rates of change, are a measure of variability for the agents. Hu teaches a loss function to learn cooperation between the number of agents in the environment (Hu: Para. 95). The loss function is an uncertainty evaluation tied to the gradients of the number of agents. Hu teaches a critic based on the loss function of the Markov decision process (Hu: Para. 94). The training of the agents creates gradients that show the variability, and the loss function tied to those gradients is the uncertainty evaluation. The Markov decision process is based on the loss function, in which the information created for the possible driving decisions (the agents) forms a cooperative multi-goal, multi-agent, multi-stage reinforcement learning framework (Hu: Para. 96). Hu’s system therefore does provide a basis for computing “a variability measure for values of the plurality of state-action value functions evaluated for one state-action pair.”
The applicant next argues that there is no incentive to combine Li and Hu.
In response to applicant’s argument that there is no teaching, suggestion, or motivation to combine the references, the examiner recognizes that obviousness may be established by combining or modifying the teachings of the prior art to produce the claimed invention where there is some teaching, suggestion, or motivation to do so found either in the references themselves or in the knowledge generally available to one of ordinary skill in the art. See In re Fine, 837 F.2d 1071, 5 USPQ2d 1596 (Fed. Cir. 1988), In re Jones, 958 F.2d 347, 21 USPQ2d 1941 (Fed. Cir. 1992), and KSR International Co. v. Teleflex, Inc., 550 U.S. 398, 82 USPQ2d 1385 (2007).
In this case, Li teaches a reinforcement learning framework with state-action pairs (Li: Para. 21, 28). Hu teaches a Q-masker that masks off action-states that should not be considered; for example, if the simulated autonomous vehicle is positioned in the left-most lane, then taking a left lane change action would result in getting off the highway (Hu: Para. 174). It would have been obvious to one of ordinary skill in the art to modify Li with Hu because instantiating each agent with the first agent neural network in a pre-trained fashion before training a number of agents in the multi-agent environment enhances the efficiency of the reinforcement learning (Hu: Para. 95).
The applicant next argues that “Claim 5 recites additional elements which distinguish over the cited combination of Li and Hu, because Hu does not disclose an uncertainty threshold for determining which state-action pairs are to participate in the ‘additional training’”.
In response to the argument above, Applicant's arguments fail to comply with 37 CFR 1.111(b) because they amount to a general allegation that the claims define a patentable invention without specifically pointing out how the language of the claims patentably distinguishes them from the references.
The applicant’s arguments have failed to point out the distinguishing characteristics of the amended claim language over the prior art. For the above reasons, Li’s reinforcement learning in combination with Hu’s gradients reads on the applicant’s uncertainty-directed training of a reinforcement learning agent for tactical decision-making. The rejection is maintained.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to LAURA E LINHARDT whose telephone number is (571)272-8325. The examiner can normally be reached M-F: 8am-4pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Angela Ortiz can be reached on (571) 272-1206. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/L.E.L./Examiner, Art Unit 3663
/ANGELA Y ORTIZ/Supervisory Patent Examiner, Art Unit 3663