Prosecution Insights
Last updated: April 19, 2026
Application No. 18/054,260

METHOD FOR CONTROLLING AN AGENT

Final Rejection — §103
Filed: Nov 10, 2022
Examiner: HUANG, YAO D
Art Unit: 2124
Tech Center: 2100 — Computer Architecture & Software
Assignee: Robert Bosch GmbH
OA Round: 2 (Final)
Grant Probability: 63% (Moderate)
Expected OA Rounds: 3-4
Time to Grant: 3y 11m
Grant Probability With Interview: 95%

Examiner Intelligence

Career Allow Rate: 63% (78 granted / 124 resolved), +7.9% vs Tech Center average
Interview Lift: +31.9% (strong); allowance of resolved cases with vs. without an interview
Typical Timeline: 3y 11m average prosecution; 18 applications currently pending
Career History: 142 total applications across all art units

Statute-Specific Performance

§101: 17.6% (-22.4% vs TC avg)
§103: 47.1% (+7.1% vs TC avg)
§102: 9.5% (-30.5% vs TC avg)
§112: 22.9% (-17.1% vs TC avg)
Tech Center averages are estimates. Based on career data from 124 resolved cases.

Office Action (§103)

Notice of Pre-AIA or AIA Status The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA . Remarks This Office Action is in response to applicant’s amendment filed on January 22, 2026, under which claims 1-6 and 8-9 are pending and under consideration. Response to Arguments Applicant’s amendments have overcome the previous claim rejections under § 112 and § 101. Therefore, these rejections have been withdrawn. Applicant’s arguments directed to the § 103 rejections have been fully considered but are not deemed to be persuasive. In the present amendment, applicant has incorporated the features of original dependent claim 7 into claim 1. However, the amended claim 1 remains rejected over the same reference as applied in the previous office action. Applicant argues: This feature recites training that is based on representations that combine training data based on the real world and training data based on simulation. The Patent Office maps this claim to [0058] of Hafner, but [0058] teaches away from this claimed feature. [0058] reads as follows: … Since the above blurb, especially the highlighted portion, discloses the avoiding of risks associated with training an agent based on real world data by using a simulated environment, it follows that Hafner is not disclosing the claimed combined real world and simulated training data as recited in claim 7. Therefore, since Hafner teaches away from the combined real world/simulated training approach, and neither Farid nor Ailshire overcomes this deficiency, withdrawal of this rejection is requested. (Applicant’s response, page 7). These arguments are not persuasive for the following reasons. In response to the assertion that “Hafner is not disclosing the claimed combined real world and simulated training data as recited in claim 7,” the Examiner notes that the portion addressed in applicant’s response was not the only part that the Examiner cited in the previous (and current) action for the limitations at issue. For example, the Examiner also cited paragraph [0051], which teaches the implementation that “the environment is a real-world environment and the agent is a mechanical agent interacting with the real-world environment.” As such, applicant’s statement that Hafner is not disclosing the claimed limitation at issue does not address the full rejection. In regards to applicant’s argument that Hafner teaches away from the combined real world/simulated training approach, this is not persuasive because Hafner does not teach away from such a combination. Instead, paragraph [0058] merely states that training in a simulated environment can have the benefit of avoiding risks associated with the real-world. Thus, this does not necessarily mean that real-world interactions are never used, but only that reliance on real-world interactions can be reduced. Therefore, there is no teaching away from the use of real-world interactions as data for training. In general, paragraph [0034] of Hafner teaches that “the set of training tuples 104 can be derived from experience information generated as a consequence of the interaction of the agent or another agent with the environment or with another instance of the environment for use in training the neural networks” (emphasis added), while [0051] teaches that “the environment is a real-world environment” and [0052] provides examples of “sensor data” from real world environments. 
Therefore, Hafner teaches the use of real-world environment in addition to teaching the use of a simulated environment, and the teachings of Hafner, along with the other references not at issue, render obvious the claimed invention. Therefore, the claims remain rejected over the previously applied references. Moreover, the Examiner notes that use of both simulated and real environments for training data is well known in the art, as evidenced by Co-Reyes et al. (made of record in the previous action), and additional prior art documents made of record but not relied upon. Claim Rejections - 35 USC § 103 In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status. The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made. 1. Claims 1, 6 and 8-9 are rejected under 35 U.S.C. 103 as being unpatentable over Hafner et al. (US 2021/0158162 A1) (“Hafner”) in view of Farid et al., “Few-Shot System Identification for Reinforcement Learning,” 2021 6th Asia-Pacific Conference on Intelligent Robot Systems (ACIRS), Tokyo, Japan, 16-18 July 2021, pp. 1-7 (“Farid”) and Allshire et al., “LASER: Learning a Latent Action Space for Efficient Reinforcement Learning,” 2021 IEEE International Conference on Robotics and Automation (ICRA 2021) May 31 - June 4, 2021, Xi'an, China (“Allshire”). As to claim 1, Hafner teaches a method for controlling an agent, comprising the following steps: [[0006]: “This specification describes a system implemented as computer programs on one or more computers in one or more locations that trains a policy neural network that can be used to control a reinforcement learning agent interacting with an environment by selecting actions to be performed by the agent.”] collecting training data for multiple representations of states of the agent; [[0034]: “the training system 100 maintains a set of training tuples 104. …the set of training tuples 104 can be derived from experience information generated as a consequence of the interaction of the agent or another agent with the environment or with another instance of the environment for use in training the neural networks.” [0036]: “each training tuple can include (i) a respective current observation characterizing a respective current state of the environment.” In regards to the limitation of the states being “states of the agent,” the “environment” includes the agent. 
See [0053] (“the observations may include data characterizing the current state of the robot, e.g., one or more of: joint position, joint velocity, joint force, torque or acceleration, e.g., gravity-compensated torque feedback, and global or relative pose of an item held by the robot.”); [0054] (“In the case of a robot or other mechanical agent or vehicle the observations may similarly include one or more of the position, linear or angular velocity, force, torque or acceleration, and global or relative pose of one or more parts of the agent.”).] training, using the training data: for each representation of the representations, a state encoder for mapping states to latent states in a latent state space, [[0097]: “FIG. 6 is an example illustration of training a representation neural network, a reward neural network, and a transition neural network.” [0098]: “For each training tuple, the system processes, by using the representation neural network 610 and in accordance with current values of the representation network parameters, the training tuple, e.g., a training tuple including observation 606B, to generate a latent representation, e.g., latent representation 612B.” That is, the representation neural network corresponds to a state encoder.] […]; and a transition model, shared for the representations, for latent states, and a Q function model, shared for the representations, for latent states using the state encoder, […], [A “transition model” is disclosed in the form of a transition neural network in [0093]: “The system processes the current latent representation and the selected action using a transition neural network and in accordance with the current values of transition network parameters to generate an updated latent representation (506).” In regards to a “Q function model,” this is generally taught in the form of a function that computes the Q value, as disclosed in [0047]: “The Q value for an action is an estimate of a ‘return’ that would result from the agent performing the action in response to the current observation 210 and thereafter selecting future actions performed by the agent 206 in accordance with current values of the policy network parameters.” Furthermore, [0048] teaches: “A return refers to a cumulative measure of ‘rewards’ received by the agent, for example, a time-discounted sum of rewards.” [0030] teaches the use of a reward neural network ([0030]: “The reward neural network 140 is configured to process a reward network input including a latent representation, e.g., latent representation 112B, to generate a reward network output that specifies a reward, e.g., reward 132B, to be received by the agent…”), and this network is trained, as disclosed in [0097] quoted above, and also in [0105]. Here, the reward neural network is considered to be part of the “Q function model,” since the Q-value is based on the rewards. Additionally, the policy neural network can also be regarded as part of the “Q function model,” and this is also trained, as disclosed in [0039]: “the training engine 160 trains, by using reinforcement learning techniques, the policy neural network 120 and the value neural network 130 on the ‘imagined’ trajectory data generated using the representation, reward and transition neural networks and based on processing information contained in the training tuple set. 
In particular, the training engine 160 trains the policy neural network 120 to generate action selection outputs that can be used to select actions that maximize a cumulative measure of rewards received by the agent and that cause the agent to accomplish an assigned task.” Note that this paragraph teaches that the policy neural network operates as a function of the reward neural network. Thus, both components may be regarded as part of the Q function model. As such, the function of computing the Q value is trained using the training data in that its subcomponents are trained.] receiving a state of the agent in one of the representations for which a control action is to be ascertained; [[0042]: “At each time step, the representation neural network 220 processes an observation 210 characterizing the current state of the environment 208 in accordance with the current values of a set of its network parameters 214 (“representation network parameters”) to generate a latent representation 222 of the observation 210…”] mapping the state to one or more latent states with using the state encoder for the one of the representations; [[0042]: “At each time step, the representation neural network 220 processes an observation 210 characterizing the current state of the environment 208 in accordance with the current values of a set of its network parameters 214 (“representation network parameters”) to generate a latent representation 222 of the observation 210…”] determining Q values for the one or more of latent states for a set of actions using the Q function model;] selecting a control action having the best Q value from the set of actions as the control action; [[0047]: “the action selection output 212 may include a respective Q-value for each action in the set of possible actions that can be performed by the agent…The action with the highest Q-value could also be selected as the action to be performed by the agent.”] and controlling the agent according to the selected control action, [[0042]: “an action 204 to be performed by the agent 206 in response to the observation.” Note that this describes controlling an agent. See also [0042]: “FIG. 2 illustrates an example data flow for using a policy neural network to select actions to be performed by an agent interacting with an environment at each of multiple time steps.”] wherein the representations have a first representation, which is a representation of states in a real world and for which training data are collected through an interaction of the agent with the real world, and [Hafner, [0034]: “Generally the set of training tuples 104 can be derived from experience information generated as a consequence of the interaction of the agent or another agent with the environment or with another instance of the environment for use in training the neural networks.” Hafner, [0051]: “In some implementations, the environment is a real-world environment and the agent is a mechanical agent interacting with the real-world environment, e.g., a robot or an autonomous or semi-autonomous land, air, or sea vehicle navigating through the environment.”] the representations have a second representation, which is a representation of states in a simulation and for which training data are collected through a simulated interaction of the agent with a simulated environment. [[0058]: “In some implementations the environment is a simulated environment and the agent is implemented as one or more computers interacting with the simulated environment. 
Training an agent in a simulated environment may enable the agent to learn from large amounts of simulated training data while avoiding risks associated with training the agent in a real world environment, e.g., damage to the agent due to performing poorly chosen actions.” [0059]: “The simulated environment may be a motion simulation environment, e.g., a driving simulation or a flight simulation, and the agent may be a simulated vehicle navigating through the motion simulation.” While real-world and simulated environments are discussed separately in the above-cited parts of this reference, [0124] teaches that features of different exemplary embodiments can be combined. [0124]: “Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.” As such, the scope of this prior art reference is understood to cover the case in which the training data is derived from both real-world and simulated environments as described in Hafner (see MPEP § 2144.01). Alternatively, if applicant disagrees, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined training data derived from interactions with both real-world and simulated environments as described in Hafner so as to arrive at the limitations of the instant dependent claim. Doing so would have been an obvious combination of prior art elements according to known methods to yield predictable results (MPEP § 2143(I)(A)) for the following reasons: (1) the prior art included each element claimed, although not necessarily in a single prior art reference, with the only difference between the claimed invention and the prior art being the lack of actual combination of the elements in a single prior art reference (as discussed above); (2) one of ordinary skill in the art could have combined the elements as claimed by known methods, and that in combination, each element merely performs the same function as it does separately (since training data derived from both real-world and simulated environments are used for the same purpose, given that these are merely different feasible types of environments from which training data can be collected; this is especially the case since [0051] and [0059] describe similar subjects); and (3) one of ordinary skill in the art would have recognized that the results of the combination were predictable (namely the use of additional functionally equivalent training data, which results in additional training data being used to train the model).] Hafner does not explicitly teach “a state decoder for mapping latent states back from the latent state space, an action encoder for mapping actions to latent actions in a latent action space, and an action decoder for mapping latent actions back from the latent action space” and the limitation of “using…the state decoder, the action encoder and the action decoder” as recited in the “training” process of the instant claim. Farid, which generally pertains to reinforcement learning (see title), teaches “a state decoder for mapping latent states back from the latent state space” and “using…the state decoder” [§ III.A, paragraph 2: “The model is split into three main parts: the state encoder, the dynamics model, and the state decoder.
The states are encoded using a neural network ϕenc as a flexible feature extractor. The importance of finding relevant feature encodings is to act as a kernel that lifts the features from a space where the dynamics is non-linear into a space of linear dynamics… The final part is the decoder network ϕdec which transforms the features from the encoded space back to their original one.” The encoded space is analogous to the “latent state space” of the instant claim. See also § II.D. Note that the decoder is a neural network (see § IV.A, paragraph 1: “All neural networks used, including the state encoder, state decoder, and the network inferring the actual dynamic parameters from the latent dynamics generated by the RNN dynamics encoder”), and is trained (see § III.A, paragraph 2: “the predictive model is trained in a similar manner as with the VRNN using”) and the training uses the state decoder, as shown in equations (3), where the decoder output xti is part of the loss function (3) that is minimized. Note that training the model in this case teaches a Q function model “using…the state decoder,” since the Q-value in the base reference is dependent on all model components.] It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Hafner with the teachings of Farid by implementing a decoder as taught in Farid and to train and use it as part of the training process of Hafner so as to arrive at the limitations of “a state decoder for mapping latent states back from the latent state space” and “using…the state decoder” for the training process as recited in the instant claim. The motivation for doing so would have been to enable transformation of features back to the original (see parts of Farid cited above) in the manner of an autoencoder, which enables the ability to learn the probability distribution of a trajectory (see Farid, § II.D: “The ability to learn the probability distribution of the entire trajectory is essential in this work for time-variant and time-invariant systems and can be achieved in a similar manner as in VRNN with changing the latent space to represent the dynamics instead of the states.”). The combination of references thus far does not explicitly teach “an action encoder for mapping actions to latent actions in a latent action space, and an action decoder for mapping latent actions back from the latent action space” and the limitation of “using… the action encoder and the action decoder” as recited in the “training” process of the instant claim. Allshire, which pertains to reinforcement learning (see title), teaches “an action encoder for mapping actions to latent actions in a latent action space, and an action decoder for mapping latent actions back from the latent action space” and “using… the action encoder and the action decoder.” [§ IV.A, paragraphs 2-3: “Based on these insights, in LASER we propose to learn a representation mapping between original and latent action spaces with a encoder-decoder neural architecture conditioned on the current state of the robot…The action encoder of LASER is a variational neural network EθE : A×Sr → Ā parameterized by θE that encodes an action a in the original action space A, conditioned on the current robot state sr, into a latent action. The function f: Ā → A defined in Sec. 
III for mapping from latent actions to control inputs in the original action space will be represented by a latent state-dependent variational decoder neural network, DθD: Sr × Ā → A parameterized by θD, where â = DθD(sr, ā) is the reconstruction of a, an action in the original space that would have resulted in ā ∼ EθE(a).” The encoder and decoder are trained as described in § IV.B, paragraph 3 (“The decoder DθD will be trained to reconstruct an action”), paragraph 4 (“The KL loss ensures the encoder learns a smooth latent space distribution with zero mean”), and associated equations.] and the limitation of “using… the action encoder and the action decoder.” [§ III, last paragraph: “Assuming that LASER has found an optimal mapping, f, from the latent action space to the original action space, a RL policy π: S → Ā would be able to explore the optimal region of the original action space by acting in the latent action space of the lifted latent MDP Ā.” That is, the autoencoder (which includes both the encoder and decoder) is used for the policy, and is thus used as part of the overall reinforcement learning system.] It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of the references thus far with the teachings of Allshire by implementing the use of an action encoder and decoder as taught by Allshire, so as to arrive at the limitations “an action encoder for mapping actions to latent actions in a latent action space, and an action decoder for mapping latent actions back from the latent action space” and the limitation of “using…the action encoder and the action decoder” as recited in the “training” process of the instant claim. The motivation would have been to implement policy learning in a latent action space, which improves sample efficiency from better alignment of the action space to the task space (see Allshire, abstract: “We evaluate LASER on two contact-rich robotic tasks in simulation, and analyze the benefit of policy learning in the generated latent action space. We show improved sample efficiency compared to the original action space from better alignment of the action space to the task space…”). As to claim 6, the combination of Hafner, Farid, and Allshire teaches the method as recited in claim 1, wherein the state encoder […] map to a respective probability distribution. [In regards to the state encoder, see Hafner, [0023]: “While the observation 106 can typically include high-dimensional sensor data, e.g., image or video data, that characterizes a current state of an environment, the latent representation 112 of an observation can be represented as an ordered collection of numerical values, e.g., a vector or matrix of numerical values, and generally has a lower dimensionality than the observation itself. For example, each observation can be an image that has 64×64×3 pixels, whereas each latent representation can be a 30-dimensional array that specifies a probability distribution of a set of 30 possible latent variables.”] Allshire further teaches the limitation of “the action encoder” also mapping to a “respective probability distribution” [Allshire, § IV.B, first full paragraph on page 6653 right column, teaches: “Finally, we also include a regularization component to the loss in the form of a Kullback–Leibler (KL) divergence term, as is common in variational autoencoder architectures [33]. The KL loss ensures the encoder learns a smooth latent space distribution with zero mean: [see equation (3) in the text].” That is, as shown in equation (3), the encoder maps the action to a normal distribution approaching N(0,1), particularly given the context of a variational autoencoder.] It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have further combined the teachings of the references combined thus far, including the teachings of Allshire noted above, so as to arrive at the claimed invention of the instant dependent claim. Since the teachings of Allshire are part of the techniques of the action encoder and decoder discussed in the rejection of the parent claim, the motivation for doing so is the same as the motivation given for the teachings of Allshire discussed in the rejection of the parent claim. As to claim 8, this claim is directed to a control device for performing the same or substantially the same operations as those recited in claim 1. Therefore, the rejection made to claim 1 is applied to claim 8. Furthermore, Hafner teaches a “control device configured to control an agent, the control device configured to” [[0006]: “This specification describes a system implemented as computer programs on one or more computers in one or more locations that trains a policy neural network that can be used to control a reinforcement learning agent interacting with an environment by selecting actions to be performed by the agent.”] As to claim 9, this claim is directed to a computer readable medium for performing the same or substantially the same operations as those recited in claim 1. Therefore, the rejection made to claim 1 is applied to claim 9. Furthermore, Hafner teaches a “non-transitory computer-readable medium on which are stored instructions for controlling an agent, the instructions, when executed by a computer, causing the computer to perform the following steps” [[0111]: “Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non transitory storage medium for execution by, or to control the operation of, data processing apparatus.”] 2. Claims 2-3 are rejected under 35 U.S.C. 103 as being unpatentable over Hafner in view of Farid and Allshire, and further in view of Van der Pol et al., “Plannable Approximations to MDP Homomorphisms: Equivariance under Actions,” arXiv:2002.11963v1 [cs.LG] 27 Feb 2020 (“Van der Pol”). As to claim 2, the combination of Hafner, Farid, and Allshire teaches the method as recited in claim 1, as set forth above, but does not explicitly teach the remaining limitations of the instant dependent claim. Van der Pol, which relates to reinforcement learning techniques (see abstract), teaches the further limitations of “wherein the training is carried out using a loss function which has a loss that provides a reward when it is highly likely that the latent transition model supplies transitions between latent states to which the state encoder maps states that have transitioned into one another in the training data.” [§ 3.1, paragraph 1: “We represent the mapping Z using a neural network parameterized by θ, whose output will be denoted Zθ. This function maps a state s ∈ S to a latent representation z ∈ 𝒵 ⊆ ℝD.
We additionally approximate the abstract transition T̄ by a function T̄ϕ : 𝒵 × A → 𝒵 parameterized by ϕ, the abstract rewards R̄ by a neural network R̄ζ : 𝒵 → ℝ, parameterized by ζ, that predicts the reward for an abstract state. … During training, we first sample a set of experience tuples D = {(st, at, rt, st+1)} (n = 1, …, N) by rolling out an exploration policy πe for K trajectories. To learn representations that respect Eq. 4 and 5, we minimize the distance between the result of transitioning in observation space, and then mapping to 𝒵, or first mapping to 𝒵 and then transitioning in latent space (see Figure 1). Additionally, the distance between the observed reward R(s) and the predicted reward R̄ζ(Zθ(s)) is minimized. We thus include a general reward loss term. We write s′n = T(sn, an), zn = Zθ(sn), and minimize … [see equation (7) in text] by randomly sampling batches of experience tuples from D. In this paper, we use d(z, z′) = ½(z − z′)² to model distances in 𝒵 ⊆ ℝD. Here, T̄ϕ is a function that maps a point in latent space z ∈ Z to a new state z′ ∈ Z by predicting an action-effect that acts on z.” That is, equation 7 shows a “loss function.” Given that Z is the encoder, Zθ(s′n) represents the latent state mapped by the encoder from state s′n, which in turn is the state resulting from the transition from state sn. Furthermore, T̄ϕ here is a “latent transition model” which maps the point zn to a new latent state z′. In equation 7, which is the loss function, the distance between the latent state Zθ(s′n) and the latent state predicted by T̄ϕ is minimized, such that the transitions between latent states z and z′ supplied by the latent transition model T̄ϕ are made highly likely to correspond to the latent states to which the state encoder Z maps the states sn and s′n.] It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of the references combined thus far with the teachings of Van der Pol by implementing the technique of learning state representations using a loss function, as described in Van der Pol, so as to arrive at the limitations of the instant dependent claim. The motivation for doing so would have been to provide a latent transition model and a method for training that model to learn state representations that respect action-equivariance (see Van der Pol, § 3.1, paragraph 1, first sentence: “Here we show how to learn state representations that respect action equivariance.”), which results in better representations in fewer epochs (see Van der Pol, abstract: “Empirically, we show that in such MDPs, we obtain better representations in fewer epochs compared to representation learning approaches using reconstructions, while generalizing better to new goals than model-free approaches.”). As to claim 3, the combination of Hafner, Farid, Allshire, and Van der Pol teaches the method as recited in claim 2, as set forth above. Van der Pol further teaches “wherein the loss function has a locality condition term which penalizes large distances in the latent state space between probable transitions between latent states.” [As noted in the rejection of the parent dependent claim, § 3.1 teaches: “To learn representations that respect Eq. 4 and 5, we minimize the distance between the result of transitioning in observation space, and then mapping to 𝒵, or first mapping to 𝒵 and then transitioning in latent space (see Figure 1)…we use d(z, z′) = ½(z − z′)² to model distances in 𝒵 ⊆ ℝD.” Here, 𝒵 ⊆ ℝD is the latent state space, and the first term in the loss function (7), which minimizes the distance d, ensures that the transition from the latent state zn under T̄ϕ conforms to the transition to the latent state Zθ(s′n). As such, accurate (probable) outputs of the latent transition model T̄ϕ are closer to one another than to non-probable outputs. This is also consistent with the use of a contrastive loss function as described in § 3.1, paragraph 2: “To prevent trivial solutions, we use a contrastive loss, maximizing the distance between the latent next state and the embeddings of a set of random other states.”] It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have further combined the teachings of the references combined thus far, including the teachings of Van der Pol noted above, so as to arrive at the claimed invention of the instant dependent claim. Since the teachings of Van der Pol are part of the technique of learning state representations discussed in the rejection of the parent claim, the motivation for doing so is the same as the motivation given for the teachings of Van der Pol discussed in the rejection of the parent claim. 3. Claims 4-5 are rejected under 35 U.S.C. 103 as being unpatentable over Hafner in view of Farid and Allshire, and further in view of Shuai et al., “Double Deep Q-learning Based Real-Time Optimization Strategy for Microgrids,” arXiv:2107.12545v1 [eess.SY] 27 Jul 2021 (“Shuai”). As to claim 4, the combination of Hafner, Farid, and Allshire teaches the method as recited in claim 2, as set forth above, but does not explicitly teach the further limitations of the instant dependent claim. Shuai, which generally pertains to reinforcement learning techniques, specifically Double Deep Q-learning (see title), teaches “wherein the loss function has a reinforcement-learning loss for the shared Q function model.” [§ III.C, Algorithm 1, teaches a loss function in line 10. This is a reinforcement-learning loss because it represents “the error between the target value and the evaluate value” (§ III.B, text above equation 33), where the target value is yj, as shown in equation (33), and Q is the evaluate value, corresponding to the output of a Q value function. Note that this is specifically a double deep Q-network loss. See Algorithm 1, heading: “Algorithm 1 Training of Double Deep Q Network with experience replay.”] It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of the references combined thus far with the teaching of Shuai by implementing a Double Deep Q Network with experience replay and its corresponding loss function as taught in Shuai, so as to arrive at the limitations of the instant dependent claim. The motivation would have been to implement a technique that enables learning a near-optimal strategy only from the historical data, in a manner that outperforms existing methods (see Shuai, abstract: “Then, a double deep Q network (DDQN) based architecture is designed to solve the MINLP problem. The proposed approach can learn a near-optimal strategy only from the historical data.
The effectiveness of the proposed algorithm is validated by the simulations on a 10-bus microgrid system and a modified IEEE 69-bus microgrid system. The numerical simulation results demonstrate that the proposed approach outperforms several existing methods.”). As to claim 5, the combination of Hafner, Farid, Allshire, and Shuai teaches the method as recited in claim 4, “wherein the reinforcement learning loss is a double deep Q-network loss.” [As discussed in the rejection of the parent dependent claim, Shuai teaches a double deep Q-network loss, and such was incorporated into the combination of references as set forth in the rejection of the parent dependent claim. See Algorithm 1, heading: “Algorithm 1 Training of Double Deep Q Network with experience replay.” That is, this feature is covered by the combination of Shuai with the other references and associated motivation for doing so as set forth in the rejection of the parent dependent claim.] Conclusion The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. The following documents depict the state of the art. Peake (US 20200074266 A1) teaches the use of both simulated and real-world data (see [0105]). Kang et al., “Generalization through Simulation: Integrating Simulated and Real Data into Deep Reinforcement Learning for Vision-Based Autonomous Flight,” 2019 International Conference on Robotics and Automation (ICRA) teaches the use of a combination of simulated and real-world data for reinforcement learning. THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. Any inquiry concerning this communication or earlier communications from the examiner should be directed to YAO DAVID HUANG whose telephone number is (571)270-1764. The examiner can normally be reached Monday - Friday 9:00 am - 5:30 pm. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang can be reached at (571) 270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. 
Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /Y.D.H./Examiner, Art Unit 2124 /MIRANDA M HUANG/ Supervisory Patent Examiner, Art Unit 2124
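
Technical Sketches

The central technical dispute on independent claim 1 is whether training data gathered from interaction with the real world and training data gathered from a simulated environment are combined into one training set. The Python sketch below illustrates that idea in generic form only; it is not code from Hafner, Farid, Allshire, or the application, and the environment objects, the policy callable, and the buffer class are hypothetical placeholders.

```python
import random

class ReplayBuffer:
    """Pools transitions regardless of whether they came from the real
    agent or from a simulator, so a single model trains on both."""
    def __init__(self):
        self.transitions = []  # (state, action, reward, next_state, source)

    def add(self, state, action, reward, next_state, source):
        self.transitions.append((state, action, reward, next_state, source))

    def sample(self, batch_size):
        return random.sample(self.transitions, min(batch_size, len(self.transitions)))

def collect(env, policy, buffer, steps, source):
    """Roll out `policy` in `env` and store the experience, tagged by source."""
    state = env.reset()
    for _ in range(steps):
        action = policy(state)
        next_state, reward, done = env.step(action)
        buffer.add(state, action, reward, next_state, source)
        state = env.reset() if done else next_state

# Hypothetical usage: the same buffer receives both kinds of experience, and
# the downstream training loop draws mixed batches without distinguishing them.
# buffer = ReplayBuffer()
# collect(real_robot_env, policy, buffer, steps=1_000, source="real")
# collect(simulator_env, policy, buffer, steps=100_000, source="sim")
# batch = buffer.sample(256)  # mixed real/simulated training batch
```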
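
The rejection characterizes the claimed architecture as per-representation state encoders and decoders feeding a latent transition model and Q-function model that are shared across representations, together with a variational action encoder/decoder in the style of Allshire's LASER. The PyTorch-style sketch below is a minimal rendering of that arrangement under assumed names and dimensions chosen for illustration; it is not taken from Hafner, Farid, or Allshire and omits the training losses.

```python
import torch
import torch.nn as nn

def mlp(inp, out, hidden=64):
    return nn.Sequential(nn.Linear(inp, hidden), nn.ReLU(), nn.Linear(hidden, out))

class LatentAgentModel(nn.Module):
    """One state encoder/decoder per representation (e.g. 'real' and 'sim'),
    with the latent transition model and Q-function shared across them."""
    def __init__(self, state_dims, action_dim, latent_state=16, latent_action=4):
        super().__init__()
        # Per-representation state encoders and decoders.
        self.state_enc = nn.ModuleDict({k: mlp(d, latent_state) for k, d in state_dims.items()})
        self.state_dec = nn.ModuleDict({k: mlp(latent_state, d) for k, d in state_dims.items()})
        # Variational action encoder: outputs mean and log-variance of a
        # latent-action distribution (cf. claim 6 and Allshire's KL-regularized encoder).
        self.action_enc = mlp(action_dim + latent_state, 2 * latent_action)
        self.action_dec = mlp(latent_action + latent_state, action_dim)
        # Shared latent transition model and shared Q-function.
        self.transition = mlp(latent_state + latent_action, latent_state)
        self.q_fn = mlp(latent_state + latent_action, 1)

    def act(self, state, representation, candidate_latent_actions):
        """Encode a state from one representation, score candidate latent
        actions with the shared Q-function, and decode the best one."""
        z = self.state_enc[representation](state)
        q_values = torch.stack([self.q_fn(torch.cat([z, a])) for a in candidate_latent_actions])
        best = int(torch.argmax(q_values))
        return self.action_dec(torch.cat([candidate_latent_actions[best], z]))

# Hypothetical usage: a 10-D real-world state space, a 6-D simulated one, 2-D actions.
# model = LatentAgentModel({"real": 10, "sim": 6}, action_dim=2)
# action = model.act(torch.randn(10), "real", [torch.randn(4) for _ in range(8)])
```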

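The loss functions relied on for the dependent-claim rejections can also be written out in general form. The equations below are a restatement of the standard formulations the examiner points to, namely a latent-transition ("MDP homomorphism") loss with a contrastive term as described in Van der Pol, a KL regularizer on a variational action encoder as described in Allshire, and a double deep Q-network temporal-difference loss as described in Shuai; they are paraphrases for orientation, not verbatim reproductions of those papers' equations.

```latex
% Latent-transition loss (restatement of a Van der Pol-style objective):
% encoding the observed next state should match transitioning in latent space.
\mathcal{L}_{\mathrm{trans}} = \frac{1}{N}\sum_{n=1}^{N}
  d\!\left(\bar{T}_{\phi}\big(Z_{\theta}(s_n), a_n\big),\; Z_{\theta}(s'_n)\right),
\qquad d(z, z') = \tfrac{1}{2}\lVert z - z' \rVert^{2}

% Contrastive ("locality") term: keep the predicted next latent state far
% from embeddings of randomly sampled negative states \tilde{s}_n.
\mathcal{L}_{\mathrm{neg}} = \frac{1}{N}\sum_{n=1}^{N}
  \max\!\Big(0,\; \varepsilon - d\big(\bar{T}_{\phi}(Z_{\theta}(s_n), a_n),\, Z_{\theta}(\tilde{s}_n)\big)\Big)

% KL regularizer on a variational action encoder (Allshire-style, restated):
% the latent-action distribution is pulled toward a standard normal.
\mathcal{L}_{\mathrm{KL}} = D_{\mathrm{KL}}\!\big(E_{\theta_E}(\bar{a} \mid a, s_r) \,\big\Vert\, \mathcal{N}(0, I)\big)

% Double deep Q-network loss (Shuai-style, restated): the online network
% selects the next action, the target network evaluates it.
y_j = r_j + \gamma\, Q_{\theta^{-}}\!\Big(s_{j+1},\, \arg\max_{a} Q_{\theta}(s_{j+1}, a)\Big),
\qquad
\mathcal{L}_{\mathrm{DDQN}} = \frac{1}{B}\sum_{j=1}^{B}\big(y_j - Q_{\theta}(s_j, a_j)\big)^{2}
```
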
Prosecution Timeline

Nov 10, 2022: Application Filed
Jul 19, 2025: Non-Final Rejection (§103)
Jan 22, 2026: Response Filed
Feb 19, 2026: Final Rejection (§103, current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12536455: Method for Early Warning Brandish of Transmission Wire Based on Improved Bayes-Adaboost Algorithm (granted Jan 27, 2026; 2y 5m to grant)
Patent 12517958: SYSTEM AND METHOD FOR NEXT STEP PREDICTION OF ICS FLOW USING ARTIFICIAL INTELLIGENCE/MACHINE LEARNING (granted Jan 06, 2026; 2y 5m to grant)
Patent 12518218: DYNAMICALLY SCALABLE MACHINE LEARNING MODEL GENERATION AND RETRAINING THROUGH CONTAINERIZATION (granted Jan 06, 2026; 2y 5m to grant)
Patent 12488279: DOMAIN-SPECIFIC CONSTRAINTS FOR PREDICTIVE MODELING (granted Dec 02, 2025; 2y 5m to grant)
Patent 12475373: INFORMATION PROCESSING APPARATUS AND METHOD AND PROGRAM FOR GENERATING INTEGRATED MODEL (granted Nov 18, 2025; 2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 63%
With Interview: 95% (+31.9%)
Median Time to Grant: 3y 11m
PTA Risk: Moderate
Based on 124 resolved cases by this examiner. Grant probability derived from career allow rate.
