Prosecution Insights
Last updated: April 19, 2026
Application No. 17/118,165

EXPLAINABLE DEEP REINFORCEMENT LEARNING USING A FACTORIZED FUNCTION

Status: Non-Final OA (§103)
Filed: Dec 10, 2020
Examiner: DIEP, DUY T
Art Unit: 2123
Tech Center: 2100 — Computer Architecture & Software
Assignee: Genesee Valley Innovations LLC
OA Round: 5 (Non-Final)

Grant Probability: 25% (At Risk)
Projected OA Rounds: 5-6
Projected Time to Grant: 4y 2m
Grant Probability with Interview: 30%

Examiner Intelligence

Career Allow Rate: 25% (grants only 25% of cases; 5 granted / 20 resolved; -30.0% vs TC avg)
Interview Lift: +5.5% among resolved cases with interview (moderate, roughly +6%, with vs. without interview)
Avg Prosecution: 4y 2m (typical timeline)
Total Applications: 59 across all art units (39 currently pending)

Statute-Specific Performance

§101: 34.1% (-5.9% vs TC avg)
§103: 54.0% (+14.0% vs TC avg)
§102: 2.3% (-37.7% vs TC avg)
§112: 9.6% (-30.4% vs TC avg)

Deltas are versus the Tech Center average estimate; based on career data from 20 resolved cases.

Office Action

§103
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114

A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 11/20/2025 has been entered.

Response to Amendments

The amendments filed 11/20/2025 have been entered. Claims 1-12, 15-20, and 22 remain pending in the application. Applicant's arguments and amendments with respect to the rejection of claims 1-12, 15-20, and 22 under 35 U.S.C. § 103, filed 08/20/2025, have been fully considered but are not persuasive.

Applicant argues that the references do not teach the amended claims, which recite, in some form, that during an update of the neural network, the processor is configured to use the compound reward function, separated from the state action value networks, to provide bootstrap estimates of the outputs of the state action value networks, and that nothing identified by the Examiner in the cited references teaches at least these claim features. Applicant further argues that Jacobs, alone or in combination with Traut, Hafner, and Baker, does not teach training each state-action value network on a Bellman loss based on a respective reward term and producing an output to explain why the action was taken based on each of the trained state action value networks, as recited in each of independent claims 1 and 15.
The examiner respectfully disagrees. Hafner teaches the compound reward function separated from the state action value networks to provide bootstrap estimates of the outputs of the state action value networks at paragraph 0096 ("The system generates, using the ensemble of Q networks, L Q values for the time step"), paragraph 0095 ("The system generates, using the ensemble of reward models, N rewards for the time step"), and paragraph 0098 ("To generate the candidate target Q values for the time step, the system computes, for each combination of (i) a reward model from the N reward models and (ii) a Q network from the L Q networks, a respective model-based value expansion target. In particular, the candidate target Q value for a given time step H in a given trajectory according to a reward model n and a Q network ..."). Hafner teaches a system that computes Q values using an ensemble of Q networks and an ensemble of reward models, indicating that reward generation and Q-value generation are performed by separate model groups. Hafner further teaches generating, for each time step, a model-based value expansion target that serves as a per-timestep Q-value target for updating the current Q network output. This target Q value corresponds to the claimed "bootstrap estimate of the outputs of the state action value networks" under a broadest reasonable interpretation, as it provides a predictive target Q value used to refine the Q output at a specific time step in combination with a reward provided by the reward model. The reward model provides rewards as a contribution to the expansion target for the update at the next time step, and the Q network may generate the candidate target Q values at that time step based on those rewards. The ensemble of reward models in Hafner is reasonably read in combination with Traut's teaching of a shaped reward function that derives a compound reward.
The generated per-timestep Q-value target in Hafner is therefore properly characterized as analogous to a bootstrap estimate of the output of the state action value networks, because it is produced by a reward-based model component separated from the Q network and provides an estimated update target for improving the Q network's current output at each time step. Furthermore, Jacobs in combination with Traut, Hafner, and Baker teaches training each state-action value network on a Bellman loss based on a respective reward term. Jacobs teaches this at paragraph 0081 ("For training by reinforcement learning in an embodiment of the present invention, the agents use the Q-network described above to estimate the expected return, which is the total reward over the remainder of the episode when selecting a particular node as their next action ... Each piece of experience is defined by a current observation (graph) G, the action a chosen by the agent, the reward r received in response"), paragraph 0082 ("The Q-network is trained using the Bellman Equation"), and paragraph 0083 ("The right-hand side of the Bellman Equation serves as the target against which the training loss of the current Q-network is computed"). Jacobs discloses that the Q network is trained using the Bellman Equation, with a training loss computed against the Bellman target, indicating a Bellman loss function. Jacobs also discloses that the training of the Q network is based on the reward received in response to an action selection, thereby indicating that the training of the Q network is based on the reward term of each action and that the network is trained via the Bellman loss function to minimize the error as the agent selects an action and receives a reward for it.
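As an illustrative aside, the Bellman-loss training discussed above can be sketched in a minimal tabular form. This is not code from the application or any cited reference; the state and action names, discount factor, and learning rate are all hypothetical, chosen only to show the target/loss/update structure.

```python
# Illustrative tabular Q-learning step (hypothetical values, not from the
# cited references). The Bellman target is the reward plus the discounted
# max Q of the next state; the "Bellman loss" is the squared error between
# the current Q estimate and that target.

def bellman_update(q, state, action, reward, next_state, gamma=0.9, lr=0.5):
    """One update of a tabular Q function stored as a dict of dicts."""
    target = reward + gamma * max(q[next_state].values())  # RHS of the Bellman Equation
    loss = (q[state][action] - target) ** 2                # Bellman (TD) loss
    q[state][action] += lr * (target - q[state][action])   # gradient-style step toward target
    return loss

# Two states, two actions, all Q values start at zero.
q = {"s0": {"a0": 0.0, "a1": 0.0}, "s1": {"a0": 0.0, "a1": 0.0}}
loss = bellman_update(q, "s0", "a0", reward=1.0, next_state="s1")
print(loss)           # 1.0: squared error before the update
print(q["s0"]["a0"])  # 0.5: moved halfway toward the target of 1.0
```

In the claimed factorized setting, each state-action value network would run an analogous update against its own reward term rather than the single reward shown here.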
Hafner teaches producing an output to explain why the action was taken based on each of the trained state action value networks at paragraph 0037, where Hafner discloses "In another example, the action selection output 122 may include a respective Q-value for each action in the set of possible actions that can be performed by the agent". Hafner discloses establishing a Q network during training of the action selection output to include a Q-value for each action from the set of actions as the rationale for selecting the action.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-8, 11, 12, 15-20, and 22 are rejected under 35 U.S.C. 103 as being unpatentable over Traut et al. (US 20210334696 A1) in view of Hafner et al. (US 20210201156 A1), further in view of Baker et al. (US 20220335296 A1), further in view of Jacobs et al. (US 20210295176 A1).

Regarding claim 1, Traut teaches "training a neural network, through a reinforcement learning algorithm, to determine a policy based on a compound reward function, the compound reward function comprising a sum of two or more reward terms" (paragraph 0068, where Traut discloses "reinforcement machine learning model configuration 108 may be configured according to any suitable combination of ML, AI, and/or NLP techniques…".
Traut discloses a configuration of a reinforcement learning model wherein neural networks are used to train the model, including its reinforcement machine learning algorithm. Traut also discloses, at paragraph 0025, "Shaped reward function 106 is one non-limiting example of a goal-oriented mechanism for training a machine learning policy" and, at paragraph 0039, "shaped reward function may be simultaneously optimized so as to derive a suitable compound reward as a mathematical function of the different sub-goals … As described above, a shaped reward function may be generated by automatically computer composing a plurality of sub-goal specific reward functions ... automatically computer-composing sub-goal specific reward functions for each of a plurality of different criterion statements (e.g., 'reach,' 'avoid,' 'drive,' and/or 'optimize')". Traut thus discloses a shaped reward function that derives a compound reward and performs the same function as a compound reward function for training the machine learning policy. The shaped reward function of Traut further comprises a sum combination of multiple sub-goal specific reward functions, one for each of a plurality of different criterion statements, with these criterion statements as an implication of reward terms.) Traut teaches "using the policy to choose an action of a plurality of possible actions" (paragraph 0013, where Traut discloses "As a result of reinforcement machine learning training, a reinforcement machine learning computer system may be configured to select a 'best' action to perform, starting from any state in the state space" and paragraph 0025, where Traut discloses "Shaped reward function 106 is one non-limiting example of a goal-oriented mechanism for training a machine learning policy". Traut discloses a system using the machine learning policy, including the shaped reward function, configured to select the 'best' action out of a set of possible actions.)
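As an illustrative aside, the "sum of two or more reward terms" structure discussed above can be sketched in a few lines. This is a hypothetical sketch, not code from Traut or the application; the sub-goal names ("reach", "avoid") echo Traut's example criterion statements, but the reward values and conditions are invented for illustration.

```python
# Illustrative compound reward: a sum of sub-goal reward terms, one per
# criterion statement. All names and values are hypothetical.

def compound_reward(state, action, reward_terms):
    """Sum of two or more sub-goal reward terms evaluated on (state, action)."""
    return sum(term(state, action) for term in reward_terms)

reach = lambda s, a: 1.0 if s + a >= 3 else 0.0   # sub-goal: reach a target region
avoid = lambda s, a: -0.5 if a == 2 else 0.0      # sub-goal: avoid a risky action

r = compound_reward(2, 2, [reach, avoid])
print(r)  # 0.5: reach term fires (+1.0) and the avoid penalty applies (-0.5)
```

The point of the sum structure is that each term remains individually attributable, which is what the claimed per-term state-action value networks exploit.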
Traut does not teach part of the limitation "establishing a state-action value network for each of the two or more reward terms, the state-action value networks separated from the learning network ...". However, Hafner teaches this limitation (paragraph 0045, where Hafner discloses "Each Q network in the ensemble is a neural network that is configured to receive a Q network input… to generate a Q value for the input observation—action pair" and paragraph 0047, where Hafner discloses "the system uses the ensemble of Q networks in updating the model parameters 118 of a separate policy neural network 110". Hafner discloses a reinforcement learning system that trains while maintaining the ensemble of Q networks, wherein each Q network is established in consideration of each input observation—action pair. The policy neural network is an implication of a learning network, the Q network is an implication of the state-action value network, and the input observation—action pair is an implication of reward terms. These Q networks are trained separately from the policy neural network, and thus have separate parameters. Because a network is employed for each input observation—action pair, any parameters or embedded representations might in general be shared between networks while the networks remain separated from each other. The Q network may further be separated from the policy network based on the difference in technique, as a person of ordinary skill in the art may further incorporate the technique taught by Baker, as discussed below, into the Q network.) Traut does not teach the limitation "producing an output to explain why the action was taken based on each of the trained state action value networks".
However, Hafner teaches this limitation (paragraph 0037, where Hafner discloses "In another example, the action selection output 122 may include a respective Q-value for each action in the set of possible actions that can be performed by the agent". Hafner discloses establishing a Q network during training of the action selection output to include a Q-value for each action from the set of actions as the rationale for selecting the action.) Traut does not teach the limitation "using the compound reward function separated from the state action value networks to provide bootstrap estimate of the outputs of the state action value networks during an update of the neural network". However, Hafner teaches this limitation (paragraph 0096, "The system generates, using the ensemble of Q networks, L Q values for the time step", paragraph 0095, "The system generates, using the ensemble of reward models, N rewards for the time step", and paragraph 0098, "To generate the candidate target Q values for the time step, the system computes, for each combination of (i) a reward model from the N reward models and (ii) a Q network from the L Q networks, a respective model-based value expansion target. In particular, the candidate target Q value for a given time step H in a given trajectory according to a reward model n and a Q network ...". Hafner discloses a system that computes Q values using an ensemble of Q networks and an ensemble of reward models, indicating that reward generation and Q-value generation are performed by separate model groups. Hafner further teaches generating, for each time step, a model-based value expansion target that serves as a per-timestep Q-value target for updating the current Q network output.
This target Q value corresponds to the claimed "bootstrap estimate of the outputs of the state action value networks" under a broadest reasonable interpretation, as it provides a predictive target Q value used to refine the Q output at a specific time step in combination with a reward provided by the reward model. The reward model provides rewards as a contribution to the expansion target for the update at the next time step, and the Q network may generate the candidate target Q values at that time step based on those rewards. The ensemble of reward models in Hafner is reasonably read in combination with Traut's teaching of a shaped reward function to derive a compound reward. The generated per-timestep Q-value target in Hafner is therefore properly characterized as analogous to a bootstrap estimate of the output of the state action value networks, because it is produced by a reward-based model component separated from the Q network and provides an estimated update target for improving the Q network's current output at each time step.) Before the effective filing date of the invention, it would have been obvious to one of ordinary skill in the art to combine the teaching by Traut of a reinforcement learning algorithm with a shaped reward function comprising a sum combination of multiple sub-goal specific reward functions, one for each of a plurality of different criterion statements, for training a machine learning policy to perform action selection, with the teaching by Hafner of establishing a Q network separated from the policy neural network for each input observation—action pair, wherein the rationale for the action selection is based on the Q value, and using the combination of reward model and Q network to continuously update the Q value at each time step.
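As an illustrative aside, the candidate-target construction discussed above (one target per pairing of a reward model with a Q network) can be sketched as follows. This is a hypothetical sketch, not code from Hafner; the reward values, Q values, discount factor, and the choice of a mean for aggregation are all invented for illustration.

```python
# Illustrative ensemble-style bootstrap target: every pairing of a reward
# estimate (one per reward model) with a next-step Q value (one per Q
# network) yields a candidate target r + gamma * q, and the candidates are
# aggregated (here by a simple mean) into the update target for the current
# Q output. All values are hypothetical.

def candidate_targets(rewards, next_q_values, gamma=0.99):
    """One candidate target per (reward model, Q network) combination."""
    return [r + gamma * q for r in rewards for q in next_q_values]

rewards = [1.0, 1.2]          # N = 2 reward models
next_q_values = [10.0, 9.0]   # L = 2 Q networks
targets = candidate_targets(rewards, next_q_values)
mean_target = sum(targets) / len(targets)
print(len(targets))   # 4 candidates (N x L combinations)
print(mean_target)    # aggregated bootstrap target, approximately 10.505
```

The key structural point is that the reward estimates enter the target from a component separate from the Q networks themselves, which is the separation the rejection relies on.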
The motivation to do so is referred to in Hafner's disclosure (paragraph 0011, where Hafner discloses "In particular, by employing ensembles of models and combining target outputs both from trajectories of different lengths and from multiple trajectories of the same length, the described techniques can account for dynamics model uncertainty and Q network bias while still reducing the number of actual samples from the environment that are required to learn an effective policy". Hafner discloses the benefits of employing ensembles of Q network models with a Q value for each action: combining the target outputs from multiple trajectories of the agent helps account for dynamics model uncertainty and Q network bias while maintaining an effective policy even with a reduced number of actual samples from the environment. Both references address the goal of training an agent to select the best action from a set of possible actions using reinforcement machine learning. As shown by Hafner's establishment of a Q network for each model, applying the ensemble of Q networks to the training policy of the shaped reward function, resulting in a Q network for training each sub-goal reward function with a criterion statement per Traut, would improve the effectiveness of Traut's training policy for action selection with all of the benefits mentioned above.) Traut/Hafner does not teach part of the limitation "... a gradient blocking node configured to pass forward activation and block backward gradients". However, Baker teaches this limitation (paragraph 0037, "The third type of node in FIG. 1 is node 103 ... In addition, it does not follow the normal formula for back propagating partial derivatives of the error function based on the chain rule of calculus.
Instead, it selectively blocks the back propagation to node 101, in spite of receiving input activation from node 101.", and paragraph 0111, "In block 542, the computer system computes the forward activation for all the nodes in the network.". Baker discloses a system and method for modifying a machine learning network, such as a deep neural network, to introduce judgment into the network. Within the embodiment, Baker discloses a type of node that selectively blocks back propagation to another node within the neural network. Baker also discloses computing forward activation for all nodes in the network, thus suggesting that node 103, which selectively blocks back propagation, is still able to pass forward the activation input from node 101. Node 103 may block the backward propagation to node 101; because backward propagation is an algorithm that calculates the gradient of a loss function with respect to each weight in the network, blocking backward propagation indicates the claimed process of blocking backward gradients.) Before the effective filing date of the invention, it would have been obvious to one of ordinary skill in the art to combine the teaching by Traut of a reinforcement learning algorithm with a shaped reward function comprising a sum combination of multiple sub-goal specific reward functions, one for each of a plurality of different criterion statements, for training a machine learning policy to perform action selection, and the teaching by Hafner of establishing a Q network separated from the policy neural network for each input observation—action pair, wherein the rationale for the action selection is based on the Q value, and using the combination of reward model and Q network to continuously update the Q value at each time step, with the teaching by Baker of a system and method for modifying a machine learning network with a node that blocks backward propagation and passes forward activation.
The motivation to do so is referred to in Baker's disclosure (paragraph 0052, "A cascade of error judgment nodes with associated combining nodes 103, 213 with selective back propagation can be built to achieve a final error rate as low as feasible for any network, so the design and training of individual elements in the cascade may trade off the goal of immediately trying for the lowest possible error rate for other desirable attributes.", and paragraph 0055, "if the back propagation from the combining node 103 to node 101 is blocked, the error judgment node 102 may be trained on development data 114 that has been set aside and is disjoint from the training data 113 that is used in training node 101. In some embodiments, the error judgment node 102 is trained on both the training data 113 and the development data 114, while node 101 is trained only on the training data 113. When trained on the development data, the error judgment node 102 can be trained by the computer system to learn an unbiased estimate of the performance that node 101 will have on new data. Thus, the computer system may train the error judgment node 102 to correct errors caused by node 101 overfitting its training data as well as correcting errors on the training data." Baker discloses the benefit of blocking backward propagation for selective back propagation: achieving a final error rate as low as feasible for any network. Another benefit is allowing the training of one node to correct errors caused by another node overfitting its training data. Therefore, the teaching of Traut/Hafner may incorporate the teaching of Baker into its Q network for further improvement.)
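As an illustrative aside, the gradient-blocking behavior discussed above (forward activation passes, backward gradient is cut) can be sketched without any ML framework. This is a hypothetical sketch of a stop-gradient-style node, not code from Baker or the application; the node functions and values are invented for illustration.

```python
# Illustrative gradient-blocking node: forward pass is the identity on the
# upstream activation, but the backward pass returns a zero gradient, so no
# error signal reaches the upstream node. All names are hypothetical.

def block_gradient(x):
    """Forward: pass activation through. Backward: block the gradient."""
    forward = x
    backward = lambda upstream_grad: 0.0  # backward gradients are blocked
    return forward, backward

def identity(x):
    """Ordinary node for comparison: identity forward, identity backward."""
    return x, lambda upstream_grad: upstream_grad

activation = 3.0
out_plain, back_plain = identity(activation)
out_blocked, back_blocked = block_gradient(activation)
print(out_plain, back_plain(1.0))      # 3.0 1.0: activation and gradient both flow
print(out_blocked, back_blocked(1.0))  # 3.0 0.0: activation flows, gradient is cut
```

In a framework such as PyTorch or JAX this role is typically played by `detach()` or `stop_gradient`; the sketch only shows the asymmetry between the forward and backward passes.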
Traut/Hafner/Baker does not teach the limitation "training each state-action value network on a Bellman loss based on respective reward term;". However, Jacobs teaches this limitation (paragraph 0081, "For training by reinforcement learning in an embodiment of the present invention, the agents use the Q-network described above to estimate the expected return, which is the total reward over the remainder of the episode when selecting a particular node as their next action ... Each piece of experience is defined by a current observation (graph) G, the action a chosen by the agent, the reward r received in response", paragraph 0082, where Jacobs discloses "The Q-network is trained using the Bellman Equation", and paragraph 0083, where Jacobs discloses "The right-hand side of the Bellman Equation serves as the target against which the training loss of the current Q-network is computed". Jacobs discloses that the Q network is trained using the Bellman Equation, with a training loss computed against the Bellman target, indicating a Bellman loss function. Jacobs also discloses that the training of the Q network is based on the reward received in response to an action selection, thereby indicating that the training of the Q network is based on the reward term of each action and that the network is trained via the Bellman loss function to minimize the error as the agent selects an action and receives a reward for it.) Before the effective filing date of the invention, it would have been obvious to one of ordinary skill in the art to combine the teaching of Traut/Hafner/Baker with the teaching by Jacobs of using the Bellman Equation to train the Q network. The motivation to do so is referred to in Jacobs's disclosure (paragraph 0057, where Jacobs discloses "Using a combination of reinforcement learning and mathematical programming to train an agent to generate near-optimal robust solutions to logistics optimization problems".
Jacobs discloses that the combination of mathematical programming, including the Bellman Equation, with reinforcement learning can help train the agent to produce the most optimal solution to a logistics problem. Because both the teaching of Traut/Hafner/Baker and the teaching of Jacobs refer to the training of the Q network, the training of the Q network can thereby incorporate the mathematical programming of the Bellman Equation to produce the most optimal solution to the logistics problem, allowing the agent to make the best action selection.)

Regarding claim 2, which depends on claim 1, the rejection of claim 1 is incorporated. Traut teaches "producing the output comprises producing a reward tradeoff space that plots the plurality of possible actions based on the two or more reward terms" (paragraph 0023, where Traut discloses "the reinforcement machine learning computer system 102 may be trained to predict a shaped reward value that would result from various actions, with regard to many different states in the state space". Traut discloses that the system generates a shaped reward value, which is an output. Producing this output requires a combination of many sub-goal specific reward functions with criterion statements and consideration of various actions with regard to many different states in the state space. The state space is an implication of a reward tradeoff space, as it comprises various actions in different states.)

Regarding claim 3, which depends on claim 2, the rejection of claim 2 is incorporated. Traut teaches "producing the reward tradeoff space comprises plotting possible actions with substantially equal reward based on the compound reward function on the same line" (paragraph 0038, where Traut discloses "a machine learning system may be configured to define a reward surface based on identifying different candidate sets of sub-goals that lead to a similar reward value".
Traut discloses identifying a reward surface for different candidate sets of sub-goals that lead to a similar reward value, wherein the candidate sets of sub-goals are another interpretation of possible actions selected based on the reward value of the compound reward function, and the whole operation is performed based on the configuration of the state space.)

Regarding claim 4, which depends on claim 3, the rejection of claim 3 is incorporated. Traut teaches "further comprising screening out possible actions that have substantially equal reward" (paragraph 0051, where Traut discloses "method 200 may further comprise computer-indexing the experiment log based on the human-readable label(s) associated with one or more sub-goals. For example, the experiment log may be indexed to find all of the instances of successes and/or failures". Traut discloses an experiment log of human-readable labels which can be used to identify all instances of reward successes, as all rewards are considered equal if they all succeed, with each sub-goal representing a possible action.)

Regarding claim 5, which depends on claim 2, the rejection of claim 2 is incorporated. Traut teaches "further comprising screening out possible actions that have substantially similar reward based on a similarity threshold" (paragraph 0052, where Traut discloses "a performance metric for training based on the plurality of training sub-goals. For example, the performance metric may include a frequency of successfully completing a sub-goal during the training experiment". Traut discloses a performance metric that includes a frequency used to determine the success of each sub-goal during training. Each sub-goal corresponds to a possible action.)

Regarding claim 6, which depends on claim 5, the rejection of claim 5 is incorporated.
Traut teaches "the similarity threshold is a predetermined value" (paragraph 0026, wherein Traut discloses "the training specification may be processed to automatically derive one or more performance metrics for a reinforcement machine learning computer system that may be automatically measured during training". Traut discloses that a performance metric is automatically defined based on the training specification, which indicates that it is automatically predetermined for use by the system.)

Regarding claim 7, which depends on claim 5, the rejection of claim 5 is incorporated. Traut teaches "the similarity threshold is specified by a user" (paragraph 0051, where Traut discloses "a training experiment may be configured to log information pertaining to the human-readable label into an experiment log. For example, the logged information may include… performance metrics pertaining to the sub-goal". Traut discloses that the training experiment is configured to log information which needs to be specified by a user, as the labels are human-readable, and the logged information includes performance metrics.)

Regarding claim 8, which depends on claim 5, the rejection of claim 5 is incorporated. Traut teaches "the similarity threshold is based on a number of possible actions" (paragraph 0052, where Traut discloses "a performance metric for training based on the plurality of training sub-goals". Traut discloses that the performance metric for training is based on the plurality of training sub-goals, wherein each sub-goal represents a possible action.)

Regarding claim 11, which depends on claim 1, the rejection of claim 1 is incorporated. Hafner teaches "learning through the learning network and learning through the state action value networks are done at substantially the same time" (paragraph 0048, where Hafner discloses "the training engine 116 can train the Q networks jointly with the policy neural network 110".
Hafner discloses that the training of the Q networks within the ensemble can be done jointly with the policy neural network, thus implying that both networks can be trained at substantially the same time.)

Regarding claim 12, which depends on claim 1, the rejection of claim 1 is incorporated. Traut teaches "the policy is configured to maximize an output of the compound reward function" (paragraph 0034, where Traut discloses "'Optimize' statements 305A may take the form of any suitable syntax for describing an aspect of a training experiment state to be optimized, for example minimization or maximization" and paragraph 0039, where Traut discloses "a machine learning system and/or shaped reward function may be simultaneously optimized so as to derive a suitable compound reward". Traut discloses a machine learning system including a policy with the reward function being optimized to derive a suitable compound reward. The optimization process can be interpreted as a maximization process that maximizes the compound reward as an output.)

Regarding claim 15, the applicant is directed to the rejections to claim 1 set forth above, because claim 15 comprises limitations similar to claim 1, and it is rejected based on the same rationale.

Regarding claim 16, which depends on claim 15, the rejection of claim 15 is incorporated. The applicant is further directed to the rejections to claim 2 set forth above, because claim 16 comprises limitations similar to claim 2, and it is rejected based on the same rationale.

Regarding claim 17, which depends on claim 16, the rejection of claim 16 is incorporated. The applicant is further directed to the rejections to claim 3 set forth above, because claim 17 comprises limitations similar to claim 3, and it is rejected based on the same rationale.

Regarding claim 18, which depends on claim 17, the rejection of claim 17 is incorporated.
The applicant is further directed to the rejections to claim 4 set forth above, because claim 18 comprises limitations similar to claim 4, and it is rejected based on the same rationale.

Regarding claim 19, which depends on claim 16, the rejection of claim 16 is incorporated. The applicant is further directed to the rejections to claim 5 set forth above, because claim 19 comprises limitations similar to claim 5, and it is rejected based on the same rationale.

Regarding claim 20, which depends on claim 15, the rejection of claim 15 is incorporated. The applicant is further directed to the rejections to claim 9 set forth above, because claim 20 comprises limitations similar to claim 9, and it is rejected based on the same rationale.

Regarding claim 22, which depends on claim 15, the rejection of claim 15 is incorporated. Hafner teaches the limitation "The system of claim 15, wherein the state action value networks are configured to take both an action and a state as input" (paragraph 0045, "Each Q network in the ensemble is a neural network that is configured to receive a Q network input that includes (i) an input observation characterizing a state of an environment being interacted with by an agent and (ii) data identifying an action from a set of actions and process the Q network input to generate a Q value for the input observation—action pair." Hafner discloses that each Q network (state action value network) in the ensemble is configured to receive an input comprising an input observation characterizing a state of an environment being interacted with by an agent, which corresponds to the claimed state input, and data identifying an action from a set of actions, which corresponds to the claimed action input.)

Claims 9 and 10 are rejected under 35 U.S.C.
103 as being unpatentable over Traut et al. (US 20210334696 A1) in view of Hafner et al. (US 20210201156 A1), further in view of Baker et al. (US 20220335296 A1), further in view of Jacobs et al. (US 20210295176 A1), and further in view of Warde-Farley et al. (US 11263531 B2).

Regarding claim 9: claim 9 depends on claim 1, so the rejection of claim 1 is incorporated. The combination of Traut/Hafner/Baker/Jacobs teaches "the state action value network" and "the learning network" as explained for claim 1, but does not teach "the state action value networks share a latent embedding representation with the learning network." However, Warde-Farley teaches this limitation (column 7, lines 33-36, where Warde-Farley discloses "The embedding neural network is configured to process an observation characterizing a state of the environment… to generate an embedding of the observation," and column 3, lines 30-32, where Warde-Farley discloses "the embedding neural network and the action selection neural network share one or more parameter values." Warde-Farley discloses an action selection neural network and an embedding neural network. The two networks share parameters, which include an embedded representation of the observation characterizing a state of the environment.)

Before the effective filing date of the invention, it would have been obvious to one of ordinary skill in the art to combine the teaching of the Q network, the policy neural network, and modifying a machine learning network with a node that blocks backward propagation and passes forward activations, by the combination of Traut/Hafner/Baker/Jacobs, with the teaching of the action selection neural network sharing a latent embedding representation with the embedding neural network, by Warde-Farley.
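The parameter sharing Warde-Farley describes, where an embedding trunk processes the observation once and multiple heads read the same latent representation, can be sketched as follows. This is an illustrative sketch, not Warde-Farley's implementation; all names and dimensions are invented for the example.

```python
# Illustrative sketch (not Warde-Farley's implementation; all names and
# dimensions are invented): one embedding trunk processes the
# observation, and both the action-selection head and the value head
# read the same latent embedding, i.e. the shared parameters live in
# the trunk.
import numpy as np

rng = np.random.default_rng(2)
obs_dim, embed_dim, n_actions = 4, 6, 3

W_embed = rng.normal(scale=0.1, size=(obs_dim, embed_dim))     # shared trunk
W_policy = rng.normal(scale=0.1, size=(embed_dim, n_actions))  # head 1
W_value = rng.normal(scale=0.1, size=(embed_dim, n_actions))   # head 2

def embed(obs):
    """Latent embedding of the observation; shared by both heads."""
    return np.tanh(obs @ W_embed)

def policy_logits(obs):
    """Action-selection head reading the shared embedding."""
    return embed(obs) @ W_policy

def q_values(obs):
    """Value head reading the same shared embedding."""
    return embed(obs) @ W_value

obs = np.array([0.1, 0.4, -0.2, 0.3])
print(embed(obs).shape, policy_logits(obs).shape, q_values(obs).shape)
```

Training both heads jointly, as in Warde-Farley's training engine, updates W_embed from both losses, which is what makes the latent representation genuinely shared rather than duplicated.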
The motivation to do so is found in Warde-Farley's disclosure (column 8, lines 35-38, where Warde-Farley discloses "The training engine 116 is configured to jointly train the model parameters 118 of the action selection network 110 and the embedding network 112 based on the interactions of the agent with the environment." Warde-Farley discloses training the action selection network and the embedding network jointly. It can be understood from Warde-Farley's disclosure that the action selection neural network corresponds to the Q network and the embedding neural network corresponds to the policy neural network. Since both of Warde-Farley's networks are analogous to the two networks of the Traut/Hafner combination, and, according to Warde-Farley, they can be trained jointly with a shared embedded representation, the Q network and the learning network of Traut/Hafner can likewise incorporate the shared embedding representation introduced in Warde-Farley.)

Regarding claim 10: claim 10 depends on claim 9, so the rejection of claim 9 is incorporated. The applicant is further directed to the rejection of claim 1 set forth above; because claim 10 comprises limitations similar to claim 1, it is rejected on the same rationale.

Conclusion

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.
In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to DUY TU DIEP, whose telephone number is (703) 756-1738. The examiner can normally be reached M-F 8-4:30.

Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Alexey Shmatov, can be reached at (571) 270-3428. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).
If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/DUY T DIEP/
Examiner, Art Unit 2123

/ALEXEY SHMATOV/
Supervisory Patent Examiner, Art Unit 2123

Prosecution Timeline

Dec 10, 2020
Application Filed
Aug 24, 2022
Response after Non-Final Action
Dec 28, 2023
Non-Final Rejection — §103
Mar 25, 2024
Response Filed
Jun 03, 2024
Final Rejection — §103
Sep 09, 2024
Response after Non-Final Action
Oct 03, 2024
Examiner Interview (Telephonic)
Dec 06, 2024
Request for Continued Examination
Dec 16, 2024
Response after Non-Final Action
Feb 12, 2025
Non-Final Rejection — §103
May 19, 2025
Response Filed
Aug 11, 2025
Final Rejection — §103
Oct 20, 2025
Response after Non-Final Action
Nov 20, 2025
Request for Continued Examination
Nov 30, 2025
Response after Non-Final Action
Jan 05, 2026
Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12579428
METHOD FOR INJECTING HUMAN KNOWLEDGE INTO AI MODELS
2y 5m to grant Granted Mar 17, 2026
Patent 12488223
FEDERATED LEARNING FOR TRAINING MACHINE LEARNING MODELS
2y 5m to grant Granted Dec 02, 2025
Patent 12412129
DISTRIBUTED SUPPORT VECTOR MACHINE PRIVACY-PRESERVING METHOD, SYSTEM, STORAGE MEDIUM AND APPLICATION
2y 5m to grant Granted Sep 09, 2025
Study what changed to get past this examiner. Based on 3 most recent grants.

Prosecution Projections

5-6
Expected OA Rounds
25%
Grant Probability
30%
With Interview (+5.5%)
4y 2m
Median Time to Grant
High
PTA Risk
Based on 20 resolved cases by this examiner. Grant probability derived from career allow rate.
