Prosecution Insights
Last updated: April 19, 2026
Application No. 17/118,165

EXPLAINABLE DEEP REINFORCEMENT LEARNING USING A FACTORIZED FUNCTION

Status: Non-Final OA (§103)
Filed: Dec 10, 2020
Examiner: DIEP, DUY T
Art Unit: 2123
Tech Center: 2100 — Computer Architecture & Software
Assignee: Genesee Valley Innovations LLC
OA Round: 5 (Non-Final)

Grant Probability: 25% (At Risk)
Projected OA Rounds: 5-6
Projected Time to Grant: 4y 2m
Grant Probability with Interview: 30%

Examiner Intelligence

Career Allow Rate: 25% (grants only 25% of cases; 5 granted / 20 resolved; -30.0% vs TC avg)
Interview Lift: +5.5% among resolved cases with interview (moderate, roughly +6%, with vs. without interview)
Avg Prosecution: 4y 2m (typical timeline)
Total Applications: 59 across all art units (39 currently pending)

Statute-Specific Performance

§101: 34.1% (-5.9% vs TC avg)
§103: 54.0% (+14.0% vs TC avg)
§102: 2.3% (-37.7% vs TC avg)
§112: 9.6% (-30.4% vs TC avg)

Deltas are versus the Tech Center average estimate; based on career data from 20 resolved cases.

Office Action

§103
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114

A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 11/20/2025 has been entered.

Response to Amendments

The amendments filed 11/20/2025 have been entered. Claims 1-12, 15-20, and 22 remain pending in the application. Applicant's arguments and amendments with respect to the rejection of claims 1-12, 15-20, and 22 under 35 U.S.C. § 103, filed 08/20/2025, have been fully considered but are not persuasive.

Applicant argues that the references do not teach the amended claims, which recite, in some form, that during an update of the neural network, the processor is configured to use the compound reward function, separated from the state action value networks, to provide bootstrap estimates of the outputs of the state action value networks, and that nothing identified by the Examiner in the cited references teaches at least these claim features. Applicant further argues that Jacobs, alone or in combination with Traut, Hafner, and Baker, does not teach training each state-action value network on a Bellman loss based on a respective reward term and producing an output to explain why the action was taken based on each of the trained state action value networks, as recited in each of independent claims 1 and 15.
The examiner respectfully disagrees. Hafner teaches the compound reward function separated from the state action value networks to provide bootstrap estimates of the outputs of the state action value networks at paragraph 0096 ("The system generates, using the ensemble of Q networks, L Q values for the time step"), paragraph 0095 ("The system generates, using the ensemble of reward models, N rewards for the time step"), and paragraph 0098 ("To generate the candidate target Q values for the time step, the system computes, for each combination of (i) a reward model from the N reward models and (ii) a Q network from the L Q networks, a respective model-based value expansion target. In particular, the candidate target Q value for a given time step H in a given trajectory according to a reward model n and a Q network ..."). Hafner teaches a system that computes Q values using an ensemble of Q networks and an ensemble of reward models, indicating that reward generation and Q-value generation are performed by separate model groups. Hafner further teaches generating, for each time step, a model-based value expansion target that serves as a per-timestep Q-value target for updating the current Q network output. This target Q value corresponds to the claimed "bootstrap estimate of the outputs of the state action value networks" under a broadest reasonable interpretation, as it provides a predictive target Q value used to refine the Q output at a specific time step in combination with a reward provided by the reward model. The reward model provides rewards as a contribution to the expansion target for the update at the next time step, and the Q network may generate the candidate target Q values at that time step based on those rewards. The ensemble of reward models in Hafner is reasonably read in combination with Traut's teaching of a shaped reward function that derives a compound reward.
The generated per-timestep Q-value target in Hafner is therefore properly characterized as analogous to a bootstrap estimate of the output of the state action value networks, because it is produced by a reward-based model component separated from the Q network and provides an estimated update target for improving the Q network's current output at each time step. Furthermore, Jacobs in combination with Traut, Hafner, and Baker teaches training each state-action value network on a Bellman loss based on a respective reward term. Jacobs teaches this at paragraph 0081 ("For training by reinforcement learning in an embodiment of the present invention, the agents use the Q-network described above to estimate the expected return, which is the total reward over the remainder of the episode when selecting a particular node as their next action ... Each piece of experience is defined by a current observation (graph) G, the action a chosen by the agent, the reward r received in response"), paragraph 0082 ("The Q-network is trained using the Bellman Equation"), and paragraph 0083 ("The right-hand side of the Bellman Equation serves as the target against which the training loss of the current Q-network is computed"). Jacobs discloses that the Q network is trained using the Bellman Equation, with a training loss computed against the Bellman target, indicating a Bellman loss function. Jacobs also discloses that the training of the Q network is based on the reward received in response to an action selection, thereby indicating that the training of the Q network is based on the reward term of each action and that the network is trained via the Bellman loss function to minimize the error as the agent selects an action and receives a reward for it.
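As an illustrative aside, the Bellman-loss training discussed above can be sketched in a minimal tabular form. This is not code from the application or any cited reference; the state and action names, discount factor, and learning rate are all hypothetical, chosen only to show the target/loss/update structure.

```python
# Illustrative tabular Q-learning step (hypothetical values, not from the
# cited references). The Bellman target is the reward plus the discounted
# max Q of the next state; the "Bellman loss" is the squared error between
# the current Q estimate and that target.

def bellman_update(q, state, action, reward, next_state, gamma=0.9, lr=0.5):
    """One update of a tabular Q function stored as a dict of dicts."""
    target = reward + gamma * max(q[next_state].values())  # RHS of the Bellman Equation
    loss = (q[state][action] - target) ** 2                # Bellman (TD) loss
    q[state][action] += lr * (target - q[state][action])   # gradient-style step toward target
    return loss

# Two states, two actions, all Q values start at zero.
q = {"s0": {"a0": 0.0, "a1": 0.0}, "s1": {"a0": 0.0, "a1": 0.0}}
loss = bellman_update(q, "s0", "a0", reward=1.0, next_state="s1")
print(loss)           # 1.0: squared error before the update
print(q["s0"]["a0"])  # 0.5: moved halfway toward the target of 1.0
```

In the claimed factorized setting, each state-action value network would run an analogous update against its own reward term rather than the single reward shown here.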
Hafner teaches producing an output to explain why the action was taken based on each of the trained state action value networks at paragraph 0037, where Hafner discloses "In another example, the action selection output 122 may include a respective Q-value for each action in the set of possible actions that can be performed by the agent". Hafner discloses establishing a Q network during training of the action selection output to include a Q-value for each action from the set of actions as the rationale for selecting the action.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-8, 11, 12, 15-20, and 22 are rejected under 35 U.S.C. 103 as being unpatentable over Traut et al. (US 20210334696 A1) in view of Hafner et al. (US 20210201156 A1), further in view of Baker et al. (US 20220335296 A1), further in view of Jacobs et al. (US 20210295176 A1).

Regarding claim 1, Traut teaches "training a neural network, through a reinforcement learning algorithm, to determine a policy based on a compound reward function, the compound reward function comprising a sum of two or more reward terms" (paragraph 0068, where Traut discloses "reinforcement machine learning model configuration 108 may be configured according to any suitable combination of ML, AI, and/or NLP techniques…".
Traut discloses a configuration of a reinforcement learning model wherein neural networks are used to train the model, including its reinforcement machine learning algorithm. Traut also discloses, at paragraph 0025, "Shaped reward function 106 is one non-limiting example of a goal-oriented mechanism for training a machine learning policy" and, at paragraph 0039, "shaped reward function may be simultaneously optimized so as to derive a suitable compound reward as a mathematical function of the different sub-goals … As described above, a shaped reward function may be generated by automatically computer composing a plurality of sub-goal specific reward functions ... automatically computer-composing sub-goal specific reward functions for each of a plurality of different criterion statements (e.g., 'reach,' 'avoid,' 'drive,' and/or 'optimize')". Traut thus discloses a shaped reward function that derives a compound reward and performs the same function as a compound reward function for training the machine learning policy. The shaped reward function of Traut further comprises a sum combination of multiple sub-goal specific reward functions, one for each of a plurality of different criterion statements, with these criterion statements as an implication of reward terms.) Traut teaches "using the policy to choose an action of a plurality of possible actions" (paragraph 0013, where Traut discloses "As a result of reinforcement machine learning training, a reinforcement machine learning computer system may be configured to select a 'best' action to perform, starting from any state in the state space" and paragraph 0025, where Traut discloses "Shaped reward function 106 is one non-limiting example of a goal-oriented mechanism for training a machine learning policy". Traut discloses a system using the machine learning policy, including the shaped reward function, configured to select the 'best' action out of a set of possible actions.)
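As an illustrative aside, the "sum of two or more reward terms" structure discussed above can be sketched in a few lines. This is a hypothetical sketch, not code from Traut or the application; the sub-goal names ("reach", "avoid") echo Traut's example criterion statements, but the reward values and conditions are invented for illustration.

```python
# Illustrative compound reward: a sum of sub-goal reward terms, one per
# criterion statement. All names and values are hypothetical.

def compound_reward(state, action, reward_terms):
    """Sum of two or more sub-goal reward terms evaluated on (state, action)."""
    return sum(term(state, action) for term in reward_terms)

reach = lambda s, a: 1.0 if s + a >= 3 else 0.0   # sub-goal: reach a target region
avoid = lambda s, a: -0.5 if a == 2 else 0.0      # sub-goal: avoid a risky action

r = compound_reward(2, 2, [reach, avoid])
print(r)  # 0.5: reach term fires (+1.0) and the avoid penalty applies (-0.5)
```

The point of the sum structure is that each term remains individually attributable, which is what the claimed per-term state-action value networks exploit.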
Traut does not teach part of the limitation "establishing a state-action value network for each of the two or more reward terms, the state-action value networks separated from the learning network ...". However, Hafner teaches this limitation (paragraph 0045, where Hafner discloses "Each Q network in the ensemble is a neural network that is configured to receive a Q network input… to generate a Q value for the input observation—action pair" and paragraph 0047, where Hafner discloses "the system uses the ensemble of Q networks in updating the model parameters 118 of a separate policy neural network 110". Hafner discloses a reinforcement learning system that trains while maintaining the ensemble of Q networks, wherein each Q network is established in consideration of each input observation—action pair. The policy neural network is an implication of a learning network, the Q network is an implication of the state-action value network, and the input observation—action pair is an implication of reward terms. These Q networks are trained separately from the policy neural network, and thus have separate parameters. Because a network is employed for each input observation—action pair, any parameters or embedded representations might in general be shared between networks while the networks remain separated from each other. The Q network may further be separated from the policy network based on the difference in technique, as a person of ordinary skill in the art may further incorporate the technique taught by Baker, as discussed below, into the Q network.) Traut does not teach the limitation "producing an output to explain why the action was taken based on each of the trained state action value networks".
However, Hafner teaches this limitation (paragraph 0037, where Hafner discloses "In another example, the action selection output 122 may include a respective Q-value for each action in the set of possible actions that can be performed by the agent". Hafner discloses establishing a Q network during training of the action selection output to include a Q-value for each action from the set of actions as the rationale for selecting the action.) Traut does not teach the limitation "using the compound reward function separated from the state action value networks to provide bootstrap estimate of the outputs of the state action value networks during an update of the neural network". However, Hafner teaches this limitation (paragraph 0096, "The system generates, using the ensemble of Q networks, L Q values for the time step", paragraph 0095, "The system generates, using the ensemble of reward models, N rewards for the time step", and paragraph 0098, "To generate the candidate target Q values for the time step, the system computes, for each combination of (i) a reward model from the N reward models and (ii) a Q network from the L Q networks, a respective model-based value expansion target. In particular, the candidate target Q value for a given time step H in a given trajectory according to a reward model n and a Q network ...". Hafner discloses a system that computes Q values using an ensemble of Q networks and an ensemble of reward models, indicating that reward generation and Q-value generation are performed by separate model groups. Hafner further teaches generating, for each time step, a model-based value expansion target that serves as a per-timestep Q-value target for updating the current Q network output.
This target Q value corresponds to the claimed "bootstrap estimate of the outputs of the state action value networks" under a broadest reasonable interpretation, as it provides a predictive target Q value used to refine the Q output at a specific time step in combination with a reward provided by the reward model. The reward model provides rewards as a contribution to the expansion target for the update at the next time step, and the Q network may generate the candidate target Q values at that time step based on those rewards. The ensemble of reward models in Hafner is reasonably read in combination with Traut's teaching of a shaped reward function to derive a compound reward. The generated per-timestep Q-value target in Hafner is therefore properly characterized as analogous to a bootstrap estimate of the output of the state action value networks, because it is produced by a reward-based model component separated from the Q network and provides an estimated update target for improving the Q network's current output at each time step.) Before the effective filing date of the invention, it would have been obvious to one of ordinary skill in the art to combine the teaching by Traut of a reinforcement learning algorithm with a shaped reward function comprising a sum combination of multiple sub-goal specific reward functions, one for each of a plurality of different criterion statements, for training a machine learning policy to perform action selection, with the teaching by Hafner of establishing a Q network separated from the policy neural network for each input observation—action pair, wherein the rationale for the action selection is based on the Q value, and using the combination of reward model and Q network to continuously update the Q value at each time step.
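As an illustrative aside, the candidate-target construction discussed above (one target per pairing of a reward model with a Q network) can be sketched as follows. This is a hypothetical sketch, not code from Hafner; the reward values, Q values, discount factor, and the choice of a mean for aggregation are all invented for illustration.

```python
# Illustrative ensemble-style bootstrap target: every pairing of a reward
# estimate (one per reward model) with a next-step Q value (one per Q
# network) yields a candidate target r + gamma * q, and the candidates are
# aggregated (here by a simple mean) into the update target for the current
# Q output. All values are hypothetical.

def candidate_targets(rewards, next_q_values, gamma=0.99):
    """One candidate target per (reward model, Q network) combination."""
    return [r + gamma * q for r in rewards for q in next_q_values]

rewards = [1.0, 1.2]          # N = 2 reward models
next_q_values = [10.0, 9.0]   # L = 2 Q networks
targets = candidate_targets(rewards, next_q_values)
mean_target = sum(targets) / len(targets)
print(len(targets))   # 4 candidates (N x L combinations)
print(mean_target)    # aggregated bootstrap target, approximately 10.505
```

The key structural point is that the reward estimates enter the target from a component separate from the Q networks themselves, which is the separation the rejection relies on.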
The motivation to do so is referred to in Hafner's disclosure (paragraph 0011, where Hafner discloses "In particular, by employing ensembles of models and combining target outputs both from trajectories of different lengths and from multiple trajectories of the same length, the described techniques can account for dynamics model uncertainty and Q network bias while still reducing the number of actual samples from the environment that are required to learn an effective policy". Hafner discloses the benefits of employing ensembles of Q network models with a Q value for each action: combining the target outputs from multiple trajectories of the agent helps account for dynamics model uncertainty and Q network bias while maintaining an effective policy even with a reduced number of actual samples from the environment. Both references address the goal of training an agent to select the best action from a set of possible actions using reinforcement machine learning. As shown by Hafner's establishment of a Q network for each model, applying the ensemble of Q networks to the training policy of the shaped reward function, resulting in a Q network for training each sub-goal reward function with a criterion statement per Traut, would improve the effectiveness of Traut's training policy for action selection with all of the benefits mentioned above.) Traut/Hafner does not teach part of the limitation "... a gradient blocking node configured to pass forward activation and block backward gradients". However, Baker teaches this limitation (paragraph 0037, "The third type of node in FIG. 1 is node 103 ... In addition, it does not follow the normal formula for back propagating partial derivatives of the error function based on the chain rule of calculus.
Instead, it selectively blocks the back propagation to node 101, in spite of receiving input activation from node 101.", and paragraph 0111, "In block 542, the computer system computes the forward activation for all the nodes in the network.". Baker discloses a system and method for modifying a machine learning network, such as a deep neural network, to introduce judgment into the network. Within the embodiment, Baker discloses a type of node that selectively blocks back propagation to another node within the neural network. Baker also discloses computing forward activation for all nodes in the network, thus suggesting that node 103, which selectively blocks back propagation, is still able to pass forward the activation input from node 101. Node 103 may block the backward propagation to node 101; because backward propagation is an algorithm that calculates the gradient of a loss function with respect to each weight in the network, blocking backward propagation indicates the claimed process of blocking backward gradients.) Before the effective filing date of the invention, it would have been obvious to one of ordinary skill in the art to combine the teaching by Traut of a reinforcement learning algorithm with a shaped reward function comprising a sum combination of multiple sub-goal specific reward functions, one for each of a plurality of different criterion statements, for training a machine learning policy to perform action selection, and the teaching by Hafner of establishing a Q network separated from the policy neural network for each input observation—action pair, wherein the rationale for the action selection is based on the Q value, and using the combination of reward model and Q network to continuously update the Q value at each time step, with the teaching by Baker of a system and method for modifying a machine learning network with a node that blocks backward propagation and passes forward activation.
The motivation to do so is referred to in Baker's disclosure (paragraph 0052, "A cascade of error judgment nodes with associated combining nodes 103, 213 with selective back propagation can be built to achieve a final error rate as low as feasible for any network, so the design and training of individual elements in the cascade may trade off the goal of immediately trying for the lowest possible error rate for other desirable attributes.", and paragraph 0055, "if the back propagation from the combining node 103 to node 101 is blocked, the error judgment node 102 may be trained on development data 114 that has been set aside and is disjoint from the training data 113 that is used in training node 101. In some embodiments, the error judgment node 102 is trained on both the training data 113 and the development data 114, while node 101 is trained only on the training data 113. When trained on the development data, the error judgment node 102 can be trained by the computer system to learn an unbiased estimate of the performance that node 101 will have on new data. Thus, the computer system may train the error judgment node 102 to correct errors caused by node 101 overfitting its training data as well as correcting errors on the training data." Baker discloses the benefit of blocking backward propagation for selective back propagation: achieving a final error rate as low as feasible for any network. Another benefit is allowing the training of one node to correct errors caused by another node overfitting its training data. Therefore, the teaching of Traut/Hafner may incorporate the teaching of Baker into its Q network for further improvement.)
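As an illustrative aside, the gradient-blocking behavior discussed above (forward activation passes, backward gradient is cut) can be sketched without any ML framework. This is a hypothetical sketch of a stop-gradient-style node, not code from Baker or the application; the node functions and values are invented for illustration.

```python
# Illustrative gradient-blocking node: forward pass is the identity on the
# upstream activation, but the backward pass returns a zero gradient, so no
# error signal reaches the upstream node. All names are hypothetical.

def block_gradient(x):
    """Forward: pass activation through. Backward: block the gradient."""
    forward = x
    backward = lambda upstream_grad: 0.0  # backward gradients are blocked
    return forward, backward

def identity(x):
    """Ordinary node for comparison: identity forward, identity backward."""
    return x, lambda upstream_grad: upstream_grad

activation = 3.0
out_plain, back_plain = identity(activation)
out_blocked, back_blocked = block_gradient(activation)
print(out_plain, back_plain(1.0))      # 3.0 1.0: activation and gradient both flow
print(out_blocked, back_blocked(1.0))  # 3.0 0.0: activation flows, gradient is cut
```

In a framework such as PyTorch or JAX this role is typically played by `detach()` or `stop_gradient`; the sketch only shows the asymmetry between the forward and backward passes.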
Traut/Hafner/Baker does not teach the limitation "training each state-action value network on a Bellman loss based on respective reward term;". However, Jacobs teaches this limitation (paragraph 0081, "For training by reinforcement learning in an embodiment of the present invention, the agents use the Q-network described above to estimate the expected return, which is the total reward over the remainder of the episode when selecting a particular node as their next action ... Each piece of experience is defined by a current observation (graph) G, the action a chosen by the agent, the reward r received in response", paragraph 0082, where Jacobs discloses "The Q-network is trained using the Bellman Equation", and paragraph 0083, where Jacobs discloses "The right-hand side of the Bellman Equation serves as the target against which the training loss of the current Q-network is computed". Jacobs discloses that the Q network is trained using the Bellman Equation, with a training loss computed against the Bellman target, indicating a Bellman loss function. Jacobs also discloses that the training of the Q network is based on the reward received in response to an action selection, thereby indicating that the training of the Q network is based on the reward term of each action and that the network is trained via the Bellman loss function to minimize the error as the agent selects an action and receives a reward for it.) Before the effective filing date of the invention, it would have been obvious to one of ordinary skill in the art to combine the teaching of Traut/Hafner/Baker with the teaching by Jacobs of using the Bellman Equation to train the Q network. The motivation to do so is referred to in Jacobs's disclosure (paragraph 0057, where Jacobs discloses "Using a combination of reinforcement learning and mathematical programming to train an agent to generate near-optimal robust solutions to logistics optimization problems".
Jacobs discloses that the combination of mathematical programming, including the Bellman Equation, with reinforcement learning can help train the agent to produce the most optimal solution to a logistics problem. Because both the teaching of Traut/Hafner/Baker and the teaching of Jacobs refer to the training of the Q network, the training of the Q network can thereby incorporate the mathematical programming of the Bellman Equation to produce the most optimal solution to the logistics problem, allowing the agent to make the best action selection.)

Regarding claim 2, which depends on claim 1, the rejection of claim 1 is incorporated. Traut teaches "producing the output comprises producing a reward tradeoff space that plots the plurality of possible actions based on the two or more reward terms" (paragraph 0023, where Traut discloses "the reinforcement machine learning computer system 102 may be trained to predict a shaped reward value that would result from various actions, with regard to many different states in the state space". Traut discloses that the system generates a shaped reward value, which is an output. Producing this output requires a combination of many sub-goal specific reward functions with criterion statements and consideration of various actions with regard to many different states in the state space. The state space is an implication of a reward tradeoff space, as it comprises various actions in different states.)

Regarding claim 3, which depends on claim 2, the rejection of claim 2 is incorporated. Traut teaches "producing the reward tradeoff space comprises plotting possible actions with substantially equal reward based on the compound reward function on the same line" (paragraph 0038, where Traut discloses "a machine learning system may be configured to define a reward surface based on identifying different candidate sets of sub-goals that lead to a similar reward value".
Traut discloses identifying a reward surface for different candidate sets of sub-goals that lead to a similar reward value, wherein the candidate sets of sub-goals are another interpretation of possible actions selected based on the reward value of the compound reward function, and the whole operation is performed based on the configuration of the state space.)

Regarding claim 4, which depends on claim 3, the rejection of claim 3 is incorporated. Traut teaches "further comprising screening out possible actions that have substantially equal reward" (paragraph 0051, where Traut discloses "method 200 may further comprise computer-indexing the experiment log based on the human-readable label(s) associated with one or more sub-goals. For example, the experiment log may be indexed to find all of the instances of successes and/or failures". Traut discloses an experiment log of human-readable labels which can be used to identify all instances of reward successes, as all rewards are considered equal if they all succeed, with each sub-goal representing a possible action.)

Regarding claim 5, which depends on claim 2, the rejection of claim 2 is incorporated. Traut teaches "further comprising screening out possible actions that have substantially similar reward based on a similarity threshold" (paragraph 0052, where Traut discloses "a performance metric for training based on the plurality of training sub-goals. For example, the performance metric may include a frequency of successfully completing a sub-goal during the training experiment". Traut discloses a performance metric that includes a frequency used to determine the success of each sub-goal during training. Each sub-goal corresponds to a possible action.)

Regarding claim 6, which depends on claim 5, the rejection of claim 5 is incorporated.
Traut teaches "the similarity threshold is a predetermined value" (paragraph 0026, wherein Traut discloses "the training specification may be processed to automatically derive one or more performance metrics for a reinforcement machine learning computer system that may be automatically measured during training". Traut discloses that a performance metric is automatically defined based on the training specification, which indicates that it is automatically predetermined for use by the system.)

Regarding claim 7, which depends on claim 5, the rejection of claim 5 is incorporated. Traut teaches "the similarity threshold is specified by a user" (paragraph 0051, where Traut discloses "a training experiment may be configured to log information pertaining to the human-readable label into an experiment log. For example, the logged information may include… performance metrics pertaining to the sub-goal". Traut discloses that the training experiment is configured to log information which needs to be specified by a user, as the labels are human-readable, and the logged information includes performance metrics.)

Regarding claim 8, which depends on claim 5, the rejection of claim 5 is incorporated. Traut teaches "the similarity threshold is based on a number of possible actions" (paragraph 0052, where Traut discloses "a performance metric for training based on the plurality of training sub-goals". Traut discloses that the performance metric for training is based on the plurality of training sub-goals, wherein each sub-goal represents a possible action.)

Regarding claim 11, which depends on claim 1, the rejection of claim 1 is incorporated. Hafner teaches "learning through the learning network and learning through the state action value networks are done at substantially the same time" (paragraph 0048, where Hafner discloses "the training engine 116 can train the Q networks jointly with the policy neural network 110".
Hafner discloses that the training of the Q networks within the ensemble can be done jointly with the policy neural network, thus implying that both networks can be trained at substantially the same time.)

Regarding claim 12, which depends on claim 1, the rejection of claim 1 is incorporated. Traut teaches "the policy is configured to maximize an output of the compound reward function" (paragraph 0034, where Traut discloses "'Optimize' statements 305A may take the form of any suitable syntax for describing an aspect of a training experiment state to be optimized, for example minimization or maximization" and paragraph 0039, where Traut discloses "a machine learning system and/or shaped reward function may be simultaneously optimized so as to derive a suitable compound reward". Traut discloses a machine learning system including a policy with the reward function being optimized to derive a suitable compound reward. The optimization process can be interpreted as a maximization process that maximizes the compound reward as an output.)

Regarding claim 15, the applicant is directed to the rejections to claim 1 set forth above, because claim 15 comprises limitations similar to claim 1, and it is rejected based on the same rationale.

Regarding claim 16, which depends on claim 15, the rejection of claim 15 is incorporated. The applicant is further directed to the rejections to claim 2 set forth above, because claim 16 comprises limitations similar to claim 2, and it is rejected based on the same rationale.

Regarding claim 17, which depends on claim 16, the rejection of claim 16 is incorporated. The applicant is further directed to the rejections to claim 3 set forth above, because claim 17 comprises limitations similar to claim 3, and it is rejected based on the same rationale.

Regarding claim 18, which depends on claim 17, the rejection of claim 17 is incorporated.
The applicant is further directed to the rejections to claim 4 set forth above, because claim 18 comprises limitations similar to claim 4, and it is rejected based on the same rationale.

Regarding claim 19, which depends on claim 16, the rejection of claim 16 is incorporated. The applicant is further directed to the rejections to claim 5 set forth above, because claim 19 comprises limitations similar to claim 5, and it is rejected based on the same rationale.

Regarding claim 20, which depends on claim 15, the rejection of claim 15 is incorporated. The applicant is further directed to the rejections to claim 9 set forth above, because claim 20 comprises limitations similar to claim 9, and it is rejected based on the same rationale.

Regarding claim 22, which depends on claim 15, the rejection of claim 15 is incorporated. Hafner teaches the limitation "The system of claim 15, wherein the state action value networks are configured to take both an action and a state as input" (paragraph 0045, "Each Q network in the ensemble is a neural network that is configured to receive a Q network input that includes (i) an input observation characterizing a state of an environment being interacted with by an agent and (ii) data identifying an action from a set of actions and process the Q network input to generate a Q value for the input observation—action pair." Hafner discloses that each Q network (state action value network) in the ensemble is configured to receive an input comprising an input observation characterizing a state of an environment being interacted with by an agent, which corresponds to the claimed state input, and data identifying an action from a set of actions, which corresponds to the claimed action input.)

Claims 9 and 10 are rejected under 35 U.S.C.
103 as being unpatentable over Traut et al. (US 20210334696 A1) in view of Hafner et al. (US 20210201156 A1), further in view of Baker et al. (US 20220335296 A1), further in view of Jacobs et al. (US 20210295176 A1), and further in view of Warde-Farley et al. (US 11263531 B2).

Regarding claim 9: claim 9 depends on claim 1, so the rejection of claim 1 is incorporated. The combination of Traut/Hafner/Baker/Jacobs teaches "the state action value network" and "the learning network" as explained for claim 1, but does not teach "the state action value networks share a latent embedding representation with the learning network." However, Warde-Farley teaches this limitation (column 7, lines 33-36, where Warde-Farley discloses "The embedding neural network is configured to process an observation characterizing a state of the environment… to generate an embedding of the observation," and column 3, lines 30-32, where Warde-Farley discloses "the embedding neural network and the action selection neural network share one or more parameter values." Warde-Farley discloses an action selection neural network and an embedding neural network. The two networks share parameters, which include an embedded representation of the observation characterizing a state of the environment.)

Before the effective filing date of the invention, it would have been obvious to one of ordinary skill in the art to combine the teaching of the Q network, the policy neural network, and modifying a machine learning network with a node that blocks backward propagation and passes forward activations, by the combination of Traut/Hafner/Baker/Jacobs, with the teaching of the action selection neural network sharing a latent embedding representation with the embedding neural network, by Warde-Farley.
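The parameter sharing Warde-Farley describes, where an embedding trunk processes the observation once and multiple heads read the same latent representation, can be sketched as follows. This is an illustrative sketch, not Warde-Farley's implementation; all names and dimensions are invented for the example.

```python
# Illustrative sketch (not Warde-Farley's implementation; all names and
# dimensions are invented): one embedding trunk processes the
# observation, and both the action-selection head and the value head
# read the same latent embedding, i.e. the shared parameters live in
# the trunk.
import numpy as np

rng = np.random.default_rng(2)
obs_dim, embed_dim, n_actions = 4, 6, 3

W_embed = rng.normal(scale=0.1, size=(obs_dim, embed_dim))     # shared trunk
W_policy = rng.normal(scale=0.1, size=(embed_dim, n_actions))  # head 1
W_value = rng.normal(scale=0.1, size=(embed_dim, n_actions))   # head 2

def embed(obs):
    """Latent embedding of the observation; shared by both heads."""
    return np.tanh(obs @ W_embed)

def policy_logits(obs):
    """Action-selection head reading the shared embedding."""
    return embed(obs) @ W_policy

def q_values(obs):
    """Value head reading the same shared embedding."""
    return embed(obs) @ W_value

obs = np.array([0.1, 0.4, -0.2, 0.3])
print(embed(obs).shape, policy_logits(obs).shape, q_values(obs).shape)
```

Training both heads jointly, as in Warde-Farley's training engine, updates W_embed from both losses, which is what makes the latent representation genuinely shared rather than duplicated.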
The motivation to do so is found in Warde-Farley's disclosure (column 8, lines 35-38, where Warde-Farley discloses "The training engine 116 is configured to jointly train the model parameters 118 of the action selection network 110 and the embedding network 112 based on the interactions of the agent with the environment." Warde-Farley discloses training the action selection network and the embedding network jointly. It can be understood from Warde-Farley's disclosure that the action selection neural network corresponds to the Q network and the embedding neural network corresponds to the policy neural network. Since both of Warde-Farley's networks are analogous to the two networks of the Traut/Hafner combination, and, according to Warde-Farley, they can be trained jointly with a shared embedded representation, the Q network and the learning network of Traut/Hafner can likewise incorporate the shared embedding representation introduced in Warde-Farley.)

Regarding claim 10: claim 10 depends on claim 9, so the rejection of claim 9 is incorporated. The applicant is further directed to the rejection of claim 1 set forth above; because claim 10 comprises limitations similar to claim 1, it is rejected on the same rationale.

Conclusion

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.
In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to DUY TU DIEP, whose telephone number is (703) 756-1738. The examiner can normally be reached M-F 8-4:30.

Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Alexey Shmatov, can be reached at (571) 270-3428. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).
If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/DUY T DIEP/
Examiner, Art Unit 2123

/ALEXEY SHMATOV/
Supervisory Patent Examiner, Art Unit 2123

Prosecution Timeline

Dec 10, 2020
Application Filed
Aug 24, 2022
Response after Non-Final Action
Dec 28, 2023
Non-Final Rejection — §103
Mar 25, 2024
Response Filed
Jun 03, 2024
Final Rejection — §103
Sep 09, 2024
Response after Non-Final Action
Oct 03, 2024
Examiner Interview (Telephonic)
Dec 06, 2024
Request for Continued Examination
Dec 16, 2024
Response after Non-Final Action
Feb 12, 2025
Non-Final Rejection — §103
May 19, 2025
Response Filed
Aug 11, 2025
Final Rejection — §103
Oct 20, 2025
Response after Non-Final Action
Nov 20, 2025
Request for Continued Examination
Nov 30, 2025
Response after Non-Final Action
Jan 05, 2026
Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12579428
METHOD FOR INJECTING HUMAN KNOWLEDGE INTO AI MODELS
2y 5m to grant Granted Mar 17, 2026
Patent 12488223
FEDERATED LEARNING FOR TRAINING MACHINE LEARNING MODELS
2y 5m to grant Granted Dec 02, 2025
Patent 12412129
DISTRIBUTED SUPPORT VECTOR MACHINE PRIVACY-PRESERVING METHOD, SYSTEM, STORAGE MEDIUM AND APPLICATION
2y 5m to grant Granted Sep 09, 2025
Study what changed to get past this examiner. Based on 3 most recent grants.

Prosecution Projections

5-6
Expected OA Rounds
25%
Grant Probability
30%
With Interview (+5.5%)
4y 2m
Median Time to Grant
High
PTA Risk
Based on 20 resolved cases by this examiner. Grant probability derived from career allow rate.
