Prosecution Insights
Last updated: April 19, 2026
Application No. 17/893,288

SYSTEM AND METHOD FOR MACHINE LEARNING ARCHITECTURE WITH MULTIPLE POLICY HEADS

Final Rejection: §101, §103, §112
Filed: Aug 23, 2022
Examiner: GORMLEY, AARON PATRICK
Art Unit: 2148
Tech Center: 2100 (Computer Architecture & Software)
Assignee: Royal Bank of Canada
OA Round: 2 (Final)
Grant Probability: 60% (Moderate)
OA Rounds: 3-4
To Grant: 4y 4m
With Interview: 0%

Examiner Intelligence

Career Allow Rate: 60% (grants 60% of resolved cases; 3 granted / 5 resolved; +5.0% vs TC avg)
Interview Lift: -60.0% (minimal lift; allow rate with vs. without interview, among resolved cases with an interview)
Avg Prosecution: 4y 4m (typical timeline); 30 applications currently pending
Total Applications: 35 (career history, across all art units)
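
As a rough illustration of how these figures relate to one another, here is a minimal sketch. The per-case interview flags and the lift definition (allowance rate among interviewed cases minus the overall career rate) are assumptions made for illustration; only the 3-of-5 grant count and the -60.0% lift come from the panel above.

from dataclasses import dataclass
from typing import List

@dataclass
class ResolvedCase:
    allowed: bool        # True if the application was granted
    had_interview: bool  # True if an examiner interview is of record

def allow_rate(cases: List[ResolvedCase]) -> float:
    return sum(c.allowed for c in cases) / len(cases) if cases else 0.0

def interview_lift(cases: List[ResolvedCase]) -> float:
    # Assumed definition: allowance rate among interviewed cases minus the
    # overall career allowance rate.
    interviewed = [c for c in cases if c.had_interview]
    return allow_rate(interviewed) - allow_rate(cases)

# Hypothetical case mix consistent with the dashboard (3 granted / 5 resolved,
# one interviewed case that was not allowed).
cases = [
    ResolvedCase(allowed=True, had_interview=False),
    ResolvedCase(allowed=True, had_interview=False),
    ResolvedCase(allowed=True, had_interview=False),
    ResolvedCase(allowed=False, had_interview=False),
    ResolvedCase(allowed=False, had_interview=True),
]

print(f"Career allow rate: {allow_rate(cases):.0%}")   # 60%
print(f"Interview lift: {interview_lift(cases):+.1%}") # -60.0%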

Statute-Specific Performance

§101: 30.2% (-9.8% vs TC avg)
§103: 36.0% (-4.0% vs TC avg)
§102: 8.4% (-31.6% vs TC avg)
§112: 21.5% (-18.5% vs TC avg)
Deltas are measured against the Tech Center average estimate • Based on career data from 5 resolved cases
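
Read literally, each delta is the examiner's per-statute rate minus the Tech Center estimate; a minimal sketch of that arithmetic follows. The 40% baseline is back-computed from the deltas shown above (every row implies the same ~40.0% estimate) and is an assumption, not an independently sourced Tech Center figure.

# Sketch of the Statute-Specific Performance deltas, assuming
# delta = examiner_rate - tech_center_estimate.
examiner_rates = {"§101": 0.302, "§103": 0.360, "§102": 0.084, "§112": 0.215}
tc_average_estimate = 0.40  # implied by the deltas in the table above

for statute, rate in examiner_rates.items():
    delta = rate - tc_average_estimate
    print(f"{statute}: {rate:.1%} ({delta:+.1%} vs TC avg)")
# Reproduces, e.g., "§101: 30.2% (-9.8% vs TC avg)"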

Office Action

§101, §103, §112
DETAILED ACTION This action is in response to the application filed 08/23/2022. Claims 1, 3-13, and 15-18 are pending and have been examined. Notice of Pre-AIA or AIA Status The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA . Claim Rejections - 35 USC § 112 The following is a quotation of the first paragraph of 35 U.S.C. 112(a): (a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention. The following is a quotation of the first paragraph of pre-AIA 35 U.S.C. 112: The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention. Claims 1, 3-13, and 15-18 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA ), first paragraph, as failing to comply with the written description requirement. The claim(s) contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention. Regarding claim 1, while the instant specification discloses “providing, to the reinforcement learning neural network, at least one reward corresponding to at least one prior resource task request generated based on outputs of the reinforcement learning neural network” in paragraph [0016] and similar limitations in paragraphs [0004], [0020], and [0091], it fails to disclose “provid[ing], to the plurality of policy heads of the reinforcement learning neural network, a corresponding plurality of rewards corresponding to at least one prior resource task request generated based on outputs of the reinforcement learning neural network”, as recited in claim 1. Thus, claim 1 contains new matter not described in the instant specification. This deficiency is present in similar independent claims 13 and 17, and is inherited by dependent claims 3-12, 15-16, and 18. Regarding claim 18, the instant specification does not disclose “the plurality of sub-goals of the resource task requests are determined based on the state data reflecting the current state of the environment”, as recited in claim 18. Thus, claim 18 contains new matter not described in the instant specification. Claim Rejections - 35 USC § 103 The following is a quotation of 35 U.S.C. 
103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made. This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention. Claim(s) Claims 1, 3-13, and 15-18 are rejected under 35 U.S.C. 103 as being unpatentable over Burhani et al. (TRADE PLATFORM WITH REINFORCEMENT LEARNING, published 12/5/2019, US 2019/0370649 A1), hereafter referred to as Burhani, in view of Van Seijen et al. (SCALABILITY OF REINFORCEMENT LEARNING BY SEPARATION OF CONCERNS, filed 6/26/2017, US 2018/0165602 A1), hereafter referred to as Van Seijen, and further in view of Hernandez et al. (OPPONENT MODELING WITH ASYNCHRONOUS METHODS IN DEEP RL, published 5/7/2020, US 2020/0143208 A1), hereafter referred to as Hernandez. Regarding claim 1, Burhani teaches [a] computer-implemented system for automatic generation of resource task requests, the system comprising: a communication interface; at least one processor; memory in communication with the at least one processor; and software code stored in the memory, which when executed at the at least one processor causes the system to: In accordance with an aspect, there is provided a computer-implemented system for training an automated agent. The system includes a communication interface, at least one processor, memory in communication with the at least one processor, and software code stored in the memory. The software code, when executed at the at least one processor causes the system to: instantiate an automated agent that maintains a reinforcement learning neural network and generates, according to outputs of the reinforcement learning neural network, signals for communicating resource task requests” (Burhani, [0004]). provid[ing], to the reinforcement learning neural network, state data reflective of a current state of an environment in which resource task requests are made: “Reinforcement learning is a category of machine learning that configures agents, such the automated agents 180 described herein, to take actions in an environment to maximize a notion of a reward. The processor 104 is configured with machine executable instructions to instantiate an automated agent 180 that maintains a reinforcement learning neural network 110 … Reward system 126 is configured to receive control the reinforcement learning network 110 to process input data (state data) in order to generate output signals. 
Input data may include trade orders, various feedback data (e.g., rewards), or feature selection data, or data reflective of completed tasks ( e.g., executed trades), data reflective of trading schedules, etc. Output signals may include signals for communicating resource task requests, e.g., a request to trade in a certain security” (Burhani, [0052]). Resource task requests are made in the agent environments. generat[ing] a resource task request signal defining the resource task request and the associated plurality of task request parameters selected by the plurality of policy heads: “The software code, when executed at the at least one processor causes the system to: instantiate an automated agent that maintains a reinforcement learning neural network and generates, according to outputs (task request parameters) of the reinforcement learning neural network, signals for communicating resource task requests” (Burhani, [0004]). While Burhani fails to disclose the further limitations of claim 1, Van Seijen teaches a method, comprising: maintain[ing] a reinforcement learning neural network having an output layer with a plurality of policy heads, each of the policy heads configured to be trained based on a separate reward: “Let the reward function of the environment be R e n v . The target function of the deep network (neural network) can be regularized by splitting the reward function into n reward functions, weighted by w i : PNG media_image1.png 104 550 media_image1.png Greyscale and training a separate reinforcement-learning agent on each of these reward functions” (Van Seijen, [0239]). “Different agents can share multiple lower-level layers of a deep Q-network (reinforcement learning neural network), the collection of agents can be viewed alternatively as a single agent with multiple heads (policy heads), with each head producing the action-values (output) of the current state under a different Q,. A single vector 8 can be used for the parameters of this network. Each head can be associated with a different reward function” (Van Seijen, [0243]). “Options are temporally-extended actions that, like HRA's heads, can be trained in parallel based on their own (intrinsic) reward functions (separate reward[s])” (Van Seijen, [0251]) provid[ing], to the plurality of policy heads of the reinforcement learning neural network, a corresponding plurality of rewards corresponding to at least one prior resource task request generated based on outputs of the reinforcement learning neural network: “Actions a are taken at discrete time steps according to policy π , which maps states to actions. For example, actions a may be taken at discrete time steps t=0, 1, 2, ... according to a policy π ” (Van Seijen, [0074]). At each time step, a task request is made to select an action. “In this example, the goal of the robot 104 is to reach each piece of fruit (resource task) 102 scattered across the possible positions 108 as quickly as possible (e.g., in the fewest possible actions (signals from resource task requests))” (Van Seijen, [0064]). “In an example HRA model, consider a Markov decision process (MDP) that models an agent interacting with an environment at discrete time steps t. It has a state set S, an action set A, transition probability function P : S × A × S → [ 0,1 ] and environment reward function R e n v : S × A → R . At time step t, the agent observes states s t ∈ S , and takes action a t ∈ A (signal from prior task request). 
The agent observes the next state s t + 1 , drawn from the transition probability function P, and a reward r t = R e n v ( s t , a t ) .” (Van Seijen, [0235]). The value of the reward function is based on the previous action performed in response to a prior task request. “Each head can be associated with a different reward function” (Van Seijen, [0243]). obtain[ing] a plurality of task request parameters for a resource task request associated with a resource, each task request parameter selected by a corresponding policy head of the plurality of policy heads provided with the corresponding plurality of rewards, the plurality of task request parameters including a first task request parameter defining a quantity of the resource and a second task request parameter defining a cost of the resource; and generat[ing] a resource task request signal defining the resource task request and the associated plurality of task request parameters selected by the plurality of policy heads: “Different agents can share multiple lower-level layers of a deep Q-network, the collection of agents can be viewed alternatively as a single agent with multiple heads (policy heads), with each head producing the action-values (task request parameters) of the current state under a different Qi” (Van Seijen, [0243]). “An aggregator can generate or select an action (task request signal) to take with respect to the environment. This can be referred to as an environment action (task request signal) and can define a set of all possible actions that can be taken with respect to the environment. Each agent can give its values for the actions (task request parameters) of the current state to an aggregator. In an example, the aggregator can select one of the received actions as the environment action. For example, the aggregator can combine two more received action values into a single action-value for each action (for example, by averaging over all agents). Based on these action-values the current action is selected (for example, by taking the greedy action). In another example, the aggregator combines two or more of the received actions to form the environment action ( e.g., combining two actions with the highest action-values). The actions or action values received from the agents may but need not necessarily correspond to actions that can be taken with respect to the environment. For example, an environment may define possible actions as: ‘Up’ and ‘Down’, but there may be a ‘caution’ agent that, rather than describing an action to take, instead provides a warning, such as ‘Danger: Up’, which may indicate that there is danger associated with the up action. Even though ‘Danger: Up’ is not an action that can be taken with respect to the environment, an agent may nonetheless provide such an output, which can affect the action taken” (Van Seijen, [0234]). “FIG. 1 illustrates an example layout 100 for this introductory example, including three pieces of fruit 102 (resource[s]) and the robot 104 with arrows 106 indicating potential directions of movement within a grid of possible positions 108. In this example, the goal of the robot 104 is to reach each piece of fruit (resource task) 102 scattered across the possible positions 108 as quickly as possible (e.g., in the fewest possible actions (task request signal[s])). 
In reinforcement learning, an agent controlling the robot 104 aims to maximize a return, Gt, which is the expected discounted sum of rewards … The possible actions (task request signal[s]) of the robot 104 include moving in different directions and a "no movement" (i.e., no-op) action” (Van Seijen, [0064]). Burhani and Van Seijen relate to deep reinforcement learning and are analogous to the claimed invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Burhani to generate task request signals from a plurality of policy heads with unique outputs, as disclosed by Van Seijen. Van Seijen’s multi-head framework allows for a smoother overall value function that can be approximated by a lower-dimensional representation, allowing for more effective learning. Additionally, this process reduces the overall state space, increases convergence speed, and reduces computing resources consumed. See Van Seijen, [0010] and [0072]. While Van Seijen fails to disclose the further limitations of claim 1, Hernandez teaches a method, wherein the plurality of task request parameters including a first task request parameter defining a quantity of the resource and a second task request parameter defining a cost of the resource: “Accordingly, outputs (task request parameters) are not limited to game control signals, but rather, can also include control signals for actions taken in relation to automated trading, corporate decision making, among others. These actions can control, … in the context of a stock trading platform, trade actions and their associated parameters (e.g., purchase X shares (quantity) having an order type of Y, along with derivatives Z having a strike price (cost) of W)” (Hernandez, [0015]). The plurality of task request parameters in the example includes four parameters: shares, order type, derivatives, and strike price. The resource here is securities. Hernandez relates to multi-head deep reinforcement learning and is analogous to the claimed invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Burhani and Van Seijen to produce outputs defining quantity and costs of some resource, as disclosed by Hernandez. This allows the network to perform automated trading of securities. See Hernandez, [0012] and [0015]. Regarding claim 3, the rejection of claim 1 in view of Burhani, Van Seijen, and Hernandez is incorporated. Van Seijen further teaches a method, wherein each of the plurality of rewards is associated with a corresponding subgoal of the resource task requests: “FIG. 1 illustrates an example layout 100 for this introductory example, including three pieces of fruit 102 (resource[s]) and the robot 104 with arrows 106 indicating potential directions of movement within a grid of possible positions 108. In this example, the goal of the robot 104 is to reach each piece of fruit (resource task / goal) 102 scattered across the possible positions 108 as quickly as possible (e.g., in the fewest possible actions). In reinforcement learning, an agent controlling the robot 104 aims to maximize a return, Gt, which is the expected discounted sum of rewards … The possible actions of the robot 104 include moving in different directions and a "no movement" (i.e., no-op) action” (Van Seijen, [0070]). The agent’s goal is to reach each piece of fruit. 
“In the experiments, the performance of DQN was compared with HRA. The learning objective for DQN gave a+1 reward for each piece of fruit and used y=0.95. For HRA, the reward function was decomposed into ten different reward functions: one per possible fruit locations (sub-goal[s])” (Van Seijen, [0266]). Each of the plurality of rewards is associated with finding one particular fruit, a subgoal. Van Seijen relates to multi-head deep reinforcement learning and is analogous to the claimed invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Burhani, Van Seijen, and Hernandez to apply a reward to each of the plurality of policy heads, as disclosed by Van Seijen. Van Seijen’s multi-head framework uses a decomposed reward function, where each subfunction learns only a subset of the features. This allows for a smoother overall value function that can be approximated by a lower-dimensional representation, allowing for more effective learning. Additionally, this process reduces the overall state space, increases convergence speed, and reduces computing resources consumed. See Van Seijen, [0010] and [0072]. Regarding claim 4, the rejection of claim 3 in view of Burhani, Van Seijen, and Hernandez is incorporated. Van Seijen further teaches a method, wherein the providing the plurality of rewards includes providing to each of the plurality of policy heads a subset of the plurality of rewards selected for that policy head: “Different agents can share multiple lower-level layers of a deep Q-network, the collection of agents can be viewed alternatively as a single agent with multiple heads (plurality of policy heads), with each head producing the action-values of the current state under a different Qi. A single vector 8 can be used for the parameters of this network. Each head can be associated with a different reward function” (Van Seijen, [0243]). Van Seijen relates to multi-head deep reinforcement learning and is analogous to the claimed invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Burhani, Van Seijen, and Hernandez to apply a reward to each of the plurality of policy heads, as disclosed by Van Seijen. Van Seijen’s multi-head framework uses a decomposed reward function, where each subfunction learns only a subset of the features. This allows for a smoother overall value function that can be approximated by a lower-dimensional representation, allowing for more effective learning. Additionally, this process reduces the overall state space, increases convergence speed, and reduces computing resources consumed. See Van Seijen, [0010] and [0072]. Regarding claim 5, the rejection of claim 1 in view of Burhani, Van Seijen, and Hernandez is incorporated. Burhani further teaches a method, wherein the reinforcement learning neural network is maintained in an automated agent: “The software code, when executed at the at least one processor causes the system to: instantiate an automated agent that maintains a reinforcement learning neural network” (Burhani, [0004]). Regarding claim 6, the rejection of claim 5 in view of Burhani, Van Seijen, and Hernandez is incorporated. 
Burhani further teaches a method, wherein the plurality of outputs includes at least one output defining an action to be taken by the automated agent: “The automated agent 180 generates, according to outputs of its reinforcement learning neural network, signals for communicating resource task requests (action[s] to be taken by the automated agent) for a given resource (e.g., a given security). For example, the automated agent 180 may receive a trade order for a given security as input data and then generate signals for a plurality of resource task requests corresponding to trades for child trade order slices of that security. Such signals may be communicated to a trading venue by way of communication interface 106” (Burhani, [0090]). Regarding claim 7, the rejection of claim 6 in view of Burhani, Van Seijen, and Hernandez is incorporated. Burhani further teaches a method, wherein the plurality of outputs includes at least one output defining a parameter of the action: “Output signals may include signals for communicating resource task requests, e.g., a request to trade in a certain security“ (Burhani, [0052]). A request to trade a security must contain some sort of parameter for it to be a proper request, such as (yes/no). Regarding claim 8, the rejection of claim 1 in view of Burhani, Van Seijen, and Hernandez is incorporated. Van Seijen further teaches a method, wherein the generating includes combining at least two of the plurality of outputs: “A strategy for constructing a learning objective can be to decompose the reward function of the environment into n different reward functions. Each reward function can be assigned to a separate reinforcement learning agent. These agents can learn in parallel on the same sample sequence by using off-policy learning (e.g., using a Horde architecture). An aggregator can generate or select an action to take (task request) with respect to the environment. This can be referred to as an environment action and can define a set of all possible actions that can be taken with respect to the environment. Each agent can give its values for the actions of the current state to an aggregator. In an example, the aggregator can select one of the received actions as the environment action. For example, the aggregator can combine two more received action values (outputs) into a single action value for each action (for example, by averaging over all agents)” (Van Seijen, [0234]); “the aggregator uses the values of all heads to select its action” (Van Seijen, [0251]). Van Seijen relates to multi-head deep reinforcement learning and is analogous to the claimed invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Burhani, Van Seijen, and Hernandez to combine the outputs of policy heads, as disclosed by Van Seijen. Van Seijen’s multi-head framework uses a decomposed reward function, where each head subfunction learns only a subset of the features. This allows for a smoother overall value function that can be approximated by a lower-dimensional representation, allowing for more effective learning. Additionally, this process reduces the overall state space, increases convergence speed, and reduces computing resources consumed. See Van Seijen, [0010] and [0072]. Regarding claim 9, the rejection of claim 8 in view of Burhani, Van Seijen, and Hernandez is incorporated. 
Van Seijen further discloses a method, wherein the output layer is interconnected with a plurality of hidden layers of the reinforcement learning neural network: “The HRA neural network 2420 includes an input layer 2422, one or more hidden layers 2424, and a plurality of heads 2426, each with their own reward function … The heads 2426 inform the output 2428 (e.g., using a linear combination) (output layer)” (Van Seijen, [0262]). Van Seijen relates to multi-head deep reinforcement learning and is analogous to the claimed invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Burhani, Van Seijen, and Hernandez to use a neural network with hidden layers, as disclosed by Van Seijen. Hidden layers are a component of the HRA network described by Van Seijen. HRA has numerous benefits over comparable methods, including greater exploitation of available domain knowledge, advantage estimating random policy, outperforming other methods and human performance, resisting getting stuck in local optima, and learning quickly. See Van Seijen, [0252], [0256], [0268], [0269], [0293], [0298], and [0299]. Regarding claim 10, the rejection of claim 1 in view of Burhani, Van Seijen, and Hernandez is incorporated. Burhani further teaches a method, wherein the resource task request signal encodes a request to trade a security: “Output signals may include signals for communicating resource task requests, e.g., a request to trade in a certain security” (Burhani, [0052]). Regarding claim 12, the rejection of claim 1 in view of Burhani, Van Seijen, and Hernandez is incorporated. Burhani further teaches a method, wherein the environment includes a trading venue: “As depicted, at each time step (t0 , t1 , . .. tn), platform 100 receives task data 300, e.g., directly from a trading venue or indirectly by way of an intermediary. Task data 300 includes data relating to tasks completed in a given time interval” (Burhani, [0069]). Regarding claim 13, Burhani teaches [a] computer computer-implemented method for automatically generating resource task requests, the method comprising: providing, to the reinforcement learning neural network, state data reflective of a current state of an environment in which resource task requests are made: “Reinforcement learning is a category of machine learning that configures agents, such the automated agents 180 described herein, to take actions in an environment to maximize a notion of a reward. The processor 104 is configured with machine executable instructions to instantiate an automated agent 180 that maintains a reinforcement learning neural network 110 … Reward system 126 is configured to receive control the reinforcement learning network 110 to process input data (state data) in order to generate output signals. Input data may include trade orders, various feedback data (e.g., rewards), or feature selection data, or data reflective of completed tasks ( e.g., executed trades), data reflective of trading schedules, etc. Output signals may include signals for communicating resource task requests, e.g., a request to trade in a certain security” (Burhani, [0052]). Resource task requests are made in the agent environments. 
generating a resource task request signal defining the resource task request and the associated plurality of task request parameters selected by the plurality of policy heads: “The software code, when executed at the at least one processor causes the system to: instantiate an automated agent that maintains a reinforcement learning neural network and generates, according to outputs (task request parameters) of the reinforcement learning neural network, signals for communicating resource task requests” (Burhani, [0004]). While Burhani fails to disclose the further limitations of claim 1, Van Seijen teaches a method, comprising: maintaining a reinforcement learning neural network having an output layer with a plurality of policy heads, each of the policy heads configured to be trained based on a separate reward: “Let the reward function of the environment be R e n v . The target function of the deep network (neural network) can be regularized by splitting the reward function into n reward functions, weighted by w i : PNG media_image1.png 104 550 media_image1.png Greyscale and training a separate reinforcement-learning agent on each of these reward functions” (Van Seijen, [0239]). “Different agents can share multiple lower-level layers of a deep Q-network (reinforcement learning neural network), the collection of agents can be viewed alternatively as a single agent with multiple heads (policy heads), with each head producing the action-values (output) of the current state under a different Q,. A single vector 8 can be used for the parameters of this network. Each head can be associated with a different reward function” (Van Seijen, [0243]). “Options are temporally-extended actions that, like HRA's heads, can be trained in parallel based on their own (intrinsic) reward functions (separate reward[s])” (Van Seijen, [0251]) providing, to the plurality of policy heads of the reinforcement learning neural network, a corresponding plurality of rewards corresponding to at least one prior resource task request generated based on outputs of the reinforcement learning neural network: “Actions a are taken at discrete time steps according to policy π , which maps states to actions. For example, actions a may be taken at discrete time steps t=0, 1, 2, ... according to a policy π ” (Van Seijen, [0074]). At each time step, a task request is made to select an action. “In this example, the goal of the robot 104 is to reach each piece of fruit (resource task) 102 scattered across the possible positions 108 as quickly as possible (e.g., in the fewest possible actions (signals from resource task requests))” (Van Seijen, [0064]). “In an example HRA model, consider a Markov decision process (MDP) that models an agent interacting with an environment at discrete time steps t. It has a state set S, an action set A, transition probability function P : S × A × S → [ 0,1 ] and environment reward function R e n v : S × A → R . At time step t, the agent observes states s t ∈ S , and takes action a t ∈ A (signal from prior task request). The agent observes the next state s t + 1 , drawn from the transition probability function P, and a reward r t = R e n v ( s t , a t ) .” (Van Seijen, [0235]). The value of the reward function is based on the previous action performed in response to a prior task request. “Each head can be associated with a different reward function” (Van Seijen, [0243]). 
obtaining a plurality of task request parameters for a resource task request associated with a resource, each task request parameter selected by a corresponding policy head of the plurality of policy heads provided with the corresponding plurality of rewards, the plurality of task request parameters including a first task request parameter defining a quantity of the resource and a second task request parameter defining a cost of the resource; and generating a resource task request signal defining the resource task request and the associated plurality of task request parameters selected by the plurality of policy heads: “Different agents can share multiple lower-level layers of a deep Q-network, the collection of agents can be viewed alternatively as a single agent with multiple heads (policy heads), with each head producing the action-values (task request parameters) of the current state under a different Qi” (Van Seijen, [0243]). “An aggregator can generate or select an action (task request signal) to take with respect to the environment. This can be referred to as an environment action (task request signal) and can define a set of all possible actions that can be taken with respect to the environment. Each agent can give its values for the actions (task request parameters) of the current state to an aggregator. In an example, the aggregator can select one of the received actions as the environment action. For example, the aggregator can combine two more received action values into a single action-value for each action (for example, by averaging over all agents). Based on these action-values the current action is selected (for example, by taking the greedy action). In another example, the aggregator combines two or more of the received actions to form the environment action ( e.g., combining two actions with the highest action-values). The actions or action values received from the agents may but need not necessarily correspond to actions that can be taken with respect to the environment. For example, an environment may define possible actions as: ‘Up’ and ‘Down’, but there may be a ‘caution’ agent that, rather than describing an action to take, instead provides a warning, such as ‘Danger: Up’, which may indicate that there is danger associated with the up action. Even though ‘Danger: Up’ is not an action that can be taken with respect to the environment, an agent may nonetheless provide such an output, which can affect the action taken” (Van Seijen, [0234]). “FIG. 1 illustrates an example layout 100 for this introductory example, including three pieces of fruit 102 (resource[s]) and the robot 104 with arrows 106 indicating potential directions of movement within a grid of possible positions 108. In this example, the goal of the robot 104 is to reach each piece of fruit (resource task) 102 scattered across the possible positions 108 as quickly as possible (e.g., in the fewest possible actions (task request signal[s])). In reinforcement learning, an agent controlling the robot 104 aims to maximize a return, Gt, which is the expected discounted sum of rewards … The possible actions (task request signal[s]) of the robot 104 include moving in different directions and a "no movement" (i.e., no-op) action” (Van Seijen, [0064]). Burhani and Van Seijen relate to deep reinforcement learning and are analogous to the claimed invention. 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Burhani to generate task request signals from a plurality of policy heads with unique outputs, as disclosed by Van Seijen. Van Seijen’s multi-head framework allows for a smoother overall value function that can be approximated by a lower-dimensional representation, allowing for more effective learning. Additionally, this process reduces the overall state space, increases convergence speed, and reduces computing resources consumed. See Van Seijen, [0010] and [0072]. While Van Seijen fails to disclose the further limitations of claim 1, Hernandez teaches a method, wherein the plurality of task request parameters including a first task request parameter defining a quantity of the resource and a second task request parameter defining a cost of the resource: “Accordingly, outputs (task request parameters) are not limited to game control signals, but rather, can also include control signals for actions taken in relation to automated trading, corporate decision making, among others. These actions can control, … in the context of a stock trading platform, trade actions and their associated parameters (e.g., purchase X shares (quantity) having an order type of Y, along with derivatives Z having a strike price (cost) of W)” (Hernandez, [0015]). The plurality of task request parameters in the example includes four parameters: shares, order type, derivatives, and strike price. The resource here is securities. Hernandez relates to multi-head deep reinforcement learning and is analogous to the claimed invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Burhani and Van Seijen to produce outputs defining quantity and costs of some resource, as disclosed by Hernandez. This allows the network to perform automated trading of securities. See Hernandez, [0012] and [0015]. All limitations of claims 15-16 are disclosed by claims 3-4. Thus, the analysis of claims 15-16 mirrors that of claims 3-4 and claims 15-16 are rejected under the same rationale used for claims 3-4. Regarding claim 17, Burhani teaches [a] non-transitory computer-readable storage medium storing instructions which when executed adapt at least one computing device to [perform operations]: “Throughout the foregoing discussion, numerous references will be made regarding servers, services, interfaces, portals, platforms, or other systems formed from computing devices. It should be appreciated that the use of such terms is deemed to represent one or more computing devices having at least one processor configured to execute software instructions stored on a computer readable tangible, non-transitory medium. For example, a server can include one or more computers operating as a web server, database server, or other type of computer server in a manner to fulfill described roles, responsibilities, or functions” (Burhani, [0193]). Burhani further teaches a method, comprising: provid[ing], to the reinforcement learning neural network, state data reflective of a current state of an environment in which resource task requests are made: “Reinforcement learning is a category of machine learning that configures agents, such the automated agents 180 described herein, to take actions in an environment to maximize a notion of a reward. 
The processor 104 is configured with machine executable instructions to instantiate an automated agent 180 that maintains a reinforcement learning neural network 110 … Reward system 126 is configured to receive control the reinforcement learning network 110 to process input data (state data) in order to generate output signals. Input data may include trade orders, various feedback data (e.g., rewards), or feature selection data, or data reflective of completed tasks ( e.g., executed trades), data reflective of trading schedules, etc. Output signals may include signals for communicating resource task requests, e.g., a request to trade in a certain security” (Burhani, [0052]). Resource task requests are made in the agent environments. generat[ing] a resource task request signal defining the resource task request and the associated plurality of task request parameters selected by the plurality of policy heads: “The software code, when executed at the at least one processor causes the system to: instantiate an automated agent that maintains a reinforcement learning neural network and generates, according to outputs (task request parameters) of the reinforcement learning neural network, signals for communicating resource task requests” (Burhani, [0004]). While Burhani fails to disclose the further limitations of claim 1, Van Seijen teaches a method, comprising: maintain[ing] a reinforcement learning neural network having an output layer with a plurality of policy heads, each of the policy heads configured to be trained based on a separate reward: “Let the reward function of the environment be R e n v . The target function of the deep network (neural network) can be regularized by splitting the reward function into n reward functions, weighted by w i : PNG media_image1.png 104 550 media_image1.png Greyscale and training a separate reinforcement-learning agent on each of these reward functions” (Van Seijen, [0239]). “Different agents can share multiple lower-level layers of a deep Q-network (reinforcement learning neural network), the collection of agents can be viewed alternatively as a single agent with multiple heads (policy heads), with each head producing the action-values (output) of the current state under a different Q,. A single vector 8 can be used for the parameters of this network. Each head can be associated with a different reward function” (Van Seijen, [0243]). “Options are temporally-extended actions that, like HRA's heads, can be trained in parallel based on their own (intrinsic) reward functions (separate reward[s])” (Van Seijen, [0251]) provid[ing], to the plurality of policy heads of the reinforcement learning neural network, a corresponding plurality of rewards corresponding to at least one prior resource task request generated based on outputs of the reinforcement learning neural network: “Actions a are taken at discrete time steps according to policy π , which maps states to actions. For example, actions a may be taken at discrete time steps t=0, 1, 2, ... according to a policy π ” (Van Seijen, [0074]). At each time step, a task request is made to select an action. “In this example, the goal of the robot 104 is to reach each piece of fruit (resource task) 102 scattered across the possible positions 108 as quickly as possible (e.g., in the fewest possible actions (signals from resource task requests))” (Van Seijen, [0064]). “In an example HRA model, consider a Markov decision process (MDP) that models an agent interacting with an environment at discrete time steps t. 
It has a state set S, an action set A, transition probability function P : S × A × S → [ 0,1 ] and environment reward function R e n v : S × A → R . At time step t, the agent observes states s t ∈ S , and takes action a t ∈ A (signal from prior task request). The agent observes the next state s t + 1 , drawn from the transition probability function P, and a reward r t = R e n v ( s t , a t ) .” (Van Seijen, [0235]). The value of the reward function is based on the previous action performed in response to a prior task request. “Each head can be associated with a different reward function” (Van Seijen, [0243]). obtain[ing] a plurality of task request parameters for a resource task request associated with a resource, each task request parameter selected by a corresponding policy head of the plurality of policy heads provided with the corresponding plurality of rewards, the plurality of task request parameters including a first task request parameter defining a quantity of the resource and a second task request parameter defining a cost of the resource; and generat[ing] a resource task request signal defining the resource task request and the associated plurality of task request parameters selected by the plurality of policy heads: “Different agents can share multiple lower-level layers of a deep Q-network, the collection of agents can be viewed alternatively as a single agent with multiple heads (policy heads), with each head producing the action-values (task request parameters) of the current state under a different Qi” (Van Seijen, [0243]). “An aggregator can generate or select an action (task request signal) to take with respect to the environment. This can be referred to as an environment action (task request signal) and can define a set of all possible actions that can be taken with respect to the environment. Each agent can give its values for the actions (task request parameters) of the current state to an aggregator. In an example, the aggregator can select one of the received actions as the environment action. For example, the aggregator can combine two more received action values into a single action-value for each action (for example, by averaging over all agents). Based on these action-values the current action is selected (for example, by taking the greedy action). In another example, the aggregator combines two or more of the received actions to form the environment action ( e.g., combining two actions with the highest action-values). The actions or action values received from the agents may but need not necessarily correspond to actions that can be taken with respect to the environment. For example, an environment may define possible actions as: ‘Up’ and ‘Down’, but there may be a ‘caution’ agent that, rather than describing an action to take, instead provides a warning, such as ‘Danger: Up’, which may indicate that there is danger associated with the up action. Even though ‘Danger: Up’ is not an action that can be taken with respect to the environment, an agent may nonetheless provide such an output, which can affect the action taken” (Van Seijen, [0234]). “FIG. 1 illustrates an example layout 100 for this introductory example, including three pieces of fruit 102 (resource[s]) and the robot 104 with arrows 106 indicating potential directions of movement within a grid of possible positions 108. 
In this example, the goal of the robot 104 is to reach each piece of fruit (resource task) 102 scattered across the possible positions 108 as quickly as possible (e.g., in the fewest possible actions (task request signal[s])). In reinforcement learning, an agent controlling the robot 104 aims to maximize a return, Gt, which is the expected discounted sum of rewards … The possible actions (task request signal[s]) of the robot 104 include moving in different directions and a "no movement" (i.e., no-op) action” (Van Seijen, [0064]). Burhani and Van Seijen relate to deep reinforcement learning and are analogous to the claimed invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Burhani to generate task request signals from a plurality of policy heads with unique outputs, as disclosed by Van Seijen. Van Seijen’s multi-head framework allows for a smoother overall value function that can be approximated by a lower-dimensional representation, allowing for more effective learning. Additionally, this process reduces the overall state space, increases convergence speed, and reduces computing resources consumed. See Van Seijen, [0010] and [0072]. While Van Seijen fails to disclose the further limitations of claim 1, Hernandez teaches a method, wherein the plurality of task request parameters including a first task request parameter defining a quantity of the resource and a second task request parameter defining a cost of the resource: “Accordingly, outputs (task request parameters) are not limited to game control signals, but rather, can also include control signals for actions taken in relation to automated trading, corporate decision making, among others. These actions can control, … in the context of a stock trading platform, trade actions and their associated parameters (e.g., purchase X shares (quantity) having an order type of Y, along with derivatives Z having a strike price (cost) of W)” (Hernandez, [0015]). The plurality of task request parameters in the example includes four parameters: shares, order type, derivatives, and strike price. The resource here is securities. Hernandez relates to multi-head deep reinforcement learning and is analogous to the claimed invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Burhani and Van Seijen to produce outputs defining quantity and costs of some resource, as disclosed by Hernandez. This allows the network to perform automated trading of securities. See Hernandez, [0012] and [0015]. Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Burhani et al. (TRADE PLATFORM WITH REINFORCEMENT LEARNING, published 12/5/2019, US 2019/0370649 A1), hereafter referred to as Burhani, in view of Van Seijen et al. (SCALABILITY OF REINFORCEMENT LEARNING BY SEPARATION OF CONCERNS, filed 6/26/2017, US 2018/0165602 A1), hereafter referred to as Van Seijen, and further in view of Hernandez et al. (OPPONENT MODELING WITH ASYNCHRONOUS METHODS IN DEEP RL, published 5/7/2020, US 2020/0143208 A1), hereafter referred to as Hernandez, and Baldacci et al. (Market making and incentives design in the presence of a dark pool: a deep reinforcement learning approach, 2019, arXiv:1912.01129v), hereafter referred to as Baldacci. Regarding claim 11, the rejection of claim 10 in view of Burhani, Van Seijen, and Hernandez is incorporated. 
While Burhani, Van Seijen, and Hernandez fail to disclose the further limitations of the claim, Baldacci discloses a method, wherein the plurality of outputs includes at least one output indicating whether the request to trade a security should be made in a lit pool or a dark pool: “To our knowledge, most of studies treat the issue of trading in dark pools mainly from the point of view of optimal liquidation: a trader wishing to buy or sell a large number of shares of one or several stocks (securit[ies]) and needing to find an optimal order placement strategy between the lit and dark pools, see for example [15]. In this paper, we rather focus on the behavior of a market maker, acting on both lit and dark venues” (Baldacci, page 2, paragraph 4). “We now turn to the description of our numerical method to solve (4.17), the optimization procedure consists of two stages. At the first stage, we optimize the controls of the market maker for all possible values of the incentives given by the exchange. At the second stage, we use an actor-critic approach, to obtain both the optimal controls and the value function of the exchange” (Baldacci, page 15, paragraph 2) “The first step to tackle our principal-agent problem is to find optimal volumes L * ” (Baldacci, page 15, paragraph 5) “We approximate the best response function L * by a neural network l [ ω l ] … The neural network l [ ω l ] takes as inputs principal’s incentives and the market maker’s current inventory” (Baldacci, page 16, paragraph 3) “We now move to the problem of the principal” (Baldacci, page 18, paragraph 2) “We use an algorithm known in reinforcement learning literature as the actor-critic method. The core of this approach is the representation of the value function and optimal controls with deep neural networks. The learning procedure itself consists of two stages: value function update (also called critic update) and controls update (actor update)” (Baldacci, page 18, paragraph 5) “5.2 Numerical Results” (Baldacci, page 20, paragraph 1) PNG media_image2.png 663 907 media_image2.png Greyscale (Baldacci, page 20, Figure 3). This figure shows optimal bid / ask volumes for lit / dark pools as an output of this model. “One can see that the market maker splits his orders equitably between the lit and dark pools when his inventory is near zero. However, when he has a very positive (resp. negative) inventory, he has a large imbalance on the ask (resp. bid) side of the lit pool, to liquidate his position in the dark pool” (Baldacci, page 20, paragraph 4) Examiner’s note: This model determines how many trades should be conducted in a lit vs. a dark pool. Baldacci relates to deep reinforcement learning for market trading and is analogous to the claimed invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Burhani, Van Seijen, and Hernandez to output information indicating whether a trade should be conducted in a lit or dark pool, as disclosed by Baldacci. Dark pools have gained a significant market share over traditional lit pools, with many major exchanges having both dark pools and lit pools. Baldacci’s method efficiently approximates both the optimal controls of a market maker trying to buy and sell in both pools, and the optimal incentives of the dual-pool exchange, determining the optimal trading volume for each type of pool for the market maker. 
See Baldacci, Abstract; page 2, paragraph 2; page 2, paragraph 4; page 3 paragraph 2. Regarding claim 18, the rejection of claim 3 in view of Burhani, Van Seijen, and Hernandez is incorporated. Van Seijen further teaches a method, wherein the plurality of sub-goals of the resource task requests are determined based on the state data reflecting the current state of the environment: “For HRA, the reward function was decomposed into ten different reward functions: one per possible fruit locations (sub-goal[s]). The network included an input layer of length 110, encoding the agent's position (state data) and whether there is a piece of fruit on each location (state data)” (Van Seijen, [0266]) Van Seijen relates to multi-head deep reinforcement learning and is analogous to the claimed invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Burhani, Van Seijen, and Hernandez to determine sub-goals based on current state data, as disclosed by Van Seijen. Van Seijen’s multi-head framework uses a decomposed reward function, where each subfunction learns only a subset of the features. By earning different rewards for traveling to empty spaces versus collecting fruits, the network can constantly learn via backpropagation. Van Seijen’s system allows for a smoother overall value function that can be approximated by a lower-dimensional representation, allowing for more effective learning. Additionally, this process reduces the overall state space, increases convergence speed, and reduces computing resources consumed. See Van Seijen, [0010], [0072], and [0264]. Response to Arguments The following responses address arguments and remarks made in the instant remarks dated 10/29/2025. Objections In light of the instant amendments, objections to the specification have been withdrawn. 101 Rejections On pages 8-9 of the instant remarks, the Applicant argues that generating a resource task request signal, as claimed, cannot be performed as a mental process: “Step 2A Prong 1: The claims do not recite a judicial exception The Office Action states that the limitations of independent claims 1, 13 and 17 can be performed as a mental process. In particular, the Office Action states that the limitation "generate a resource task request signal based on the plurality of outputs from the plurality of policy heads" previously recited in claims 1, 13 and 17 recites a judicial exception. Applicant disagrees. As recited in claim 1, the resource task request signal generated defines the resource task request and the associated plurality of task request parameters that are selected by policy heads trained based on separate rewards. Applicant submits that the generation of the resource task request signal cannot be performed as a mental process since it involves a specific reinforcement learning architecture and the obtention of task request parameters from policy heads trained using rewards. Applicant further submits that the human mind is not equipped to maintain a reinforcement learning neural network that includes a plurality of policy heads that are each configured to be trained based on a separate reward, provide to the policy heads rewards corresponding to at least one prior resource task request generated based on output of the reinforcement learning neural network, or obtain task request parameters for a resource task request associated with a resource. 
The Applicant submits that since the claims contain limitations that cannot be practically performed in the human mind, the claims are not directed to a judicial exception. The Applicant draws the Office's attention to the August 4, 2025 Memorandum by Deputy Commissioner for Patents Charles Kim (the "August 4 memo"). The August 4 memo states "a claim does not recite a mental process when it contains limitation(s) that cannot practically be performed in the human mind, for instance when the human mind is not equipped to perform the claim limitation(s). The mental process grouping is not without limits. Examiners are reminded not to expand this grouping in a manner that encompasses claim limitations that cannot practically be performed in the human mind". Accordingly, the Applicant submits that the claims are not directed to a judicial exception.” The Applicant’s arguments above have been fully considered and are persuasive. Consequently, rejections under 35 U.S.C. 101 for claims 1, 13, and 17 have been withdrawn. As none of the dependent claims recite further judicial exceptions, their rejections under 35 U.S.C. 101 are withdrawn as well. On page 9 of the instant remarks, the Applicant argues that the claims are integrated into a practical application through an improvement on technology: “ PNG media_image3.png 548 878 media_image3.png Greyscale ” The Applicant’s arguments above are moot in view of resource task request signal generation no longer being classified as a mental process, and consequently all rejections under 35 U.S.C. 101 being withdrawn. On pages 9-10 of the instant remarks, the Applicant argues that the claims amount to significantly more than the recited judicial exception: “Step 2B: The claims amount to significantly more than the alleged exception As stated above, the Applicant submits that the claims do not recite a judicial exception. However, in the case the Office disagrees, the Applicant submits that the claims amount to significantly more than the alleged exception. Claim 1 recites specific steps to provide, to the plurality of policy heads of the reinforcement learning neural network, a corresponding plurality of rewards corresponding to at least one prior resource task request generated based on outputs of the reinforcement learning neural network; provide, to the reinforcement learning neural network, state data reflective of a current state of an environment in which resource task requests are made; obtain a plurality of task request parameters for a resource task request associated with a resource, each task request parameter selected by a corresponding policy head of the plurality of policy heads provided with the corresponding plurality of rewards, the plurality of outputs including a first output the plurality of task request parameters including a first task request parameter defining a quantity of the resource and a second task request parameter output defining a cost of the resource; and generate a resource task request signal defining the resource task request based on and the associated plurality of outputs task request parameters selected by the plurality of policy heads. The Applicant submits that these limitations constrain the scope of the claims, since they require a specific sequence of steps, such that the claims amount to significantly more than the alleged exception. For example, the claims recite that the generation of a resource task request signal defines the resource task request and the associated plurality of task request parameters. 
As recited, the task request parameters are obtained using policy heads, which are provided with rewards. Accordingly, the claims involve a specific sequence of steps for the resource task request signal to be generated. The Applicant further submits that since the claimed subject matter provides an improvement to a technology, the claimed subject matter amounts to significantly more than the alleged judicial exception. The Applicant further notes the August 4 Memo's reminders: "Examiners are reminded that if it is a "close call" as to whether a claim is eligible, they should only make a rejection when it is more likely than not (i.e., more than 50%) that the claim is ineligible under 35 U.S.C. 101. A rejection of a claim should not be made simply because an examiner is uncertain as to the claim's eligibility.”

The Applicant’s arguments above are moot in view of resource task request signal generation no longer being classified as a mental process, and consequently all rejections under 35 U.S.C. 101 being withdrawn.

103 Rejections

On pages 10-11 of the instant remarks, the Applicant argues that Burhani, Van Seijen, and Hernandez do not disclose all features of the independent claims:

“Applicant has amended independent claims 1, 13 and 17 as shown herein. Applicant submits that it is not clear how Burhani or Hernandez taken alone, or in combination, discloses the features of "obtain a plurality of ask request parameters for a resource task request associated with a resource, each task request parameter selected by a corresponding policy head of the plurality of policy heads provided with the corresponding plurality of rewards, the plurality of outputs including a first output the plurality of task request parameters including a first task request parameter defining a quantity of the resource and a second task request parameter output defining a cost of the resource; and generate a resource task request signal defining the resource task request and the associated plurality of outputs task request parameters selected by the plurality of policy heads," recited in claims 1, 13 and 17.

Applicant further submits that Van Seijen does not disclose at least these features of independent claims 1, 13 and 17. Van Seijen is directed to decomposing a single-agent learning problem into simpler problems addressed by multiple agents and aggregating, using an aggregator, the different solutions from the multiple agents to determine an action to take with respect to an environment (see Abstract). As shown in FIG. 2 of Van Seijen, reproduced below, for each agent 1, 2, an environment action set B1, B2 and a communication action-set C1, C2 are defined. Environment actions for a robot can be, for example, "up", "down", "left", "right". An action-mapping function f: B1 x B2 -> A is then used to determine an action A for the agent (see paragraph [0075]). The action mapping function can be an aggregator that selects one of the received actions (B1, B2) or combines the received actions as the environment action A. In other words, Van Seijen discloses determining a single action for an agent by selecting or combining outputs from each sub-agent (sub-solutions).

[FIG. 2 of Van Seijen was reproduced as an image (media_image4.png) in the original remarks.]

The Applicant submits that the environment action sets B1, B2 cannot be analogized to task request parameters of a resource task request since they constitute sets of actions from which a final environment action can be selected or generated, rather than parameters of a same resource task request.
Further, as recited in claim 1, the resource task request signal generated defines the resource task request and the associated task request parameters. In contrast, the environment action generated by Van Seijen only defines the action to be taken by the agent.”

In response to the Applicant's argument that Burhani and Hernandez fail to disclose limitations of the amended independent claims, the Examiner notes that one cannot show nonobviousness by attacking references individually where the rejections are based on combinations of references. See In re Keller, 642 F.2d 413, 208 USPQ 871 (CCPA 1981); In re Merck & Co., 800 F.2d 1091, 231 USPQ 375 (Fed. Cir. 1986).

Van Seijen discloses a reinforcement learning network where task request actions (resource task request parameters) are selected by a plurality of policy heads (Van Seijen, [0243]) that are rewarded with individual reward functions. Van Seijen further discloses generating an agent action (resource task request signal) defining the request and output parameters (Van Seijen, [0239]). While Van Seijen fails to disclose parameters defining quantity or cost of a resource, this deficiency is remedied by Hernandez, which discloses a reinforcement learning system where network outputs (task request parameters) can define quantities of a resource and costs of a resource (Hernandez, [0015]). It would have been obvious for one of ordinary skill in the art to produce outputs defining resource quantity and cost in the system disclosed by Burhani and Van Seijen, as this would allow the network to perform automated trading of securities, as noted by Hernandez ([0012], [0015]). Thus, no rejections are withdrawn on these grounds.

Regarding the argument that Van Seijen fails to disclose particular features of the claims, the Examiner respectfully disagrees. Intermediate actions produced by each head of the network (environment action sets included), which are used as input for an aggregated action function, are parameters of that aggregate function (Van Seijen, [0081]). In the case of Van Seijen, the aggregated action output is an agent control action (signal) in the service of the action task requested at that time step (task request) (Van Seijen, [0074], [0081]). By selecting an agent action to perform, the task request is being defined with a concrete, performable action. Van Seijen makes it clear that task requests (and by extension, actions performed in response to them) can be directed toward acquiring resources, such as collecting pellets in Pac-man or collecting fruit (Van Seijen, [0140], [0266]). The combination of Burhani, Van Seijen, and Hernandez discloses the system of amended claim 1, commensurate in scope with its limitations. Similar arguments apply to substantially similar independent claims 13 and 17. See the 103 rejections section for more detail. Accordingly, no rejections are withdrawn on this basis.

On page 11 of the instant remarks, the Applicant argues that Baldacci does not disclose all limitations of the claimed invention:

“For completeness, the Applicant submits that Baldacci does not disclose the features noted above. Baldacci describes a market maker acting in the lit and dark pools of an exchange.
Baldacci does not disclose a neural network having an output layer with a plurality of policy heads and accordingly cannot disclose obtaining task request parameters for a resource task request associated with a resource, each selected by a corresponding policy head, and generating a resource task request signal defining the resource task request based on the task request parameters selected by the policy heads.”

In response to the Applicant's argument that Baldacci fails to disclose limitations of claim 11, the Examiner notes that one cannot show nonobviousness by attacking references individually where the rejections are based on combinations of references. See In re Keller, 642 F.2d 413, 208 USPQ 871 (CCPA 1981); In re Merck & Co., 800 F.2d 1091, 231 USPQ 375 (Fed. Cir. 1986).

As discussed regarding parent claim 1 and in the arguments above, Van Seijen discloses a neural network with an output layer containing a plurality of policy heads (Van Seijen, [0243]), as well as obtaining resource task parameters selected by corresponding policy heads (Van Seijen, [0234], [0243]) for a resource task request (Van Seijen, [0064]). Van Seijen further discloses generating a resource task request signal, based on the task request parameters, that defines the resource task request (Van Seijen, [0234]). While Van Seijen fails to disclose outputs indicating whether to trade securities in lit or dark pools, this limitation is disclosed by Baldacci (Baldacci, page 20, Figure 3 & paragraph 4). It would have been obvious for one of ordinary skill in the art to modify the system disclosed by the combination of Burhani, Van Seijen, and Hernandez to output information indicating whether a trade should be conducted in a lit or dark pool, as dark pools have a significant market share relative to lit pools, and many security exchanges operate both dark and lit pools. These outputs can be used to efficiently approximate optimal controls of market makers trying to buy/sell in both pools and controls of exchanges determining optimal volumes for each, as noted by Baldacci, Abstract; page 2, paragraph 2; page 2, paragraph 4; and page 3, paragraph 2. See the 103 rejections section for more detail. Thus, no rejections are withdrawn on these grounds.

On page 11 of the instant remarks, the Applicant argues that the dependent claims are non-obvious at least by virtue of their dependence:

“Accordingly, Applicant submits that the cited references, whether considered alone or in combination, do not disclose the features of independent claims 1, 13 and 17. Further the Applicant submits that the dependent claims are also novel and non-obvious over the cited references at least by virtue of their dependencies on independent claims 1 and 13.”

As noted in the arguments above and in the 103 rejections section, the independent claims are still found to be obvious over the prior art. Thus, no rejections for dependent claims are withdrawn on this basis.

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:

Laruelle et al. (Optimal split of orders across liquidity pools: a stochastic algorithm approach, 2010, arXiv:0910.1166v3) teaches a method of determining an optimal way to split trades across lit and dark pools.

Kartal et al. (SYSTEM AND METHOD FOR DEEP REINFORCEMENT LEARNING, published 5/7/2020, US 2020/0143206 A1) teaches a method of using a parallelized asynchronous multi-head reinforcement learning model.

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.
Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Aaron P Gormley, whose telephone number is (571) 272-1372. The examiner can normally be reached Monday - Friday, 12:00 PM - 8:00 PM EST.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michelle T Bechtold, can be reached at (571) 431-0762. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/AG/
Examiner, Art Unit 2148

/MICHELLE T BECHTOLD/
Supervisory Patent Examiner, Art Unit 2148
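Technical Context

The Van Seijen passage cited above for claim 18 describes a hybrid reward architecture: the environment reward is decomposed into one reward function per fruit location (sub-goal), and each head learns from its own reward while sharing a state encoding of the agent's position and per-location fruit flags. The following is a minimal sketch of that idea, not the cited reference's implementation; the grid size, network sizes, and names such as HRANetwork and decompose_reward are illustrative assumptions.

```python
# Minimal sketch of a hybrid-reward, multi-head value network in the spirit of
# the Van Seijen passage cited for claim 18: one head per fruit location
# (sub-goal), each trained on its own decomposed reward, all sharing one
# state encoding. Names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

N_CELLS = 10              # toy grid positions (the reference uses a length-110 encoding)
N_ACTIONS = 2             # move left / right
STATE_DIM = 2 * N_CELLS   # one-hot agent position + per-cell fruit flags

class HRANetwork(nn.Module):
    def __init__(self, n_heads: int = N_CELLS):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU())
        # One Q-value head per sub-goal (fruit location).
        self.heads = nn.ModuleList([nn.Linear(64, N_ACTIONS) for _ in range(n_heads)])

    def forward(self, state: torch.Tensor) -> list:
        z = self.trunk(state)
        return [head(z) for head in self.heads]   # per-head Q-values

def decompose_reward(agent_pos: int, fruit: list) -> list:
    """One reward signal per sub-goal: head k is rewarded only when the
    agent reaches cell k while fruit is present there."""
    return [1.0 if (agent_pos == k and fruit[k]) else 0.0 for k in range(N_CELLS)]

# Per-head TD-style update against the per-head rewards (single transition shown).
net = HRANetwork()
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
state = torch.randn(1, STATE_DIM)        # stand-in for the encoded current state
next_state = torch.randn(1, STATE_DIM)   # stand-in for the encoded next state
action = 1
rewards = decompose_reward(agent_pos=3, fruit=[i % 2 == 0 for i in range(N_CELLS)])

q_now = net(state)
with torch.no_grad():
    q_next = net(next_state)
loss = sum(
    (q_now[k][0, action] - (rewards[k] + 0.99 * q_next[k].max())) ** 2
    for k in range(N_CELLS)
)
opt.zero_grad()
loss.backward()
opt.step()
```

Because each head only has to predict its own narrow reward signal, each sub-value function is smoother and lower-dimensional than the combined value function, which is the convergence and resource argument the office action attributes to Van Seijen.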
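The applicant's summary of Van Seijen FIG. 2 turns on the action-mapping function f: B1 x B2 -> A, which either selects one of the sub-agents' proposed actions or combines them into a single environment action. Below is a minimal sketch of that aggregation step under stated assumptions (a discrete action set shared by both sub-agents, and combination by summing per-action scores); the function and variable names are illustrative, not taken from the reference.

```python
# Sketch of the aggregation step described for Van Seijen FIG. 2: each
# sub-agent proposes an action (or per-action scores), and an action-mapping
# function f: B1 x B2 -> A turns the proposals into one environment action.
# Assumes a shared discrete action set; names are illustrative.
from typing import Callable, Dict, List

ACTIONS = ["up", "down", "left", "right"]
Scores = Dict[str, float]

def select_best(scores: Scores) -> str:
    return max(scores, key=scores.get)

def f_select_first(b1: str, b2: str) -> str:
    """Aggregator that simply selects one of the received actions."""
    return b1

def f_combine(score_fns: List[Callable[[str], float]]) -> str:
    """Aggregator that combines per-head preferences by summing their scores."""
    totals = {a: sum(fn(a) for fn in score_fns) for a in ACTIONS}
    return select_best(totals)

# Two sub-agents with different (hand-written) preferences over the same actions.
head_1 = lambda a: {"up": 0.9, "down": 0.1, "left": 0.2, "right": 0.3}[a]
head_2 = lambda a: {"up": 0.4, "down": 0.2, "left": 0.8, "right": 0.1}[a]

b1 = select_best({a: head_1(a) for a in ACTIONS})   # "up"
b2 = select_best({a: head_2(a) for a in ACTIONS})   # "left"

print(f_select_first(b1, b2))       # selects one proposal: up
print(f_combine([head_1, head_2]))  # combines scores: up (0.9 + 0.4 is highest)
```

The dispute in the 103 response is essentially whether the proposals b1 and b2 feeding this aggregator can be read as "task request parameters" of one request, or only as candidate actions from which a single action is chosen.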

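The examiner's claim mapping reads each policy head as selecting one parameter of a resource task request: a quantity head and a cost/price head (mapped to Hernandez) and, for the dependent claim, a lit/dark venue head (mapped to Baldacci), with the combined selections defining a single request signal. The sketch below illustrates that mapping only; the class and field names (ResourceTaskRequest, quantity, price_offset, venue) and the candidate value grids are hypothetical and are not drawn verbatim from the claims or the cited references.

```python
# Sketch of the claim mapping discussed in the 103 response: separate policy
# heads each select one parameter of a resource task request (quantity, price,
# and lit/dark venue), and the selections are packaged into one request signal.
# All names and value grids are illustrative assumptions.
from dataclasses import dataclass
import torch
import torch.nn as nn

QUANTITIES = [100, 200, 500, 1000]         # candidate order sizes (assumed)
PRICE_OFFSETS = [-0.02, -0.01, 0.0, 0.01]  # candidate price offsets (assumed)
VENUES = ["lit", "dark"]                   # Baldacci-style venue choice (assumed)

@dataclass
class ResourceTaskRequest:
    quantity: int
    price_offset: float
    venue: str

class MultiHeadPolicy(nn.Module):
    def __init__(self, state_dim: int = 16):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU())
        self.quantity_head = nn.Linear(32, len(QUANTITIES))
        self.price_head = nn.Linear(32, len(PRICE_OFFSETS))
        self.venue_head = nn.Linear(32, len(VENUES))

    def generate_request(self, state: torch.Tensor) -> ResourceTaskRequest:
        z = self.trunk(state)
        # Each head independently selects its own task request parameter.
        q_idx = self.quantity_head(z).argmax(dim=-1).item()
        p_idx = self.price_head(z).argmax(dim=-1).item()
        v_idx = self.venue_head(z).argmax(dim=-1).item()
        # The combined selections define the resource task request signal.
        return ResourceTaskRequest(QUANTITIES[q_idx], PRICE_OFFSETS[p_idx], VENUES[v_idx])

policy = MultiHeadPolicy()
state = torch.randn(1, 16)   # stand-in for the encoded environment state
print(policy.generate_request(state))
```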
Prosecution Timeline

Aug 23, 2022
Application Filed
Jul 23, 2025
Non-Final Rejection — §101, §103, §112
Oct 29, 2025
Response Filed
Dec 17, 2025
Final Rejection — §101, §103, §112 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12585955
Minimal Trust Data Sharing
2y 5m to grant Granted Mar 24, 2026
Patent 12579440
Training Artificial Neural Networks Using Context-Dependent Gating with Weight Stabilization
2y 5m to grant Granted Mar 17, 2026
Study what changed to get past this examiner. Based on 2 most recent grants.

AI Strategy Recommendation

Get an AI-powered prosecution strategy using examiner precedents, rejection analysis, and claim mapping.
Powered by AI — typically takes 5-10 seconds

Prosecution Projections

3-4
Expected OA Rounds
60%
Grant Probability
0%
With Interview (-60.0%)
4y 4m
Median Time to Grant
Moderate
PTA Risk
Based on 5 resolved cases by this examiner. Grant probability derived from career allow rate.
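A minimal sketch of how these figures could be reproduced from the raw counts shown on this page, assuming the grant probability is simply the examiner's career allow rate and the interview figure applies the lift as a percentage-point adjustment; the function names are illustrative and this is not necessarily the page's actual formula.

```python
# Illustrative reconstruction of the displayed projections, assuming grant
# probability equals the career allow rate and the interview figure applies
# the observed lift in percentage points. Not the page's actual formula.
def allow_rate(granted: int, resolved: int) -> float:
    return granted / resolved

base = allow_rate(granted=3, resolved=5)          # 0.60 -> "60% Grant Probability"
interview_lift = -0.60                            # "-60.0%" interview lift
with_interview = max(0.0, base + interview_lift)  # 0.00 -> "0% With Interview"

print(f"{base:.0%} baseline, {with_interview:.0%} with interview")
```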
