Office Action Analysis: 17837882 — SYSTEM AND METHOD FOR RISK SENSITIVE REINFORCEMENT LEARNING ARCHITECTURE

Examiner Intelligence

LAHAM BAUZO, ALVARO SALIM View full profile →
Grants only 25% of cases
Career Allowance Rate
1 granted / 4 resolved
-30.0% vs TC avg
Strong +100% interview lift
Without
With
+100.0%
Interview Lift
resolved cases with interview
Typical timeline
4y 1m
Avg Prosecution
17 currently pending
Career history
30
Total Applications
across all art units
Statute-Specific Performance

§101
3.5%
-36.5% vs TC avg
§103
96.6%
+56.6% vs TC avg
Black line = Tech Center average estimate • Based on career data from 4 resolved cases
Office Action

§101 §103
DETAILED ACTION
Notice of Pre-AIA  or AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .

Amendments
This Office Action is in response to the amendment filed on January 12, 2026. 
Claim(s) 1-2, 4, 11-12, and 14 have been amended. 
No claims have been cancelled.
Claim 21 has been added. 
The objections and rejections from the prior correspondence that are not restated herein are withdrawn.

Response to Arguments
Applicant's arguments filed on January 12, 2026 have been fully considered.
Applicant's arguments regarding the 35 U.S.C. 101 rejections of the previous office action have been fully considered but are not persuasive. Applicant argues:
“Applicant further submits that the human mind is not equipped to reduce expected reward variance by updating, in parallel, a sampled subset of initialized learning tables, since, as recited, each updated learning table is obtained by updating an initialized learning table using a utility function.
The Applicant submits that since the limitations of the claims cannot be practically performed in the human mind, the claims are not directed to a judicial exception.”
Examiner agrees that updating learning tables in parallel cannot be practically performed in the human mind. However, updating the learning tables involves the use of the utility function, and thus involves mathematical calculations. For this reason, independent claims 1 and 11 remain directed to a judicial exception, as shown in detail in the 101 rejections below.
Applicant further argues:
“Applicant submits that the claimed subject matter effects an improvement to the technology of a machine learning model. As described in the Applicant's disclosure, the claimed subject matter can reduce the training variance of a reinforcement learning network, by averaging multiple Q tables. The reduction in training variance can enable the reinforcement learning neural network to converge faster, to improve generalization and to become a better model (see e.g., paragraph [0132] application as published). Since actual training time can be limited, reducing training variance can enable convergence within the limited training time.
Further, the initialized learning tables Qi are updated in parallel, enabling the process of updating learning tables to be achieved more quickly.
Further, as explained, averaging the updated Q tables can enable more stable actions to be selected. The Applicant submits that reducing training variance, improving the performance of the reinforcement neural network and improving convergence constitute improvements to the neural network of the trained agent.”
	Examiner respectfully disagrees. It should be noted that the improvement must be provided by the additional element, not by limitations directed to a judicial exception. The reduction in training variance enabling faster model convergence and improved generalization is directed to mathematical calculations since the reduction in training variance is a result of updating the learning tables by using the utility function. For the same reason, updating the learning tables in parallel and achieving the updating process more quickly constitute an improvement in the mathematical calculations recited in the claim. 
Applicant further argues:
“Applicant submits that the claims amount to significantly more than the alleged exception.
Claim 1 recites specific steps to generate a signal for communicating one or more resource task requests. As recited in claim 1, signal for communicating one or more resource task requests is generated based on the averaged learning table Q', which itself is generated based on the updated learning tables and the initialized learning tables outside the subset (i.e., the non-updated initialized learning tables). The plurality of updated learning tables are obtained based on the initialized learning tables Qi and using a utility function.
The Applicant submits that these limitations constrain the scope of the claims, since they require a specific sequence of steps, such that the claims amount to significantly more than the alleged exception.”
Examiner respectfully disagrees. The sequence of steps that result in an output signal for communicating one or more resource task requests are a specific sequence of mathematical computation steps that involves updating the learning tables. It should be noted that a specific ordering of process steps and/or mathematical operations may help applicants in distinguishing the claimed invention from prior art (e.g., non-obviousness), but a specific ordering of mathematical calculations does not amount to significantly more.
Applicant's arguments regarding the 35 U.S.C. 103 rejections of the previous office action have been fully considered but are not persuasive. Applicant argues:
“Pan only discloses that k action-value outputs             
                
                        Q
                    
                        1
                    
        (s,• ),.,             
                
                        Q
                    
                        k
                    
        (s,•) are obtained via k different heads and does not explain how the k different heads calculate the action-value outputs. The Applicant submits that at best, Pan discloses that all initialized action value outputs Q are updated. Pan does not disclose updating, in parallel, a sampled subset of initialized learning tables.”
Examiner respectfully disagrees. PAN [page 2, section I. Introduction]  teaches that actions with low variance of the expected reward should be selected. PAN page 3, section B] teaches estimating the variance by training Q networks in parallel. Furthermore, PAN [page 3, section B] teaches: “After this, the input is passed to             
                k
            
         different heads which perform one dense layer to obtain             
                k
            
         action-value outputs:             
                
                                Q
                            
                                1
                            
                                s
                                ,
                                ⋅
                            
                        ,
                        …
                        ,
                        
                                Q
                            
                                k
                            
                                s
                                ,
                                ⋅
                            
        .” Additionally, PAN [page 3, Figure 2] teaches: “The resulting 512-dimensional vector is copied and passed to k branches, which each process it through dense layers to obtain a state value vector             
                
                        Q
                    
                        i
                    
                        s
                        ,
                        ⋅
                    
        .” Performing one dense layer is how PAN teaches calculating the action-value outputs. Specifically, PAN teaches:
reduce expected reward variance by updating, in parallel, a sampled subset of the plurality of initialized learning tables                     
                        
                                Q
                            
                                i
                            
                 to obtain a plurality of updated learning tables using a […] function, […]; (PAN [page 2, section I. Introduction] teaches: “A robust policy should not only maximize long term expected reward, but should also select actions with low variance of that expected reward (i.e., reduce expected reward variance). Maximizing the expectation of the value function only maximizes the point estimate of that function without giving a guarantee on the variance.” PAN [page 3, B. Reward Design and Risk Modeling] teaches: “The risk of an action can be modeled by estimating the variance of the value function across different models trained on different sets of data. Inspired by [14], we estimate the variance of Q value functions by training multiple Q value networks in parallel (i.e., by updating in parallel).” PAN [page 4, section C. Risk Averse RARL] teaches: “The use of masks in updating Q value functions is similar to [14], where the mask is a integer vector of size equal to batch size times number of ensemble Q networks, and is used to determine which model is to be updated with the sample batch (i.e., updating […] a sampled subset of the plurality of initialized learning tables                     
                        
                                Q
                            
                                i
                            
                 to obtain a plurality of updated learning tables).” PAN [page 3, B. Reward Design and Risk Modeling] teaches: " When updating Q functions, our algorithm (like DQN [2]) samples a batch of data of size B from the replay buffer                     
                        
                                                        s
                                                        ,
                                                        a
                                                        ,
                                                        
                                                                s
                                                            
                                                                '
                                                            
                                                        ,
                                                        r
                                                        ,
                                                        d
                                                        o
                                                        n
                                                        e
                                                    
                                                t
                                            
                                t
                                =
                                1
                            
                                B
                            
                 which, for each data point, includes the state, action, next state, reward, and task completion signal.” PAN [page 1, Fig. 1] teaches: "The risk-averse protagonist and risk-seeking adversarial agents learn policies to maximize or minimize reward, respectively. The use of the adversary helps the protagonist to effectively explore risky states." PAN [page 4, Algorithm 1] teaches updating the Q networks:
            
                G
                e
                n
                e
                r
                a
                t
                e
                 
                m
                a
                s
                k
                 
                M
                ∈
                
                        R
                    
                        k
                    
                ~
                P
                o
                i
                s
                s
                o
                n
                (
                q
                )
            
                U
                p
                d
                a
                t
                e
                 
                        Q
                    
                        p
                    
                        i
                    
                w
                i
                t
                h
                 
                R
                
                        B
                    
                        p
                    
                a
                n
                d
                 
                        M
                    
                        i
                    
                ,
                 
                i
                =
                1,2
                ,
                …
                ,
                k
                ;
            
                U
                p
                d
                a
                t
                e
                 
                        Q
                    
                        A
                    
                        i
                    
                w
                i
                t
                h
                 
                R
                
                        B
                    
                        A
                    
                a
                n
                d
                 
                        M
                    
                        i
                    
                ,
                 
                i
                =
                1,2
                ,
                …
                ,
                k
                ;
            
Examiner’s note: PAN’s algorithm 1 generates a mask             
                M
            
         of size             
                k
            
         using a Poisson distribution function. This mask determines which model is to be updated, and the selected models correspond to the sample subset.)
Applicant further argues: 
“Mihatsch only mentions variance in the context of maximizing the return of a portfolio and reducing its variance, and does not discuss reward variance. Accordingly, Mihatsch does not disclose reducing expected reward variance by updating in parallel, a sampled subset of a plurality of initialized learning tables to obtain a plurality of updated learning tables and generating an averaged learning Q' table based on the plurality of updated learning tables and the initialized learning tables outside the sampled subset.”
Examiner respectfully disagrees. Arguments regarding MIHATSCH are moot because MIHATSCH is not relied upon for teaching “a sampled subset of a plurality of initialized learning tables to obtain a plurality of updated learning tables and generating an averaged learning Q' table based on the plurality of updated learning tables and the initialized learning tables outside the sampled subset.”

Applicant further argues: 
“Dixon describes machine learning in finance. Applicant is unable to locate, in Dixon, any description of reducing expected reward variance by updating in parallel, a sampled subset of a plurality of initialized learning tables to obtain a plurality of updated learning tables and generating an averaged learning Q' table based on the plurality of updated learning tables and the initialized learning tables outside the sampled subset. Accordingly, Applicant submits that the cited references, whether considered alone or in combination, do not disclose the features of independent claims 1, 11. Further the Applicant submits that the dependent claims are also novel and non-obvious over the cited references at least by virtue of their dependencies on independent claims 1 and 11.”
Examiner respectfully disagrees. Arguments regarding DIXON are moot because DIXON is not relied upon for teaching “a sampled subset of a plurality of initialized learning tables to obtain a plurality of updated learning tables and generating an averaged learning Q' table based on the plurality of updated learning tables and the initialized learning tables outside the sampled subset.”

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-21 are rejected under 35 U.S.C.101 because the claimed invention is directed to an abstract idea without significantly more.

Step 1: Claims 11-20 are directed to a process. Claims 1-10 and 21 are directed to a machine or an article of manufacture.

With respect to claim(s) 1 and 11:
2A Prong 1: The claim(s) recite(s) an abstract idea. Specifically:
Initialize/initializing a plurality of learning tables                     
                        
                                Q
                            
                                i
                            
                 for the automated agent based on the plurality of states and the plurality of actions; (Mental process – Initializing a plurality of learning tables based on states and actions can be practically performed in the human mind, or by a human using a pen and paper as a physical aid. Paragraph [00142] of the specification states that all values in the plurality of learning tables are initialized to 0, which is reasonably within the capabilities of the human mind or a human using a pen and paper – see MPEP § 2106.04(a)(2)(III))
reduce/reducing expected reward variance by updating, in parallel, a sampled subset of the plurality of initialized learning tables                     
                        
                                Q
                            
                                i
                            
                 to obtain a plurality of updated learning tables using a utility function, the utility function comprising a monotonically increasing concave function; (Mathematical concepts – Updating the sampled subset of the plurality of initialized learning tables using a utility function involves mathematical calculations (see paragraph [0006]) – see MPEP § 2106.04(a)(2)(I))
generate/generating an averaged learning table Q' based on the plurality of updated learning tables and the one or more initialized learning tables outside the sampled subset (Mathematical concepts – Generating an averaged learning table Q’ involves mathematical calculations (see paragraph [0008]) – see MPEP § 2106.04(a)(2)(I))
If claim limitations, under their broadest reasonable interpretation, cover performance of the limitations as a mental process, but for the recitation of generic computer components, then the claim limitations fall within the mathematical or mental process grouping of abstract ideas. Accordingly, the claim “recites” an abstract idea.
2A Prong 2: The additional elements recited in the claim(s) do not integrate the abstract idea into a practical application, individually or in combination.
Additional elements:
(Claim 1) a communication interface; at least one processor; memory in communication with said at least one processor; (Mere recitation of a generic computer component – see § MPEP 2106.05(b)(I))
(Claim 1) software code stored in said memory, which when executed at said at least one processor causes said system to: (Adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea – see MPEP 2106.05(f).) 
instantiate/instantiating an automated agent that maintains a reinforcement learning neural network (Generally linking the use of a judicial exception to a particular technological environment or field of use – see MPEP § 2106.05(h).)  
receive/receiving by way of said communication interface, a plurality of training input data including a plurality of states and a plurality of actions for the automated agent; (Mere data gathering – Adding insignificant extra-solution activity of mere data gathering to the judicial exception – see § MPEP2106.05(g).)
generate a signal for communicating one or more resource task requests based on the averaged learning table Q’ (Mere instructions to apply an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea – see MPEP 2106.05(f).)
Since the claim as a whole, looking at the additional elements individually and in combination, does not contain any other additional elements that are indicative of integration into a practical application, the claim is directed to an abstract idea.
2B: The claim(s) do(es) not include additional elements that are sufficient to amount to significantly more than the judicial exception.
Additional elements:
(Claim 1) a communication interface; at least one processor; memory in communication with said at least one processor; (Mere recitation of a generic computer component – see § MPEP 2106.05(b)(I))
(Claim 1) software code stored in said memory, which when executed at said at least one processor causes said system to: (Adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea – see MPEP 2106.05(f).) 
instantiate/instantiating an automated agent that maintains a reinforcement learning neural network (Generally linking the use of a judicial exception to a particular technological environment or field of use – see MPEP § 2106.05(h).)  
receive/receiving by way of said communication interface, a plurality of training input data including a plurality of states and a plurality of actions for the automated agent; (Simply appending well-understood, routine, conventional activities previously known to the industry, specified at a high level of generality, to the judicial exception (WURC)- see MPEP § 2106.05(d)(ll)(i) - Receiving or transmitting data over a network, e.g., using the Internet to gather data, Symantec, 838 F.3d at 1321, 120 USPQ2d at 1362 (utilizing an intermediary computer to forward information).)
generate a signal for communicating one or more resource task requests based on the averaged learning table Q’ (Mere instructions to apply an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea – see MPEP 2106.05(f).)
Considering the additional elements individually and in combination, and the claim as a whole, the additional elements do not provide significantly more than the abstract idea. Therefore, the claim is not patent eligible.

With respect to claim(s) 2 and 12:
2A Prong 1: The claims recite an abstract idea. Specifically:
(Claim 2) select an action […]
(Claim 12) further comprising selecting an action, by the automated agent, […]
[…] based on the averaged learning table Q' for communicating the one or more task requests. (Mental process – selecting an action based on the averaged learning table can be practically performed in the human mind, or by a human using a pen and paper as a physical aid – see MPEP § 2106.04(a)(2)(III)) 
2A Prong 2: The additional elements recited in the claims do not integrate the abstract idea into a practical application, individually or in combination.
Additional elements:
wherein said automated agent is configured to […] (Adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea – see MPEP 2106.05(f).) 
2B: The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception.
Additional elements:
wherein said automated agent is configured to […] (Adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea – see MPEP 2106.05(f).) 
Therefore, the claims are ineligible.

With respect to claim(s) 3 and 13:
2A Prong 1: The claims recite an abstract idea. Specifically:
wherein the utility function is represented by                     
                        u
                        
                                x
                            
                        =
                        -
                        
                                e
                            
                                β
                                x
                            
                        ,
                         
                        β
                        <
                        0
                    
                . (Mathematical concept – mathematical formula – see MPEP § 2106.04(a)(2)(I))  
Additionally, the claims do not recite any new additional elements that would amount to an integration of the abstract idea into a practical application (individually or in combination) or significantly more than the judicial exception.
Therefore, the claims are ineligible.

With respect to claim(s) 4 and 14:
2A Prong 1: The claims recite an abstract idea. Specifically:
updating the sampled subset of the plurality of initialized learning tables                     
                        
                                Q
                            
                                i
                            
                 comprises: (Mathematical concepts – Updating the sampled subset of the plurality of initialized learning tables involves using a utility function, and thus involves mathematical calculations (see paragraph [0006]) – see MPEP § 2106.04(a)(2)(I))
for each training step t, where t= 1, 2... T: computing an interim learning table                     
                        
                                Q
                            
                            ^
                        
                 based on the initialized learning table Q; (Mathematical concept & mental process – computing an interim learning table involves mathematical calculations and can be practically performed in the human mind, or by a human using a pen and paper as a physical aid – see MPEP § 2106.04(a)(2))  
selecting an action                     
                        
                                a
                            
                                t
                            
                 from the plurality of actions based on the interim learning table                     
                        
                                Q
                            
                            ^
                        
                 and a given state                     
                        
                                s
                            
                                t
                            
                 from the plurality of states; (Mental process – selecting an action can be practically performed in the human mind, or by a human using a pen and paper as a physical aid – see MPEP § 2106.04(a)(2)(III)) 
computing a reward                     
                        
                                r
                            
                                t
                            
                 and a next state                     
                        
                                s
                            
                                t
                                +
                                1
                            
                 based on the selected action                     
                        
                                a
                            
                                t
                            
                ; and (Mathematical concept & mental process – computing a reward and next state involves mathematical calculations and/or can be practically performed in the human mind, or by a human using a pen and paper as a physical aid – see MPEP § 2106.04(a)(2))  
for at least two values of i, where i= 1, 2, ..., k, computing a respective updated learning table                     
                        
                                Q
                            
                                i
                            
                 of the plurality of updated learning tables based on (                    
                        
                                s
                            
                                t
                            
                        ,
                         
                                a
                            
                                t
                            
                        ,
                         
                                r
                            
                                t
                            
                        ,
                         
                                s
                            
                                t
                                +
                                1
                            
                ) and the utility function. (Mathematical concept – computing a respective updated learned table based on state, action, reward, next state, and utility function involves mathematical calculations – see MPEP § 2106.04(a)(2)(III))  
2A Prong 2: The additional elements recited in the claims do not integrate the abstract idea into a practical application, individually or in combination.
Additional elements:
receiving, by way of said communication interface, an input parameter k and a training step parameter T; (Mere data gathering – see § MPEP2106.05(g).) 
2B: The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception.
Additional elements:
receiving, by way of said communication interface, an input parameter k and a training step parameter T; (Simply appending well-understood, routine, conventional activities previously known to the industry, specified at a high level of generality, to the judicial exception (WURC)- see MPEP § 2106.05(d)(ll)(i) - Receiving or transmitting data over a network, e.g., using the Internet to gather data, Symantec, 838 F.3d at 1321, 120 USPQ2d at 1362 (utilizing an intermediary computer to forward information).) 
Therefore, the claims are ineligible.

With respect to claim(s) 5 and 15:
2A Prong 1: The claims recite an abstract idea. Specifically:
wherein the averaged learning table Q' is computed as                     
                        
                                1
                            
                                k
                            
                                ∑
                                
                                    i
                                    =
                                    1
                                
                                    k
                                
                                        Q
                                    
                                        i
                                    
                . (Mathematical concept –mathematical formula – see MPEP § 2106.04(a)(2)(I))  
Additionally, the claims do not recite any new additional elements that would amount to an integration of the abstract idea into a practical application (individually or in combination) or significantly more than the judicial exception.
Therefore, the claims are ineligible.

With respect to claim(s) 6 and 16:
2A Prong 1: The claims recite an abstract idea. Specifically:
(Claim 6) initialize […]
(Claim 16) initializing […]
[…] an adversarial learning table                     
                        
                                Q
                            
                                A
                            
                 for the adversarial agent; (Mental process – initializing an adversarial table can be practically performed in the human mind, or by a human using a pen and paper as a physical aid. Paragraph [00142] of the specification states that all values in the plurality of learning tables are initialized to 0, which is reasonably within the capabilities of the human mind or a human using a pen and paper – see MPEP § 2106.04(a)(2)(III)) 
compute a plurality of updated adversarial learning tables based on the initialized adversarial learning table                     
                        
                                Q
                            
                                A
                            
                 using a second utility function, the second utility function comprising a monotonically increasing convex function; (Mathematical calculations & mental process – computing updated adversarial learning tables using a monotonically increasing convex function involves mathematical calculations and/or can be practically performed in the human mind, or by a human using a pen and paper as a physical aid – see MPEP § 2106.04(a)(2)(III)) 
(Claim 6) generate […]
(Claim 16) generating […]
[…] an averaged adversarial learning table                     
                        
                                Q
                            
                                A
                            
                ' based on the plurality of updated adversarial learning tables. (Mental process – generating an averaged adversarial learning table can be practically performed in the human mind, or by a human using a pen and paper as a physical aid – see MPEP § 2106.04(a)(2)(III)) 
2A Prong 2: The additional elements recited in the claims do not integrate the abstract idea into a practical application, individually or in combination.
Additional elements:
generates, according to outputs of said adversarial reinforcement learning neural network, signals for communicating adversarial task requests; (Adding insignificant extra-solution activity to the judicial exception – see § MPEP2106.05(g).)  
(Claim 6) instantiate […]
(Claim 16) instantiating […]
[…] an adversarial agent that maintains an adversarial reinforcement learning neural network (Generally linking the use of a judicial exception to a particular technological environment or field of use – see MPEP § 2106.05(h).) 
2B: The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception.
Additional elements:
generates, according to outputs of said adversarial reinforcement learning neural network, signals for communicating adversarial task requests; (Simply appending well-understood, routine, conventional activities previously known to the industry, specified at a high level of generality, to the judicial exception (WURC)- see MPEP § 2106.05(d)(ll)(i) - Receiving or transmitting data over a network, e.g., using the Internet to gather data, Symantec, 838 F.3d at 1321, 120 USPQ2d at 1362 (utilizing an intermediary computer to forward information).)  
(Claim 6) instantiate […]
(Claim 16) instantiating […]
[…] an adversarial agent that maintains an adversarial reinforcement learning neural network (Generally linking the use of a judicial exception to a particular technological environment or field of use – see MPEP § 2106.05(h).) 
Therefore, the claims are ineligible.

With respect to claim(s) 7 and 17:
2A Prong 1: The claims recite an abstract idea. Specifically:
(Claim 7) wherein said adversarial agent is configured to select an adversarial action […] 
(Claim 17) further comprising selecting an adversarial action, by the adversarial agent, […]
[…] based on the averaged adversarial learning table                     
                        
                                Q
                            
                                A
                            
                ' to minimize a reward for the automated agent. (Mathematical concepts – An adversarial agent uses the utility function to minimize the reward (see paragraph [00148]) – see MPEP § 2106.04(a)(2)(I))
Additionally, the claim does not recite any new additional elements that would amount to an integration of the abstract idea into a practical application (individually or in combination) or significantly more than the judicial exception.
Therefore, the claims are ineligible.

With respect to claim(s) 8 and 18:
2A Prong 1: The claims recite an abstract idea. Specifically:
wherein the second utility function is represented by                     
                        
                                u
                            
                                A
                            
                                x
                            
                        =
                        -
                        
                                e
                            
                                        β
                                    
                                        A
                                    
                                x
                            
                        ,
                         
                                β
                            
                                A
                            
                        >
                        0
                    
                . (Mathematical concept – mathematical formula – see MPEP § 2106.04(a)(2)(I))  
Additionally, the claims do not recite any new additional elements that would amount to an integration of the abstract idea into a practical application (individually or in combination) or significantly more than the judicial exception.
Therefore, the claims are ineligible.

With respect to claim(s) 9 and 19:
2A Prong 1: The claims recite an abstract idea. Specifically:
for each training step t, where t= 1, 2... T:
computing an interim adversarial learning table                     
                        
                                        Q
                                    
                                    ^
                                
                                A
                            
                 based on the initialized adversarial learning table                     
                        
                                Q
                            
                                A
                            
                ; (Mathematical concept & mental process – computing an interim adversarial table involves mathematical calculations and/or can be practically performed in the human mind, or by a human using a pen and paper as a physical aid – see MPEP § 2106.04(a)(2)) 
selecting an adversarial action                     
                        
                                a
                            
                                t
                            
                                A
                            
                 based on the interim adversarial learning table                     
                        
                                        Q
                                    
                                    ^
                                
                                A
                            
                 and a given state                     
                        
                                s
                            
                                t
                            
                 from the plurality of states; (Mental process – selecting an adversarial action can be practically performed in the human mind, or by a human using a pen and paper as a physical aid – see MPEP § 2106.04(a)(2)(III)) 
computing an adversarial reward                     
                        
                                r
                            
                                t
                            
                                A
                            
                 and a next state                     
                        
                                s
                            
                                t
                                +
                                1
                            
                 based on the selected adversarial action                     
                        
                                a
                            
                                t
                            
                                A
                            
                ; (Mathematical concept & mental process – computing an adversarial reward and next state involves mathematical calculations and/or can be practically performed in the human mind, or by a human using a pen and paper as a physical aid – see MPEP § 2106.04(a)(2)(III)) 
for at least two values of i, where i= 1, 2, ..., k, computing a respective updated adversarial learning table                     
                        
                                Q
                            
                                A
                            
                                i
                            
                 of the plurality of updated adversarial learning tables based on (                    
                        
                                s
                            
                                t
                            
                        ,
                         
                                a
                            
                                t
                            
                                A
                            
                        ,
                         
                                r
                            
                                t
                            
                                A
                            
                        ,
                         
                                s
                            
                                t
                                +
                                1
                            
                ) and the second utility function. (Mathematical concept & mental process –computing respective updated adversarial learning tables based on state, action, reward, next state and the second utility function involves mathematical calculations and/or can be practically performed in the human mind, or by a human using a pen and paper as a physical aid – see MPEP § 2106.04(a)(2)(III)) 
2A Prong 2: The additional elements recited in the claim do not integrate the abstract idea into a practical application, individually or in combination.
Additional elements:
receiving, by way of said communication interface, an input parameter k and a training step parameter T; (Mere data gathering – see § MPEP2106.05(g).) 
2B: The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception.
Additional elements:
receiving, by way of said communication interface, an input parameter k and a training step parameter T; (Simply appending well-understood, routine, conventional activities previously known to the industry, specified at a high level of generality, to the judicial exception (WURC)- see MPEP § 2106.05(d)(ll)(i) - Receiving or transmitting data over a network, e.g., using the Internet to gather data, Symantec, 838 F.3d at 1321, 120 USPQ2d at 1362 (utilizing an intermediary computer to forward information).)
Therefore, the claims are ineligible.

With respect to claim(s) 10 and 20:
2A Prong 1: The claims recite an abstract idea. Specifically:
wherein the averaged adversarial learning table                     
                        
                                Q
                            
                                A
                            
                ’ is computed as                      
                        
                                1
                            
                                k
                            
                                ∑
                                
                                    i
                                    =
                                    1
                                
                                    k
                                
                                        Q
                                    
                                        A
                                    
                                        i
                                    
                . (Mathematical concept – mathematical formula – see MPEP § 2106.04(a)(2)(I))  
Additionally, the claims do not recite any new additional elements that would amount to an integration of the abstract idea into a practical application (individually or in combination) or significantly more than the judicial exception.
Therefore, the claims are ineligible.

With respect to claim(s) 21:
2A Prong 1: The claim(s) recite(s) an abstract idea. Specifically:
wherein the utility function is applied to a temporal difference error between a first state t and a next state t+1, each state associated with an action of the automated agent. (Mathematical concepts – The utility function involves mathematical calculations and applying it constitutes performing mathematical calculations – see MPEP § 2106.04(a)(2)(I))
Additionally, the claim(s) do not recite any new additional elements that would amount to an integration of the abstract idea into a practical application (individually or in combination) or significantly more than the judicial exception.
Since the claim does not recite additional elements that either integrate the judicial exception into a practical application, nor provide significantly more than the judicial exception, the claim is not patent eligible. Therefore, the claim is not patent eligible.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-21 are rejected under 35 U.S.C. 103 as being unpatentable over PAN (“Risk Averse Robust Adversarial Reinforcement Learning”) in view of MIHATSCH ("Risk-Sensitive Reinforcement Learning") and DIXON et al. ("Machine Learning in Finance"), hereafter PAN, MIHATSCH, and DIXON respectively.

Regarding Claim 1:
PAN teaches:
software code stored in said memory, which when executed at said at least one processor causes said system to: (PAN [page 2, I. Introduction] teaches: "We use a discrete control task, autonomous driving with the TORCS [15] simulator". Examiner’s note: “software code” can be interpreted as the TORCs simulator, which a person having ordinary skill in the art would recognize that installing TORCs to run simulations requires memory in communication with a processor to run such simulations.)
communication interface; at least one processor; memory in communication with said at least one processor; (PAN incorporates by reference [15] B. Wymann, E. Espié, C. Guionneau, C. Dimitrakakis, R. Coulom, and A. Sumner, “Torcs, the open racing car simulator,” Software available at http://torcs.sourceforge.net, 2000, which is describes TORCs as an AI racing game and research platform, thus requiring an interface for communication to receive inputs from the user or researcher.)
instantiate an automated agent that maintains a reinforcement learning neural network  and generates, according to outputs of said reinforcement learning neural network, signals for communicating task requests; (PAN [page 1, I. Introduction] teaches: "We additionally focus on a framework that includes an adversary in addition to the main (i.e., protagonist) agent." PAN [page 1, Abstract] teaches: "We use an ensemble of policy networks to model risk as the variance of value functions." PAN [page 3, B. Reward Design and Risk Modeling] teaches: "Inspired by [14], we estimate the variance of Q value functions by training multiple Q value networks in parallel." (PAN [page 3, C. Risk Averse RARL] teaches: "The Q learning Bellman equation is modified to be compatible with this case. Let the current and target value functions be                         
                            
                                    Q
                                
                                    P
                                
                     and                         
                            
                                    Q
                                
                                    P
                                
                                    *
                                
                     for the protagonist, and (respectively)                         
                            
                                    Q
                                
                                    A
                                
                     and                         
                            
                                    Q
                                
                                    A
                                
                                    *
                                
                     for the adversary." PAN [page 3, B. Reward Design and Risk Modeling] teaches: "The variance in Equation 3 measures risk, and our goal is for the protagonist and adversarial agents to select actions with low and high variance, respectively." PAN [ page 3, III. Risk Averse Robust Adversarial RL] teaches: "For example, in autonomous driving, a single risky action may not put the vehicle in a dangerous condition." PAN [page 4, IV. Experiments] teaches: "The vehicle can execute nine actions: (1) move left and accelerate, (2) move ahead and accelerate, (3) move right and accelerate, (4) move left, (5) do nothing, (6) move right, (7) move right and decelerate, (8) move ahead and decelerate, (9) move right and decelerate." Examiner's note: under BRI, "instantiate an automated agent" can be interpreted as including the main agent (i.e., protagonist), and "that maintains a reinforcement learning neural network" can be interpreted as the Q Value networks that are trained when estimating the Q value functions in the two-agent reinforcement learning scenario. Furthermore, under BRI, "generates [...] signals for communicating task requests" can be interpreted as the action selected for driving the autonomous vehicle, for example, which are the outputs of the Q network (i.e., reinforcement learning neural network).)
receive, by way of said communication interface, a plurality of training input data including a plurality of states and a plurality of actions for the automated agent; (PAN [page 3, B. Reward Design and Risk Modeling] teaches: "As shown in Figure 2, the network takes in input s, which consists of stacked frames of consecutive observations." PAN [page 4, IV. Experiments] teaches: "The vehicle can execute nine actions […]." PAN [page 2, I. Introduction] teaches: "We use a discrete control task, autonomous driving with the TORCS [15] simulator". Examiner's note: under BRI, "plurality of states" can be interpreted as the stacked frames of consecutive observations (i.e., states), "plurality of actions for the automated agent" can be interpreted as the actions the vehicle can execute, and "said communication interface" can be interpreted as the TORCS simulator.)
initialize a plurality of learning tables                         
                            
                                    Q
                                
                                    i
                                
                     for the automated agent based on the plurality of states and the plurality of actions; (PAN [page 4, Algorithm 1] teaches initializing                         
                            
                                    Q
                                
                                    i
                                
                     value networks for both protagonist and adversarial networks (i.e., tables Q), where                         
                            i
                            =
                            1
                            ,
                             
                            …
                            ,
                            k
                        
                     (i.e., a plurality of learning tables) where the networks receive as input actions and states during training.)
reduce expected reward variance by updating, in parallel, a sampled subset of the plurality of initialized learning tables                         
                            
                                    Q
                                
                                    i
                                
                     to obtain a plurality of updated learning tables using a […] function, […]; (PAN [page 2, section I. Introduction] teaches: “A robust policy should not only maximize long term expected reward, but should also select actions with low variance of that expected reward (i.e., reduce expected reward variance). Maximizing the expectation of the value function only maximizes the point estimate of that function without giving a guarantee on the variance.” PAN [page 3, B. Reward Design and Risk Modeling] teaches: “The risk of an action can be modeled by estimating the variance of the value function across different models trained on different sets of data. Inspired by [14], we estimate the variance of Q value functions by training multiple Q value networks in parallel (i.e., by updating in parallel).” PAN [page 4, section C. Risk Averse RARL] teaches: “The use of masks in updating Q value functions is similar to [14], where the mask is a integer vector of size equal to batch size times number of ensemble Q networks, and is used to determine which model is to be updated with the sample batch (i.e., updating […] a sampled subset of the plurality of initialized learning tables                         
                            
                                    Q
                                
                                    i
                                
                     to obtain a plurality of updated learning tables).” PAN [page 3, B. Reward Design and Risk Modeling] teaches: " When updating Q functions, our algorithm (like DQN [2]) samples a batch of data of size B from the replay buffer                         
                            
                                                            s
                                                            ,
                                                            a
                                                            ,
                                                            
                                                                    s
                                                                
                                                                    '
                                                                
                                                            ,
                                                            r
                                                            ,
                                                            d
                                                            o
                                                            n
                                                            e
                                                        
                                                    t
                                                
                                    t
                                    =
                                    1
                                
                                    B
                                
                     which, for each data point, includes the state, action, next state, reward, and task completion signal.” PAN [page 1, Fig. 1] teaches: "The risk-averse protagonist and risk-seeking adversarial agents learn policies to maximize or minimize reward, respectively. The use of the adversary helps the protagonist to effectively explore risky states." PAN [page 4, Algorithm 1] teaches updating the Q networks:
                
                    G
                    e
                    n
                    e
                    r
                    a
                    t
                    e
                     
                    m
                    a
                    s
                    k
                     
                    M
                    ∈
                    
                            R
                        
                            k
                        
                    ~
                    P
                    o
                    i
                    s
                    s
                    o
                    n
                    (
                    q
                    )
                
                    U
                    p
                    d
                    a
                    t
                    e
                     
                            Q
                        
                            p
                        
                            i
                        
                    w
                    i
                    t
                    h
                     
                    R
                    
                            B
                        
                            p
                        
                    a
                    n
                    d
                     
                            M
                        
                            i
                        
                    ,
                     
                    i
                    =
                    1,2
                    ,
                    …
                    ,
                    k
                    ;
                
                    U
                    p
                    d
                    a
                    t
                    e
                     
                            Q
                        
                            A
                        
                            i
                        
                    w
                    i
                    t
                    h
                     
                    R
                    
                            B
                        
                            A
                        
                    a
                    n
                    d
                     
                            M
                        
                            i
                        
                    ,
                     
                    i
                    =
                    1,2
                    ,
                    …
                    ,
                    k
                    ;
                
Examiner’s note: PAN’s algorithm 1 generates a mask                 
                    M
                
             of size                 
                    k
                
             using a Poisson distribution function. This mask determines which model is to be updated, and the selected models correspond to the sample subset.)
generate an averaged learning table Q' based on the plurality of updated learning tables and the one or more initialized learning tables outside the sampled subset (PAN [page 3, III. Risk Averse Robust Adversarial RL] teaches: "the input is passed to k different heads which perform one dense layer to obtain k action-value outputs:                         
                            {
                            
                                    Q
                                
                                    1
                                
                                    s
                                    ,
                                     
                                    ∙
                                
                            ,
                            …
                            ,
                             
                                    Q
                                
                                    k
                                
                                    s
                                    ,
                                    ∙
                                
                            }
                        
                    . Defining the mean as                         
                            
                                    Q
                                
                                ~
                            
                                    s
                                    ,
                                    a
                                
                            =
                            
                                    1
                                
                                    k
                                
                                    ∑
                                    
                                        i
                                        =
                                        1
                                    
                                        k
                                    
                                            Q
                                        
                                            i
                                        
                                    (
                                    s
                                    ,
                                    a
                                    )
                                
                    , the variance of a single action a is,
                
                    V
                    a
                    
                            r
                        
                            k
                        
                            Q
                            
                                    s
                                    ,
                                    a
                                
                    =
                    
                            1
                        
                            k
                        
                            ∑
                            
                                i
                                =
                                1
                            
                                k
                            
                                                    Q
                                                
                                                    i
                                                
                                                    s
                                                    ,
                                                    a
                                                
                                            -
                                            
                                                    Q
                                                
                                                ~
                                            
                                                    s
                                                    ,
                                                    a
                                                
                                    2
                                
where we use the k subscripts to indicate variance over k models, as in Equations 1 and 2.” PAN [page 4, Algorithm 1] teaches updating the Q networks:
                
                    G
                    e
                    n
                    e
                    r
                    a
                    t
                    e
                     
                    m
                    a
                    s
                    k
                     
                    M
                    ∈
                    
                            R
                        
                            k
                        
                    ~
                    P
                    o
                    i
                    s
                    s
                    o
                    n
                    (
                    q
                    )
                
                    U
                    p
                    d
                    a
                    t
                    e
                     
                            Q
                        
                            p
                        
                            i
                        
                    w
                    i
                    t
                    h
                     
                    R
                    
                            B
                        
                            p
                        
                    a
                    n
                    d
                     
                            M
                        
                            i
                        
                    ,
                     
                    i
                    =
                    1,2
                    ,
                    …
                    ,
                    k
                    ;
                
                    U
                    p
                    d
                    a
                    t
                    e
                     
                            Q
                        
                            A
                        
                            i
                        
                    w
                    i
                    t
                    h
                     
                    R
                    
                            B
                        
                            A
                        
                    a
                    n
                    d
                     
                            M
                        
                            i
                        
                    ,
                     
                    i
                    =
                    1,2
                    ,
                    …
                    ,
                    k
                    ;
                
Examiner’s note: PAN’s algorithm 1 generates a mask                 
                    M
                
             of size                 
                    k
                
             using a Poisson distribution function. This mask determines which model is to be updated, and the selected models correspond to the sample subset. PAN’s mean formula                 
                    
                            Q
                        
                        ~
                    
                            s
                            ,
                            a
                        
             described above teaches computing the average over all                 
                    k
                
             Q value networks, which include both the networks selected for update by the Poisson mask and the models not selected for update by the Poisson mask (i.e., based on […] one or more initialized learning tables outside the sampled subset).)
generate a signal for communicating one or more resource task requests based on the averaged learning table Q’. (PAN [page 1, section I. Introduction] teaches: “We use a discrete control task, autonomous driving with the TORCS [15] simulator, to demonstrate the benefits of RARARL.” PAN [page 3, III. Risk Averse Robust Adversarial RL] teaches: "At test time, the mean value                         
                            
                                    Q
                                
                                ~
                            
                            (
                            s
                            ,
                            a
                            )
                        
                     is used for selecting actions." PAN [page 4, IV. Experiments] teaches: "The vehicle can execute nine actions: […]". Examiner’s note: The autonomous vehicle can only execute the selected action if this action was communicated to the vehicle.)
However, PAN is not relied upon for teaching, but MIHATSCH teaches: […] a utility function, the utility function comprising a monotonically increasing […] function (MIHATSCH [page 273, section 5. A new framework for risk-sensitive control] teaches: "Let κ ∈ (−1, 1) be a scalar parameter which we use to specify the desired risk-sensitivity. We define the transformation function
                
                            χ
                        
                            K
                        
                    :
                    x
                    →
                    
                                                    1
                                                    -
                                                    k
                                                
                                            x
                                            ,
                                             
                                            i
                                            f
                                             
                                            x
                                            >
                                            0
                                        
                                            (
                                            1
                                            +
                                            k
                                            )
                                            x
                                            ,
                                             
                                            o
                                            t
                                            h
                                            e
                                            r
                                            w
                                            i
                                            s
                                            e
                                        
[…] In other words, the objective function is risk-avoiding if                 
                    k
                    >
                    0
                
             and risk-seeking if                 
                    k
                    <
                    0
                
            ." Examiner's note: under BRI, "a monotonically increasing […] function" can be interpreted as the transformation function                 
                    
                            X
                        
                            K
                        
             (i.e., utility function). As                 
                    k
                    →
                    -
                    1
                
            , the functions would then take the form of                 
                    2
                    x
                
             if                 
                    x
                    >
                    0
                
             and 0 otherwise. In the same manner, as                 
                    k
                    →
                    1
                
            , the functions would then take the form of 0 if                 
                    x
                    >
                    0
                
             and                 
                    2
                    x
                
             otherwise. Therefore, the function is always increasing or 0, thus monotonically increasing.)
Accordingly, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of PAN and MIHATSCH before them, to include MIHATSCH’s transformation function in PAN's risk averse adversarial reinforcement learning method. One would have been motivated to make such a combination in order to better integrate risk-sensitive methods into machine learning algorithms with less restrictions for handling many real-world problems (MIHATSCH [page 272, 4. Risk-sensitive control based on exponential utilities]).
However, PAN in view of MIHATSCH is not relied upon for teaching, but DIXON teaches: […] a monotonically increasing concave function (DIXON [page 358, Example 10.8] teaches: "In particular, one popular choice is given by the exponential utility function                         
                            U
                            (
                            X
                            )
                            =
                            -
                            e
                            x
                            p
                            (
                            -
                            γ
                            X
                            )
                        
                    , (10.18)".)
Accordingly, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of PAN, MIHATSCH, and DIXON before them, to include DIXON's utility function, interface, hardware, and/or GPU cards in PAN and MIHATSCH's risk averse reinforcement learning method. One would have been motivated to make such a combination in order to make agents more rational and maximize their rewards (DIXON et al. [page 507, Bounded Rationality]).

Regarding Claim 2:
PAN in view of MIHATSCH and DIXON teaches the elements of claim 1 as outlined above. PAN further teaches:
wherein said automated agent is configured to select an action based on the averaged learning table Q' for communicating the one or more task requests. (PAN [page 3, III. Risk Averse Robust Adversarial RL] teaches: "At test time, the mean value                         
                            
                                    Q
                                
                                ~
                            
                            (
                            s
                            ,
                            a
                            )
                        
                     is used for selecting actions." PAN [page 4, IV. Experiments] teaches: "The vehicle can execute nine actions: […]". Examiner’s note: under BRI, the “automated agent” can be interpreted as the Q network, which uses the mean value (i.e., averaged learning table Q’)  to select the action. The mean value                         
                            
                                    Q
                                
                                ~
                            
                            (
                            s
                            ,
                            a
                            )
                        
                     is used to select actions for both the protagonist (i.e., automated agent) and an adversarial agent.)

Regarding Claim 3:
The combination of PAN/MIHATSCH/DIXON teaches the elements of claim 1 as outlined above. The combination of PAN/MIHATSCH/DIXON also teaches:
wherein the utility function is represented by                         
                            u
                            
                                    x
                                
                            =
                            -
                            
                                    e
                                
                                    β
                                    x
                                
                            ,
                             
                            β
                            <
                            0
                        
                    . (DIXON [page 358, Example 10.8] teaches: "In particular, one popular choice is given by the exponential utility function                         
                            U
                            (
                            X
                            )
                            =
                            -
                            e
                            x
                            p
                            (
                            -
                            γ
                            X
                            )
                        
                    , (10.18)". Examiner’s note: under BRI, the                         
                            β
                        
                     can be interpreted as                         
                            -
                            γ
                        
                    , which is less than 0.)

Regarding Claim 4:
PAN in view of MIHATSCH and DIXON teaches the elements of claim 1 as outlined above. PAN further teaches:
wherein updating the sampled subset of the plurality of initialized learning tables                         
                            
                                    Q
                                
                                    i
                                
                     comprises: (PAN [page 4, section C. Risk Averse RARL] teaches: “The use of masks in updating Q value functions is similar to [14], where the mask is a integer vector of size equal to batch size times number of ensemble Q networks, and is used to determine which model is to be updated with the sample batch (i.e., updating the sampled subset of the plurality of initialized learning tables                         
                            
                                    Q
                                
                                    i
                                
                    ).”)
receiving, by way of said communication interface, an input parameter k and a training step parameter T; and (PAN [page 4, Algorithm 1] teaches: "Input: Training steps T; Environment                         
                            e
                            n
                            v
                        
                    ; Adversarial Action Schedule                         
                            Ξ
                        
                    ; Exploration rate                         
                            ε
                        
                    ; Number of models                         
                            k
                        
                    ." PAN [page 2, I. Introduction] teaches: "We use a discrete control task, autonomous driving with the TORCS [15] simulator". Examiner's note: under BRI, "said communication interface" can be interpreted as the TORCS simulator.)
for each training step t, where t= 1, 2... T: (PAN [page 4, Algorithm 1] teaches: “While                         
                            t
                            <
                            T
                        
                     do”)
computing an interim learning table                         
                            
                                    Q
                                
                                ^
                            
                     based on the initialized learning table Q; (PAN [page 2, III. Risk Averse Robust Adversarial RL] teaches: "The protagonist should be risk averse, so we define the value of action                         
                            a
                        
                     at state                         
                            s
                        
                     to be
                
                                    Q
                                
                                ^
                            
                            P
                        
                            s
                            ,
                            a
                        
                    =
                    
                            Q
                        
                            P
                        
                            s
                            ,
                            a
                        
                    -
                    
                            λ
                        
                            P
                        
                    V
                    a
                    
                            r
                        
                            k
                        
                                    Q
                                
                                    P
                                
                                    k
                                
                                    s
                                    ,
                                    a
                                
                    ,
                     
                    (
                    1
                    )
                
where                 
                    
                                    Q
                                
                                ^
                            
                            P
                        
                            s
                            ,
                            a
                        
             is the modified Q function,                 
                    
                            Q
                        
                            P
                        
                            s
                            ,
                            a
                        
             is the original Q function, and                 
                    V
                    a
                    
                            r
                        
                            k
                        
                                    Q
                                
                                    P
                                
                                    k
                                
                                    s
                                    ,
                                    a
                                
             is the variance of the Q function across                 
                    k
                
             different models, and                 
                    
                            λ
                        
                            P
                        
             is a constant;" PAN [page 4, Algorithm 1] teaches: "Compute                 
                    
                                    Q
                                
                                ^
                            
                            P
                        
                            s
                            ,
                            a
                        
             according to (1) and (2) ;" Additionally, PAN [page 4, Algorithm 1] teaches: initializing  both protagonist and adversarial networks (i.e., table Q), where the networks receive actions and states as inputs during training.” Examiner's note: under BRI, "interim learning table Q" can be interpreted as the modified Q function, which is based on the initialized values used for the protagonist and adversarial network.”)
selecting an action                         
                            
                                    a
                                
                                    t
                                
                     from the plurality of actions based on the interim learning table                         
                            
                                    Q
                                
                                ^
                            
                     and a given state                         
                            
                                    s
                                
                                    t
                                
                     from the plurality of states; (PAN [page 4, Algorithm 1] teaches: "Select action according to                         
                            
                                            Q
                                        
                                        ^
                                    
                                    g
                                
                            (
                            s
                            ,
                            a
                            )
                        
                     by applying                         
                            ε
                        
                    -greedy strategy". Examiner's note: under BRI, the argument  "s" passed to                         
                            
                                            Q
                                        
                                        ^
                                    
                                    g
                                
                            (
                            s
                            ,
                            a
                            )
                        
                     is the given state.)
computing a reward                         
                            
                                    r
                                
                                    t
                                
                     and a next state                         
                            
                                    s
                                
                                    t
                                    +
                                    1
                                
                     based on the selected action                         
                            
                                    a
                                
                                    t
                                
                    ; and (PAN [page 4, Algorithm 1] teaches: "Execute action and get obs; reward; done;" Examiner's note: under BRI, the obs can be interpreted as the next state.)
for at least two values of i, where i= 1, 2, ..., k, computing a respective updated learning table                         
                            
                                    Q
                                
                                    i
                                
                     of the plurality of updated learning tables based on (                        
                            
                                    s
                                
                                    t
                                
                            ,
                             
                                    a
                                
                                    t
                                
                            ,
                             
                                    r
                                
                                    t
                                
                            ,
                             
                                    s
                                
                                    t
                                    +
                                    1
                                
                    ) and the utility function. (PAN [page 4, Algorithm 1] teaches:
                
                    G
                    e
                    n
                    e
                    r
                    a
                    t
                    e
                     
                    m
                    a
                    s
                    k
                     
                    M
                    ∈
                    
                            R
                        
                            k
                        
                    ~
                    P
                    o
                    i
                    s
                    s
                    o
                    n
                    (
                    q
                    )
                
                    U
                    p
                    d
                    a
                    t
                    e
                     
                            Q
                        
                            p
                        
                            i
                        
                    w
                    i
                    t
                    h
                     
                    R
                    
                            B
                        
                            p
                        
                    a
                    n
                    d
                     
                            M
                        
                            i
                        
                    ,
                     
                    i
                    =
                    1,2
                    ,
                    …
                    ,
                    k
                    ;
                
                    U
                    p
                    d
                    a
                    t
                    e
                     
                            Q
                        
                            A
                        
                            i
                        
                    w
                    i
                    t
                    h
                     
                    R
                    
                            B
                        
                            A
                        
                    a
                    n
                    d
                     
                            M
                        
                            i
                        
                    ,
                     
                    i
                    =
                    1,2
                    ,
                    …
                    ,
                    k
                    ;
                
Additionally, PAN [page 3, B. Reward Design and Risk Modeling] teaches: “When updating Q functions, our algorithm (like DQN [2]) samples a batch of data of size B from the replay buffer                 
                    
                                                    s
                                                    ,
                                                    a
                                                    ,
                                                    
                                                            s
                                                        
                                                            '
                                                        
                                                    ,
                                                    r
                                                    ,
                                                    d
                                                    o
                                                    n
                                                    e
                                                
                                            t
                                        
                            t
                            =
                            1
                        
                            B
                        
             which, for each data point, includes the state, action, next state, reward, and task completion signal.”)
and the utility function. (DIXON [page 358, Example 10.8] teaches: "In particular, one popular choice is given by the exponential utility function                         
                            U
                            
                                    X
                                
                            =
                            -
                            e
                            x
                            p
                            (
                            -
                            γ
                            X
                            )
                        
                    , (10.18)".)

Regarding Claim 5:
The combination of PAN/MIHATSCH/DIXON teaches the elements of claim 4 as outlined above. The combination of PAN/MIHATSCH/DIXON also teaches:
The computer-implemented system of claim 4, wherein the averaged learning table Q' is computed as                         
                            
                                    1
                                
                                    k
                                
                                    ∑
                                    
                                        i
                                        =
                                        1
                                    
                                        k
                                    
                                            Q
                                        
                                            i
                                        
                    . (PAN [page 3, III. Risk Averse Robust Adversarial RL] teaches: " Defining the mean as                         
                            
                                    Q
                                
                                ~
                            
                                    s
                                    ,
                                    a
                                
                            =
                            
                                    1
                                
                                    k
                                
                                    ∑
                                    
                                        i
                                        =
                                        1
                                    
                                        k
                                    
                                            Q
                                        
                                            i
                                        
                                    (
                                    s
                                    ,
                                    a
                                    )
                                
                    ”.)

Regarding Claim 6:
The combination of PAN/MIHATSCH/DIXON teaches the elements of claim 1 as outlined above. The combination of PAN/MIHATSCH/DIXON also teaches:
The computer-implemented system of claim 1, wherein the utility function is a first utility function, and (DIXON [page 358, Example 10.8] teaches: "In particular, one popular choice is given by the exponential utility function                         
                            U
                            
                                    X
                                
                            =
                            -
                            e
                            x
                            p
                            (
                            -
                            γ
                            X
                            )
                        
                    , (10.18)". Examiner’s note: we note that the formula in DIXON has the negative parameter                         
                            -
                            γ
                        
                    . Furthermore, MIHATSCH [page 272, section 4. Risk-sensitive control based on exponential utilities] teaches: "Therefore, the objective is risk-averse for β < 0 and risk-seeking for β > 0." Moreover, PAN [page 1, Fig. 1] teaches: "The risk-averse protagonist and risk-seeking adversarial agents learn policies to maximize or minimize reward, respectively.” Additionally, the purpose of DIXON's utility function is to maximize the utility (reward) function, and the protagonist’s objectives are risk-averse (i.e., avoid risk), therefore a person having ordinary skill in the art would have been able to determine that a parameter −γ (i.e.,                         
                            β
                            <
                            0
                        
                    ) would maximize the goal of the protagonist. Therefore, under BRI, “a first utility function” can be interpreted as DIXON’s utility function.)
wherein the software code, when executed at said at least one processor, further causes said system to: (Examiner’s note: this limitation is taught by the combination of PAN/MIHATSCH/DIXON in claim 1; under BRI, “software code” can be interpreted as PAN’s TORCs simulator (PAN [page 2, I. Introduction]), and “at least one processor” can be interpreted as DIXON’s GPU card (DIXON [page 148, section 5.2.1 Computational Considerations]).)
instantiate an adversarial agent that maintains an adversarial reinforcement learning neural network and generates, according to outputs of said adversarial reinforcement learning neural network, signals for communicating adversarial task requests; (PAN [page 1, I. Introduction] teaches: "We additionally focus on a framework that includes an adversary in addition to the main (i.e., protagonist) agent." PAN [page 1, Abstract] teaches: "We use an ensemble of policy networks to model risk as the variance of value functions." PAN [page 3, B. Reward Design and Risk Modeling] teaches: "Inspired by [14], we estimate the variance of Q value functions by training multiple Q value networks in parallel." PAN [page 3, C. Risk Averse RARL] teaches: "The Q learning Bellman equation is modified to be compatible with this case. Let the current and target value functions be                         
                            
                                    Q
                                
                                    P
                                
                     and                         
                            
                                    Q
                                
                                    P
                                
                                    *
                                
                     for the protagonist, and (respectively)                         
                            
                                    Q
                                
                                    A
                                
                     and                         
                            
                                    Q
                                
                                    P
                                
                                    *
                                
                     for the adversary." PAN [page 3, B. Reward Design and Risk Modeling] teaches: "The variance in Equation 3 measures risk, and our goal is for the protagonist and adversarial agents to select actions with low and high variance, respectively." PAN [ page 3, III. Risk Averse Robust Adversarial RL] teaches: "For example, in autonomous driving, a single risky action may not put the vehicle in a dangerous condition." PAN [page 4, IV. Experiments] teaches: "The vehicle can execute nine actions: (1) move left and accelerate, (2) move ahead and accelerate, (3) move right and accelerate, (4) move left, (5) do nothing, (6) move right, (7) move right and decelerate, (8) move ahead and decelerate, (9) move right and decelerate." Examiner's note: under BRI, "instantiate an adversarial agent" can be interpreted as including the adversary agent, and "that maintains a reinforcement learning neural network" can be interpreted as the Q Value networks that are trained when estimating the Q value functions in the two-agent reinforcement learning scenario. Furthermore, under BRI, "generates [...] signals for communicating task requests" can be interpreted as the action selected to for driving the autonomous vehicle, for example, which are the outputs of the Q network (i.e., reinforcement learning neural network), based on the goal of selecting high variance actions.
initialize an adversarial learning table                         
                            
                                    Q
                                
                                    A
                                
                     for the adversarial agent; (PAN [page 4, Algorithm 1] teaches initializing  both protagonist and adversarial networks (i.e., table Q), where the networks receive as input actions and states during training.)
compute a plurality of updated adversarial learning tables based on the initialized adversarial learning table                         
                            
                                    Q
                                
                                    A
                                
                     […] (PAN [page 3, B. Reward Design and Risk Modeling] teaches: "When updating Q functions, our algorithm (like DQN [2]) samples a batch of data of size B from the replay buffer {(s,a,s0,r,done)t} B t=1 which, for each data point, includes the state, action, next state, reward, and task completion signal." PAN [page 1, Fig. 1] teaches: "The risk-averse protagonist and risk-seeking adversarial agents learn policies to maximize or minimize reward, respectively. The use of the adversary helps the protagonist to effectively explore risky states." PAN [page 4, Algorithm 1] teaches updating the Q networks:
                
                    U
                    p
                    d
                    a
                    t
                    e
                     
                            Q
                        
                            p
                        
                            i
                        
                    w
                    i
                    t
                    h
                     
                    R
                    
                            B
                        
                            p
                        
                    a
                    n
                    d
                     
                            M
                        
                            i
                        
                    ,
                     
                    i
                    =
                    1,2
                    ,
                    …
                    ,
                    k
                    ;
                
                    U
                    p
                    d
                    a
                    t
                    e
                     
                            Q
                        
                            A
                        
                            i
                        
                    w
                    i
                    t
                    h
                     
                    R
                    
                            B
                        
                            A
                        
                    a
                    n
                    d
                     
                            M
                        
                            i
                        
                    ,
                     
                    i
                    =
                    1,2
                    ,
                    …
                    ,
                    k
                    ;
                
Examiner’s note: under BRI, “compute a plurality of updated adversarial learning tables based on the initialized learning table                 
                    
                            Q
                        
                            A
                        
            ” can be interpreted as updating Q functions, which in algorithm 1 the updating takes place after the Q networks have been initialized (i.e., based on the initialized learning table                 
                    
                            Q
                        
                            A
                        
            ).)
[…] using a second utility function, the second utility function comprising a monotonically increasing convex function; (DIXON [page 358, Example 10.8] teaches: "In particular, one popular choice is given by the exponential utility function                         
                            U
                            
                                    X
                                
                            =
                            -
                            e
                            x
                            p
                            (
                            -
                            γ
                            X
                            )
                        
                    , (10.18)". Examiner’s note: while the formula in DIXON has the negative parameter, MIHATSCH [page 272, section 4. Risk-sensitive control based on exponential utilities] teaches: "Therefore, the objective is risk-averse for β < 0 and risk-seeking for β > 0." Moreover, PAN [page 1, Fig. 1] teaches: "The risk-averse protagonist and risk-seeking adversarial agents learn policies to maximize or minimize reward, respectively.” Additionally, the purpose of DIXON's utility function is to maximize the utility (reward) function of agents, and the adversarial agent's objectives are risk-seeking, therefore a person having ordinary skill in the art would have been able to determine that a parameter −γ (i.e.,                         
                            β
                            <
                            0
                        
                    ) would have minimized the adversarial agent’s goal, thus changing the parameter condition to                         
                            β
                            >
                            0
                        
                     would then maximize the reward for the adversarial agent, and would make the function a monotonically increasing convex function.)
generate an averaged adversarial learning table                         
                            
                                    Q
                                
                                    A
                                
                    ' based on the plurality of updated adversarial learning tables. (PAN [page 3, III. Risk Averse Robust Adversarial RL] teaches: " Defining the mean as                         
                            
                                    Q
                                
                                ~
                            
                                    s
                                    ,
                                    a
                                
                            =
                            
                                    1
                                
                                    k
                                
                                    ∑
                                    
                                        i
                                        =
                                        1
                                    
                                        k
                                    
                                            Q
                                        
                                            i
                                        
                                    (
                                    s
                                    ,
                                    a
                                    )
                                
                    . Additionally, the mean value                         
                            
                                    Q
                                
                                ~
                            
                            (
                            s
                            ,
                            a
                            )
                        
                      is used for both the protagonist and adversarial agents (i.e.,                         
                            
                                    Q
                                
                                    A
                                
                    ’) for selecting actions.)

Regarding Claim 7:
The combination of PAN/MIHATSCH/DIXON teaches the elements of claim 6 as outlined above. The combination of PAN/MIHATSCH/DIXON also teaches:
The computer-implemented system of claim 6, wherein said adversarial agent is configured to select an adversarial action based on the averaged adversarial learning table                         
                            
                                    Q
                                
                                    A
                                
                    ' to minimize a reward for the automated agent. (PAN [page 1, Fig. 1] teaches: "The risk-averse protagonist and risk-seeking adversarial agents learn policies to maximize or minimize reward, respectively." PAN [page 3, B. Reward Design and Risk Modeling] teaches: "Hereafter, we use Q to denote the entire Q value network, and use                         
                            
                                    Q
                                
                                    i
                                
                     to denote the                         
                            i
                        
                    -th head of the multi-heads Q value network" PAN [page 3, III. Risk Averse Robust Adversarial RL] teaches: " the input is passed to k different heads which perform one dense layer to obtain k action-value outputs:                         
                            {
                            
                                    Q
                                
                                    1
                                
                                    s
                                    ,
                                     
                                    ∙
                                
                            ,
                            …
                            ,
                             
                                    Q
                                
                                    k
                                
                                    s
                                    ,
                                    ∙
                                
                            }
                        
                    . Defining the mean as                         
                            
                                    Q
                                
                                ~
                            
                                    s
                                    ,
                                    a
                                
                            =
                            
                                    1
                                
                                    k
                                
                                    ∑
                                    
                                        i
                                        =
                                        1
                                    
                                        k
                                    
                                            Q
                                        
                                            i
                                        
                                    (
                                    s
                                    ,
                                    a
                                    )
                                
                    , the variance of a single action a is,
                
                    V
                    a
                    
                            r
                        
                            k
                        
                            Q
                            
                                    s
                                    ,
                                    a
                                
                    =
                    
                            1
                        
                            k
                        
                            ∑
                            
                                i
                                =
                                1
                            
                                k
                            
                                                    Q
                                                
                                                    i
                                                
                                                    s
                                                    ,
                                                    a
                                                
                                            -
                                            
                                                    Q
                                                
                                                ~
                                            
                                                    s
                                                    ,
                                                    a
                                                
                                    2
                                
where we use the k subscripts to indicate variance over k models, as in Equations 1 and 2.” PAN [page 3, III. Risk Averse Robust Adversarial RL] teaches: "At test time, the mean value                 
                    
                            Q
                        
                        ~
                    
                    (
                    s
                    ,
                    a
                    )
                
             is used for selecting actions.” PAN [page 4, IV. Experiments] teaches: "The vehicle can execute nine actions: […]". Examiner's note: the                 
                    i
                
            -th head of the multi heads refers to each respective agent in algorithm 1 "Initialize:                 
                    
                            Q
                        
                            P
                        
                            i
                        
                    ,
                    
                            Q
                        
                            A
                        
                            i
                        
                            i
                            =
                            1
                            ,
                            …
                            ,
                            k
                        
                    ;
                
            ”. Additionally, the mean value                 
                    
                            Q
                        
                        ~
                    
                    (
                    s
                    ,
                    a
                    )
                
              is used for both the protagonist and adversarial agents for selecting actions.)

Regarding Claim 8:
The combination of PAN/MIHATSCH/DIXON teaches the elements of claim 6 as outlined above. The combination of PAN/MIHATSCH/DIXON also teaches:
The computer-implemented system of claim 6, wherein the second utility function is represented by                         
                            
                                    u
                                
                                    A
                                
                                    x
                                
                            =
                            -
                            
                                    e
                                
                                            β
                                        
                                            A
                                        
                                    x
                                
                            ,
                             
                                    β
                                
                                    A
                                
                            >
                            0
                        
                    . (DIXON [page 358, Example 10.8] teaches: "In particular, one popular choice is given by the exponential utility function                         
                            U
                            
                                    X
                                
                            =
                            -
                            e
                            x
                            p
                            (
                            -
                            γ
                            X
                            )
                        
                    , (10.18)". Examiner’s note: while the formula in DIXON has the negative parameter, MIHATSCH [page 272, section 4. Risk-sensitive control based on exponential utilities] teaches: "Therefore, the objective is risk-averse for β < 0 and risk-seeking for β > 0." Moreover, PAN [page 1, Fig. 1] teaches: "The risk-averse protagonist and risk-seeking adversarial agents learn policies to maximize or minimize reward, respectively.” Additionally, the purpose of DIXON's utility function is to maximize the utility (reward) function of agents, and the adversarial agent's objectives are risk-seeking, thus a person having ordinary skill in the art would have been able to determine that a parameter −γ (i.e.,                         
                            β
                            <
                            0
                        
                    ) would have minimized the adversarial agent’s goal, thus changing the parameter condition to                         
                            β
                            >
                            0
                        
                     would then maximize the reward for the adversarial agent, and would make the function a monotonically increasing convex function. Therefore, under BRI, the “second utility function” can be interpreted as DIXON’s utility function with parameters such as to maximize the reward of the adversarial agent, which is to minimize the reward of the protagonist (i.e., automated agent).)

Regarding Claim 9:
The combination of PAN/MIHATSCH/DIXON teaches the elements of claim 6 as outlined above. The combination of PAN/MIHATSCH/DIXON also teaches:
The computer-implemented system of claim 6, wherein computing a plurality of updated adversarial learning tables comprises:
receiving, by way of said communication interface, an input parameter k and a training step parameter T; and (PAN [page 4, Algorithm 1] teaches: "Input: Training steps T; Environment                         
                            e
                            n
                            v
                        
                    ; Adversarial Action Schedule                         
                            Ξ
                        
                    ; Exploration rate                         
                            ε
                        
                    ; Number of models                         
                            k
                        
                    ." PAN [page 2, I. Introduction] teaches: "We use a discrete control task, autonomous driving with the TORCS [15] simulator". Examiner's note: under BRI, "said communication interface" can be interpreted as the TORCS simulator.)
for each training step t, where t= 1, 2... T: (PAN [page 4, Algorithm 1] teaches: “While                         
                            t
                            <
                            T
                        
                     do”)
computing an interim adversarial learning table                         
                            
                                            Q
                                        
                                        ^
                                    
                                    A
                                
                     based on the initialized adversarial learning table                         
                            
                                    Q
                                
                                    A
                                
                    ; (PAN [page 4, Algorithm 1] teaches: “"Initialize:                         
                            
                                    Q
                                
                                    P
                                
                                    i
                                
                            ,
                            
                                    Q
                                
                                    A
                                
                                    i
                                
                                    i
                                    =
                                    1
                                    ,
                                    …
                                    ,
                                    k
                                
                            ;
                        
                    ” PAN [page 4, Algorithm 1] teaches: “Choose Agent g from                         
                            {
                            A
                            
                                    A
                                    d
                                    v
                                    e
                                    r
                                    s
                                    a
                                    r
                                    i
                                    a
                                    l
                                     
                                    a
                                    g
                                    e
                                    n
                                    t
                                
                            ,
                             
                            P
                            
                                    P
                                    r
                                    o
                                    t
                                    a
                                    g
                                    o
                                    n
                                    i
                                    s
                                    t
                                     
                                    a
                                    g
                                    e
                                    n
                                    t
                                
                            }
                        
                     according to                         
                            Ξ
                        
                    ; Compute                         
                            
                                            Q
                                        
                                        ^
                                    
                                    g
                                
                                    s
                                    ,
                                    a
                                
                     according to (1) and (2) ;” Furthermore, PAN [page 3, A. Two Player Reinforcement Learning] teaches: “To encourage the adversary to systematically seek adverse outcomes, its modified value function for action selection is
                
                                    Q
                                
                                ^
                            
                            A
                        
                            s
                            ,
                            a
                        
                    =
                    
                            Q
                        
                            A
                        
                            s
                            ,
                            a
                        
                    +
                    
                            λ
                        
                            A
                        
                    V
                    a
                    
                            r
                        
                            k
                        
                                    Q
                                
                                    A
                                
                                    k
                                
                                    s
                                    ,
                                    a
                                
                    ,
                     
                    (
                    2
                    )
                
where                 
                    
                                    Q
                                
                                ^
                            
                            A
                        
                            s
                            ,
                            a
                        
             is the modified Q function,                 
                    
                            Q
                        
                            A
                        
                            s
                            ,
                            a
                        
             is the original Q function,                 
                    V
                    a
                    
                            r
                        
                            k
                        
                                    Q
                                
                                    A
                                
                                    k
                                
                                    s
                                    ,
                                    a
                                
             is the variance of the Q function across                 
                    k
                
             different models, and                 
                    
                            λ
                        
                            A
                        
             is a constant. Examiner’s note: under BRI, “an interim adversarial learning table” can be interpreted as the modified function                 
                    
                                    Q
                                
                                ^
                            
                            A
                        
                            s
                            ,
                            a
                        
             which is based on the initialized Q networks                 
                    
                            Q
                        
                            P
                        
                            i
                        
                    ,
                    
                            Q
                        
                            A
                        
                            i
                        
                            i
                            =
                            1
                            ,
                            …
                            ,
                            k
                        
            .)
selecting an adversarial action                         
                            
                                    a
                                
                                    t
                                
                                    A
                                
                     based on the interim adversarial learning table                         
                            
                                            Q
                                        
                                        ^
                                    
                                    A
                                
                     and a given state                         
                            
                                    s
                                
                                    t
                                
                     from the plurality of states; (PAN [page 4, Algorithm 1] teaches: “Select action according to                         
                            
                                            Q
                                        
                                        ^
                                    
                                    g
                                
                                    s
                                    ,
                                    a
                                
                     by applying                         
                            ε
                        
                    -greedy strategy;” Examiner’s note: the same selecting process applies to both the protagonist and adversarial agents with their respective parameters.)).
computing an adversarial reward                         
                            
                                    r
                                
                                    t
                                
                                    A
                                
                     and a next state                         
                            
                                    s
                                
                                    t
                                    +
                                    1
                                
                     based on the selected adversarial action                         
                            
                                    a
                                
                                    t
                                
                                    A
                                
                    ; and (PAN [page 4, Algorithm 1] teaches: "Execute action and get obs; reward; done;" Examiner's note: under BRI, the obs can be interpreted as the next state. The same computing process applies to both the protagonist and adversarial agents with their respective parameters.))
for at least two values of i, where i= 1, 2, ..., k, computing a respective updated adversarial learning table                         
                            
                                    Q
                                
                                    A
                                
                                    i
                                
                     of the plurality of updated adversarial learning tables based on (                        
                            
                                    s
                                
                                    t
                                
                            ,
                             
                                    a
                                
                                    t
                                
                                    A
                                
                            ,
                             
                                    r
                                
                                    t
                                
                                    A
                                
                            ,
                             
                                    s
                                
                                    t
                                    +
                                    1
                                
                    ) and the second utility function. (PAN [page 4, Algorithm 1] teaches:
                
                    G
                    e
                    n
                    e
                    r
                    a
                    t
                    e
                     
                    m
                    a
                    s
                    k
                     
                    M
                    ∈
                    
                            R
                        
                            k
                        
                    ~
                    P
                    o
                    i
                    s
                    s
                    o
                    n
                    (
                    q
                    )
                
                    U
                    p
                    d
                    a
                    t
                    e
                     
                            Q
                        
                            p
                        
                            i
                        
                    w
                    i
                    t
                    h
                     
                    R
                    
                            B
                        
                            p
                        
                    a
                    n
                    d
                     
                            M
                        
                            i
                        
                    ,
                     
                    i
                    =
                    1,2
                    ,
                    …
                    ,
                    k
                    ;
                
                    U
                    p
                    d
                    a
                    t
                    e
                     
                            Q
                        
                            A
                        
                            i
                        
                    w
                    i
                    t
                    h
                     
                    R
                    
                            B
                        
                            A
                        
                    a
                    n
                    d
                     
                            M
                        
                            i
                        
                    ,
                     
                    i
                    =
                    1,2
                    ,
                    …
                    ,
                    k
                    ;
                
Additionally, PAN [page 3, B. Reward Design and Risk Modeling] teaches: “When updating Q functions, our algorithm (like DQN [2]) samples a batch of data of size B from the replay buffer                 
                    
                                                    s
                                                    ,
                                                    a
                                                    ,
                                                    
                                                            s
                                                        
                                                            '
                                                        
                                                    ,
                                                    r
                                                    ,
                                                    d
                                                    o
                                                    n
                                                    e
                                                
                                            t
                                        
                            t
                            =
                            1
                        
                            B
                        
             which, for each data point, includes the state, action, next state, reward, and task completion signal.” Examiner’s note: the same updating process applies to both protagonist and adversarial agents with their respective parameters.)

Regarding Claim 10:
The combination of PAN/MIHATSCH/DIXON teaches the elements of claim 9 as outlined above. The combination of PAN/MIHATSCH/DIXON also teaches:
The computer-implemented system of claim 9, wherein the averaged adversarial learning table                         
                            
                                    Q
                                
                                    A
                                
                    ’ is computed as                          
                            
                                    1
                                
                                    k
                                
                                    ∑
                                    
                                        i
                                        =
                                        1
                                    
                                        k
                                    
                                            Q
                                        
                                            A
                                        
                                            i
                                        
                    . (PAN [page 3, III. Risk Averse Robust Adversarial RL] teaches: " Defining the mean as                         
                            
                                    Q
                                
                                ~
                            
                                    s
                                    ,
                                    a
                                
                            =
                            
                                    1
                                
                                    k
                                
                                    ∑
                                    
                                        i
                                        =
                                        1
                                    
                                        k
                                    
                                            Q
                                        
                                            i
                                        
                                    (
                                    s
                                    ,
                                    a
                                    )
                                
                    ”.)

Regarding Claim 11:
The claim recites similar limitations as corresponding claim 1 and is rejected for similar reasons as claim 1 using similar teachings and rationale.

Regarding Claim 12:
The combination of PAN/MIHATSCH/DIXON teaches the elements of claim 11 as outlined above. Additionally, the claim recites similar limitations as corresponding claim 2 and is rejected for similar reasons as claim 2 using similar teachings and rationale.

Regarding Claim 13:
The combination of PAN/MIHATSCH/DIXON teaches the elements of claim 11 as outlined above. Additionally, the claim recites similar limitations as corresponding claim 3 and is rejected for similar reasons as claim 3 using similar teachings and rationale.

Regarding Claim 14:
The combination of PAN/MIHATSCH/DIXON teaches the elements of claim 11 as outlined above. Additionally, the claim recites similar limitations as corresponding claim 4 and is rejected for similar reasons as claim 4 using similar teachings and rationale.

Regarding Claim 15:
The combination of PAN/MIHATSCH/DIXON teaches the elements of claim 11 as outlined above. Additionally, the claim recites similar limitations as corresponding claim 5 and is rejected for similar reasons as claim 5 using similar teachings and rationale.

Regarding Claim 16:
The combination of PAN/MIHATSCH/DIXON teaches the elements of claim 11 as outlined above. Additionally, the claim recites similar limitations as corresponding claim 6 and is rejected for similar reasons as claim 6 using similar teachings and rationale.

Regarding Claim 17:
The combination of PAN/MIHATSCH/DIXON teaches the elements of claim 16 as outlined above. Additionally, the claim recites similar limitations as corresponding claim 7 and is rejected for similar reasons as claim 7 using similar teachings and rationale.

Regarding Claim 18:
The combination of PAN/MIHATSCH/DIXON teaches the elements of claim 16 as outlined above. Additionally, the claim recites similar limitations as corresponding claim 8 and is rejected for similar reasons as claim 8 using similar teachings and rationale.

Regarding Claim 19:
The combination of PAN/MIHATSCH/DIXON teaches the elements of claim 16 as outlined above. Additionally, the claim recites similar limitations as corresponding claim 9 and is rejected for similar reasons as claim 9 using similar teachings and rationale.

Regarding Claim 20:
The combination of PAN/MIHATSCH/DIXON teaches the elements of claim 19 as outlined above. Additionally, the claim recites similar limitations as corresponding claim 10 and is rejected for similar reasons as claim 10 using similar teachings and rationale.

Regarding Claim 21:
PAN in view of MIHATSCH and DIXON teaches the elements of claim 1 as outlined above. MIHATSCH further teaches:
wherein the utility function is applied to a temporal difference error between a first state t and a next state t+1, each state associated with an action […]. (MIHATSCH [page 268, section 1. Introduction] teaches: “Instead of transforming the cumulative return of the process as in utility theory, we transform the temporal differences (so-called TD-errors) (i.e., utility function is applied to a temporal difference error) which play an important role during the procedure of learning the value or Q-function.” MIHATSCH [page 279, section 5] teaches: “Let                         
                            
                                            i
                                        
                                            0
                                        
                                    ,
                                    
                                            i
                                        
                                            1
                                        
                                    ,
                                    
                                            i
                                        
                                            2
                                        
                                    ,
                                    …
                                
                     be the sequence of states (i.e., a first state t and a next state t+1) that we obtain while interacting with the system and let                          
                            
                                            J
                                        
                                        ^
                                    
                                    t
                                
                     denote the value function approximation available after the                         
                            t
                        
                    -th time step. The risk-sensitive TD(0) algorithm updates                          
                            
                                            J
                                        
                                        ^
                                    
                                    t
                                
                      according to

    PNG
    media_image1.png
    109
    666
    media_image1.png
    Greyscale

where the stepsizes                 
                    
                            σ
                        
                            t
                            -
                            1
                        
                    (
                    i
                    )
                
             are defined to be nonzero only for the current state                 
                    
                            i
                        
                            t
                            -
                            1
                        
            . […] Let                 
                    
                                    i
                                
                                    0
                                
                            ,
                            
                                    i
                                
                                    1
                                
                            ,
                            
                                    i
                                
                                    2
                                
                            ,
                            …
                        
            be the sequence of states and actions (i.e., each state associated with an action) which we encounter while interacting with the system and let                  
                    
                                    Q
                                
                                ^
                            
                            t
                        
             denote the current approximation available after time step                 
                    t
                
            .” Examiner’s note: The transformation function                 
                    
                            χ
                        
                            K
                        
             (i.e., utility function) is applied to the TD-errors (i.e., temporal difference errors) computed between state                 
                    
                            i
                        
                            t
                            -
                            1
                        
             (a first state) and                 
                    
                            i
                        
                            t
                        
             (next state).)
PAN further teaches: each state associated with an action of the automated agent. (PAN [page 3, B. Reward Design and Risk Modeling] teaches: "When updating Q functions, our algorithm (like DQN [2]) samples a batch of data of size B from the replay buffer                         
                            
                                                            s
                                                            ,
                                                            a
                                                            ,
                                                            
                                                                    s
                                                                
                                                                    '
                                                                
                                                            ,
                                                            r
                                                            ,
                                                            d
                                                            o
                                                            n
                                                            e
                                                        
                                                    t
                                                
                                    t
                                    =
                                    1
                                
                                    B
                                
                     which, for each data point, includes the state, action, next state, reward, and task completion signal.” PAN [page 3, III. Risk Averse Robust Adversarial RL] teaches: "At test time, the mean value                         
                            
                                    Q
                                
                                ~
                            
                            (
                            s
                            ,
                            a
                            )
                        
                     is used for selecting actions." PAN [page 4, IV. Experiments] teaches: "The vehicle can execute nine actions: […]".)

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).!!!!!!)
A shortened statutory period for reply to this final action is set to expire THREE
MONTHS from the mailing date of this action. In the event a first reply is filed within TWO
MONTHS of the mailing date of this final action and the advisory action is not mailed until after
the end of the THREE-MONTH shortened statutory period, then the shortened statutory period
will expire on the date the advisory action is mailed, and any nonprovisional extension fee
(37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the
advisory action. In no event, however, will the statutory period for reply expire later than SIX
MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Alvaro S Laham Bauzo whose telephone number is (571)272-5650. The examiner can normally be reached Mon-Fri 7:30 AM - 11:00 AM | 1:00 PM - 5:30 PM ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Usmaan Saeed can be reached on (571) 272-4046. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/A.S.L./Examiner, Art Unit 2146

/USMAAN SAEED/Supervisory Patent Examiner, Art Unit 2146
Read full office action
Prosecution Timeline

Jun 10, 2022
Application Filed
Jul 11, 2025
Non-Final Rejection mailed — §101, §103
Jan 12, 2026
Response Filed
Mar 23, 2026
Final Rejection mailed — §101, §103 (current)
Precedent Cases

Applications granted by this same examiner with similar technology

17/646,082
Patent 12632705
ADVERSARIAL 3D DEFORMATIONS LEARNING
4y 4m to grant Granted May 19, 2026
17/758,166
Patent 12475388
MACHINE LEARNING MODEL SEARCH METHOD, RELATED APPARATUS, AND DEVICE
3y 4m to grant Granted Nov 18, 2025
Study what changed to get past this examiner. Based on 2 most recent grants.
Strategy Recommendation AI-generated — please review before filing

Get a prosecution strategy drawn from examiner precedents, rejection analysis, and claim mapping.
Typically takes 5-10 seconds — AI-generated, attorney review required before filing
Prosecution Projections

3-4
Expected OA Rounds
25%
Grant Probability
99%
With Interview (+100.0%)
4y 1m (~2m remaining)
Median Time to Grant
Moderate
PTA Risk
Based on 4 resolved cases by this examiner. Grant probability derived from career allowance rate.
SYSTEM AND METHOD FOR RISK SENSITIVE REINFORCEMENT LEARNING ARCHITECTURE

This examiner grants 25% of cases after interview

Examiner Intelligence

Statute-Specific Performance

Office Action

Prosecution Timeline

Precedent Cases

Applications granted by this same examiner with similar technology

Strategy Recommendation AI-generated — please review before filing

Prosecution Projections

Ready to respond to this office action?

SYSTEM AND METHOD FOR RISK SENSITIVE REINFORCEMENT LEARNING ARCHITECTURE

This examiner grants 25% of cases after interview

Examiner Intelligence

Statute-Specific Performance

Office Action

Prosecution Timeline

Precedent Cases

Applications granted by this same examiner with similar technology

Strategy Recommendation AI-generated — please review before filing

Prosecution Projections

Ready to respond to this office action?

Sign in with your work email