DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. This action is in response to an amendment filed on November 3, 2025. Claims 1-20 are pending in the current application.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-5, 8-12, and 15-19 are rejected under 35 U.S.C. 103 as being unpatentable over Di Wu et al. (hereinafter referred to as Wu) (Evading Machine Learning Botnet Detection Models via Deep Reinforcement Learning) in view of Narma Mhaisen et al. (hereinafter referred to as Mhaisen) (To chain or not to chain: A reinforcement learning approach for blockchain-enabled IoT monitoring applications), and in further view of Borrajo et al. (hereinafter referred to as Borrajo) (Simulating and classifying behavior in adversarial environments based on action-state traces: an application to money laundering).
Regarding claim 1, Wu teaches a method comprising: training a reinforcement learning agent (“The gym framework provides a standardized environment to produce benchmarks and trains the RL agent through some methods: reset, step, and render.”, pg. 3, right column, under “Fig. 1. Structure of the framework”) actions of the reinforcement learning agent with transition values, (“P is a Markovian transition model, where P(s, a, s′) is the probability of making a transition to state s′ when taking action a in state s”, pg. 3, under “A. Deep Reinforcement Learning”) an electronic policy data structure that controls the reinforcement learning agent, when executed by one or more processors, to perform a task (“The agent follows a policy π(a|s) based on the estimated value determined by the s and r.”, pg. 3, left column, under “A. Deep Reinforcement Learning”) and evade one or more scenarios of a monitoring system that are configured to hinder the task; (“A reinforcement learning model consists of an agent and an environment. For each turn, the environment receives the action a chosen by the agent, and feeds back the observed state s′ (after executing a) and reward r. The agent follows a policy π(a|s) based on the estimated value determined by the s and r… In the context of botnet traffic evasion, we apply DQN in a reinforcement learning framework, as shown in Figure 1.”, pg. 3, left column, under “A. Deep Reinforcement Learning”; pg. 3, right column, under “B. Framework Structure”) (The agent is configured to evade detection from a deep learning detector. The detector corresponds to a monitoring system.) analyzing the steps taken in the episode to measure a strength of monitoring in the monitoring system; (“Through the score for each query, the attacker is able to directly measure the efficacy of any perturbation to the target black-box model.”, pg. 2, right column, under “Score-based attack”) (The more effective the attacker, the less effective the monitoring system is. For figures demonstrating this, see Wu’s TABLE III below.) and presenting the strength of monitoring in an interface. (“…we use OpenAI gym as the environment interface, which is a toolkit for developing and comparing reinforcement learning algorithms. The gym framework provides a standardized environment to produce benchmarks”, pg. 3, right column, under “Fig. 1. Structure of the framework”) (Also see Wu’s TABLE III below for statistics involving detection model evasions, which teaches the strength of the monitoring system.)
However, Wu does not explicitly teach to learn an electronic policy data structure that maps states of an electronic transaction system nor sampling the policy to simulate an episode of steps taken by the reinforcement learning agent, wherein the sampling includes accessing the transition values stored in the electronic policy data structure to select an action by the reinforcement learning agent, and write a step row into an episode data structure that records alert states of one or more scenarios resulting from executing the action;
Mhaisen teaches sampling the policy to simulate an episode of steps taken by the reinforcement learning agent. (“The agent’s goal is to find an optimal policy, π∗, which is the policy that would result in maximum reward throughout the monitoring episodes”, pg. 43, under “4.3.1 State transitions”) (The agent would constantly sample the policy at every step to form an episode of steps, so as to further the agent’s goal and find an optimal policy.) wherein the sampling includes accessing the transition values stored in the electronic policy data structure to select an action by the reinforcement learning agent, (“Possible sampled state and action are denoted s, a. P(·|St, At) defines the probability distribution over the next states St+1 conditioned on the current state and action.”, pg. 5, under “4.2.1. State transitions”) and write a step row into an episode data structure that records alert states of one or more scenarios resulting from executing the action; (“The agent starts the training process by repeatedly generating episodes of experience (lines 3-6) and collecting experience tuples through interactions (lines 8-10). These tuples of state–action transitions represent the experience data and are stored in the replay buffer D to be used in the learning process.”, pg. 6, left column, bottom paragraph) (The “alert states” are interpreted to be states in which a monitoring system detects an agent, which is taught by the states of Wu.)
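For illustration only, the claimed sampling can be sketched as follows: transition values are looked up in a policy structure to select each action, and each step is written as a row recording the resulting alert states. Every identifier, the toy dynamics, and the scenario rule below are hypothetical assumptions for illustration and are not taken from Wu, Mhaisen, or Borrajo.

```python
import random

def sample_episode(policy, scenarios, start_state, max_steps=10):
    """Simulate one episode; return a list of step rows."""
    episode = []  # the episode data structure: one row per step
    state = start_state
    for _ in range(max_steps):
        # access the transition values stored in the policy structure
        actions, weights = zip(*policy[state].items())
        action = random.choices(actions, weights=weights)[0]
        next_state = action  # toy dynamics: the action names the next state
        # evaluate every monitoring scenario against the new state
        alerts = {name: rule(next_state) for name, rule in scenarios.items()}
        episode.append({"state": state, "action": action, "alerts": alerts})
        state = next_state
    return episode

# Hypothetical policy table and a single monitoring scenario.
policy = {"s0": {"s1": 0.9, "s2": 0.1},
          "s1": {"s0": 0.5, "s2": 0.5},
          "s2": {"s0": 1.0}}
scenarios = {"large_transfer": lambda s: s == "s2"}
steps = sample_episode(policy, scenarios, "s0", max_steps=5)
```

Each row in `steps` is one "step row" in the sense of the claim language, pairing the chosen action with the alert states it produced.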
Therefore, it would have been considered obvious to one of ordinary skill in the art, prior to the current application’s filing date, to combine Wu’s adversarial reinforcement learning agent with the generation of episodes for the agent to learn from, as disclosed by Mhaisen. One would be motivated to combine the two teachings, prior to the filing date of the current application, as episodes allow the RL agent to train and learn an optimal policy, as disclosed in Mhaisen. (“During an episode, states are transitioned and reward values are emitted according to the details in the following subsections. It is essential to mention that the agent does not know the environment design. It is supposed to learn an optimal policy (strategy) by solely interacting and observing rewards for different decisions taken in different states… The agent’s goal is to find an optimal policy, π∗, which is the policy that would result in maximum reward throughout the monitoring episodes.”, pg. 43, first paragraph of the left column and right column.)
However, Wu, as modified by Mhaisen, does not explicitly teach to learn an electronic policy data structure that maps states of an electronic transaction system.
Borrajo teaches an electronic transaction system (see Tables 1, 2, and 3) with states to learn a policy data structure (“The learning system takes as input traces of observable behavior. A trace 𝑡𝐶 is a sequence of states and actions executed by 𝐶 in those states: 𝑡𝐶 = (𝑠0, 𝑎1, 𝑠1, 𝑎2, 𝑠2, . . . , 𝑠𝑛−1, 𝑎𝑛, 𝑠𝑛), where 𝑠𝑖 is a state and 𝑎𝑖 is an action name and its parameters. States and actions correspond to the observable predicates and actions from the viewpoint of 𝐹.”, pg. 3, right column, under “2.3 Traces of Behavior”) (While policies are not taught in this disclosure, the actions that map to states of an electronic transaction system are taught, which, in combination with the learning of a policy as disclosed in Wu, fully teaches the limitation.)
Therefore, it would have been considered obvious to one of ordinary skill in the art, prior to the current application’s filing date, to combine Wu’s Adversarial Reinforcement Learning Agent, modify it with Mhaisen, and combine it with the actions, predicates, and functions of Borrajo. One would be motivated to combine the teachings, prior to the filing date of the current application, as Borrajo’s method helps detect and classify good and bad behavior in a financial setting, as disclosed in Borrajo. (“The learning task can be defined as follows. Given: 𝑁 classes of behavior, ({good, bad} in our current application);3 and a set of labeled observed traces, 𝑇𝐶𝑖 , ∀𝐶𝑖 ∈ {g𝑜𝑜𝑑, 𝑏𝑎𝑑} Obtain: a classifier that takes as input a new (partial) trace 𝑡 (with unknown class) and outputs the predicted class… Good refers to standard customers’ behavior and bad corresponds to money laundering-related behavior.” Pg. 4, left column, under “3.1 Learning Task”)
Regarding claim 8, Wu teaches a computing system comprising: a processor; a memory operably connected to the processor; a non-transitory computer-readable medium operably connected to the processor and memory and storing computer-executable instructions (“In this paper, we propose a more general framework based on deep reinforcement learning (DRL), which effectively generates adversarial traffic flows to deceive the detection model by automatically adding perturbations to samples.”, pg. 1, Abstract) (This teaches the computing system, processor, and non-transitory computer-readable medium operably connected to the processor and memory, as one would need a system comprising such features to run a method that generates adversarial traffic flows.) training a reinforcement learning agent (“The gym framework provides a standardized environment to produce benchmarks and trains the RL agent through some methods: reset, step, and render.”, pg. 3, right column, under “Fig. 1. Structure of the framework”) actions of the reinforcement learning agent with transition values, (“P is a Markovian transition model, where P(s, a, s′) is the probability of making a transition to state s′ when taking action a in state s”, pg. 3, under “A. Deep Reinforcement Learning”) an electronic policy data structure that controls the reinforcement learning agent, when executed by one or more processors, to perform a task (“The agent follows a policy π(a|s) based on the estimated value determined by the s and r.”, pg. 3, left column, under “A. Deep Reinforcement Learning”) and evade one or more scenarios of a monitoring system that are configured to hinder the task; (“A reinforcement learning model consists of an agent and an environment. For each turn, the environment receives the action a chosen by the agent, and feeds back the observed state s′ (after executing a) and reward r. The agent follows a policy π(a|s) based on the estimated value determined by the s and r… In the context of botnet traffic evasion, we apply DQN in a reinforcement learning framework, as shown in Figure 1.”, pg. 3, left column, under “A. Deep Reinforcement Learning”; pg. 3, right column, under “B. Framework Structure”) (The agent is configured to evade detection from a deep learning detector. The detector corresponds to a monitoring system.) analyzing the steps taken in the episode to measure a strength of monitoring in the monitoring system; (“Through the score for each query, the attacker is able to directly measure the efficacy of any perturbation to the target black-box model.”, pg. 2, right column, under “Score-based attack”) (The more effective the attacker, the less effective the monitoring system is. For figures demonstrating this, see Wu’s TABLE III below.) and presenting the strength of monitoring in an interface. (“…we use OpenAI gym as the environment interface, which is a toolkit for developing and comparing reinforcement learning algorithms. The gym framework provides a standardized environment to produce benchmarks”, pg. 3, right column, under “Fig. 1. Structure of the framework”) (Also see Wu’s TABLE III below for statistics involving detection model evasions, which teaches the strength of the monitoring system.)
However, Wu does not explicitly teach to learn an electronic policy data structure that maps states of an electronic transaction system, nor to transfer an amount from a source account to a destination account, nor one or more scenarios of a monitoring system that are configured to hinder a transfer, nor sampling the policy to simulate an episode of steps taken by the reinforcement learning agent, wherein the sampling includes accessing the transition values stored in the electronic policy data structure to select an action by the reinforcement learning agent, and write a step row into an episode data structure that records alert states of one or more scenarios resulting from executing the action.
Mhaisen teaches sampling the policy to simulate an episode of steps taken by the reinforcement learning agent. (“The agent’s goal is to find an optimal policy, π∗, which is the policy that would result in maximum reward throughout the monitoring episodes”, pg. 43, under “4.3.1 State transitions”) (The agent would constantly sample the policy at every step to form an episode of steps, so as to further the agent’s goal and find an optimal policy.) wherein the sampling includes accessing the transition values stored in the electronic policy data structure to select an action by the reinforcement learning agent, (“Possible sampled state and action are denoted s, a. P(·|St, At) defines the probability distribution over the next states St+1 conditioned on the current state and action.”, pg. 5, under “4.2.1. State transitions”) and write a step row into an episode data structure that records alert states of one or more scenarios resulting from executing the action; (“The agent starts the training process by repeatedly generating episodes of experience (lines 3-6) and collecting experience tuples through interactions (lines 8-10). These tuples of state–action transitions represent the experience data and are stored in the replay buffer D to be used in the learning process.”, pg. 6, left column, bottom paragraph) (The “alert states” are interpreted to be states in which a monitoring system detects an agent, which is taught by the states of Wu.)
Therefore, it would have been considered obvious to one of ordinary skill in the art, prior to the current application’s filing date, to combine Wu’s adversarial reinforcement learning agent with the generation of episodes for the agent to learn from, as disclosed by Mhaisen. One would be motivated to combine the two teachings, prior to the filing date of the current application, as episodes allow the RL agent to train and learn an optimal policy, as disclosed in Mhaisen. (“During an episode, states are transitioned and reward values are emitted according to the details in the following subsections. It is essential to mention that the agent does not know the environment design. It is supposed to learn an optimal policy (strategy) by solely interacting and observing rewards for different decisions taken in different states… The agent’s goal is to find an optimal policy, π∗, which is the policy that would result in maximum reward throughout the monitoring episodes.”, pg. 43, first paragraph of the left column and right column.)
However, Wu, as modified by Mhaisen, does not explicitly teach to learn an electronic policy data structure that maps states of an electronic transaction system, nor to transfer an amount from a source account to a destination account, nor one or more scenarios of a monitoring system that are configured to hinder a transfer.
Borrajo teaches an electronic transaction system (see Tables 1, 2, and 3) with states to learn a policy data structure (“The learning system takes as input traces of observable behavior. A trace 𝑡𝐶 is a sequence of states and actions executed by 𝐶 in those states: 𝑡𝐶 = (𝑠0, 𝑎1, 𝑠1, 𝑎2, 𝑠2, . . . , 𝑠𝑛−1, 𝑎𝑛, 𝑠𝑛), where 𝑠𝑖 is a state and 𝑎𝑖 is an action name and its parameters. States and actions correspond to the observable predicates and actions from the viewpoint of 𝐹.”, pg. 3, right column, under “2.3 Traces of Behavior”) (While policies are not taught in this disclosure, the actions that map to states of an electronic transaction system are taught, which, in combination with the learning of a policy as disclosed in Wu, fully teaches the limitation.) to transfer an amount from a source account to a destination account (“For each action-state pair, we created standard attributes used by other works for the two partial observability models (under the bank and full models, we could observe all these attributes). Examples are average, min and max values of the previous transactions of each type (e.g. wires, or deposits), balance of accounts or number of connected accounts”, pg. 7, left column, bottom paragraph) and one or more scenarios of a monitoring system that are configured to hinder a transfer. (“In this paper, we focus on AML. Over time, financial institutions have been mandated by law enforcement agencies to improve their processes to detect suspicious activity and raise the corresponding Suspicious Activity Reports (SARs). A typical prevalent AML model starts by observing transactions, public media, or a referral, and generates alerts [11]. Then, alerts are investigated by humans who decide whether they need to report a SAR to law enforcement for the alert.”, pg. 2, right column, second paragraph) (AML systems are designed to detect suspicious activity and prevent fraudulent transactions, hindering the transfer of an amount of money.)
Therefore, it would have been considered obvious to one of ordinary skill in the art, prior to the current application’s filing date, to combine Wu’s Adversarial Reinforcement Learning Agent, modify it with Mhaisen, and combine it with the actions, predicates, and functions of Borrajo. One would be motivated to combine the teachings, prior to the filing date of the current application, as Borrajo’s method helps detect and classify good and bad behavior in a financial setting, as disclosed in Borrajo. (“The learning task can be defined as follows. Given: 𝑁 classes of behavior, ({good, bad} in our current application);3 and a set of labeled observed traces, 𝑇𝐶𝑖 , ∀𝐶𝑖 ∈ {g𝑜𝑜𝑑, 𝑏𝑎𝑑} Obtain: a classifier that takes as input a new (partial) trace 𝑡 (with unknown class) and outputs the predicted class… Good refers to standard customers’ behavior and bad corresponds to money laundering-related behavior.” Pg. 4, left column, under “3.1 Learning Task”)
Regarding claim 15, Wu teaches a non-transitory computer-readable medium having stored thereon computer-executable instructions (“In this paper, we propose a more general framework based on deep reinforcement learning (DRL), which effectively generates adversarial traffic flows to deceive the detection model by automatically adding perturbations to samples.”, pg. 1, Abstract) (This teaches the non-transitory computer-readable medium, as one would need a medium to run a method that generates adversarial traffic flows.) train a reinforcement learning agent (“The gym framework provides a standardized environment to produce benchmarks and trains the RL agent through some methods: reset, step, and render.”, pg. 3, right column, under “Fig. 1. Structure of the framework”) actions of the reinforcement learning agent with transition values, (“P is a Markovian transition model, where P(s, a, s′) is the probability of making a transition to state s′ when taking action a in state s”, pg. 3, under “A. Deep Reinforcement Learning”) an electronic policy data structure that controls the reinforcement learning agent, when executed by one or more processors, to perform a task (“The agent follows a policy π(a|s) based on the estimated value determined by the s and r.”, pg. 3, left column, under “A. Deep Reinforcement Learning”) and evade one or more scenarios of a monitoring system that are configured to hinder the task; (“A reinforcement learning model consists of an agent and an environment. For each turn, the environment receives the action a chosen by the agent, and feeds back the observed state s′ (after executing a) and reward r. The agent follows a policy π(a|s) based on the estimated value determined by the s and r… In the context of botnet traffic evasion, we apply DQN in a reinforcement learning framework, as shown in Figure 1.”, pg. 3, left column, under “A. Deep Reinforcement Learning”; pg. 3, right column, under “B. Framework Structure”) (The agent is configured to evade detection from a deep learning detector. The detector corresponds to a monitoring system.) analyze the steps taken in the episode to measure a strength of monitoring in the monitoring system; (“Through the score for each query, the attacker is able to directly measure the efficacy of any perturbation to the target black-box model.”, pg. 2, right column, under “Score-based attack”) (The more effective the attacker, the less effective the monitoring system is. For figures demonstrating this, see Wu’s TABLE III below.) and present the strength of monitoring in an interface. (“…we use OpenAI gym as the environment interface, which is a toolkit for developing and comparing reinforcement learning algorithms. The gym framework provides a standardized environment to produce benchmarks”, pg. 3, right column, under “Fig. 1. Structure of the framework”) (Also see Wu’s TABLE III below for statistics involving detection model evasions, which teaches the strength of the monitoring system.)
However, Wu does not explicitly teach to learn an electronic policy data structure that maps states of an electronic transaction system, nor one or more scenarios of a monitoring system that are configured to hinder a transfer, nor to sample the policy to simulate an episode of steps taken by the reinforcement learning agent, wherein the sampling includes accessing the transition values stored in the electronic policy data structure to select an action by the reinforcement learning agent, and write a step row into an episode data structure that records alert states of one or more scenarios resulting from executing the action.
Mhaisen teaches to sample the policy to simulate an episode of steps taken by the reinforcement learning agent. (“The agent’s goal is to find an optimal policy, π∗, which is the policy that would result in maximum reward throughout the monitoring episodes”, pg. 43, under “4.3.1 State transitions”) (The agent would constantly sample the policy at every step to form an episode of steps, so as to further the agent’s goal and find an optimal policy.) wherein the sampling includes accessing the transition values stored in the electronic policy data structure to select an action by the reinforcement learning agent, (“Possible sampled state and action are denoted s, a. P(·|St, At) defines the probability distribution over the next states St+1 conditioned on the current state and action.”, pg. 5, under “4.2.1. State transitions”) and write a step row into an episode data structure that records alert states of one or more scenarios resulting from executing the action; (“The agent starts the training process by repeatedly generating episodes of experience (lines 3-6) and collecting experience tuples through interactions (lines 8-10). These tuples of state–action transitions represent the experience data and are stored in the replay buffer D to be used in the learning process.”, pg. 6, left column, bottom paragraph) (The “alert states” are interpreted to be states in which a monitoring system detects an agent, which is taught by the states of Wu.)
Therefore, it would have been considered obvious to one of ordinary skill in the art, prior to the current application’s filing date, to combine Wu’s adversarial reinforcement learning agent with the generation of episodes for the agent to learn from, as disclosed by Mhaisen. One would be motivated to combine the two teachings, prior to the filing date of the current application, as episodes allow the RL agent to train and learn an optimal policy, as disclosed in Mhaisen. (“During an episode, states are transitioned and reward values are emitted according to the details in the following subsections. It is essential to mention that the agent does not know the environment design. It is supposed to learn an optimal policy (strategy) by solely interacting and observing rewards for different decisions taken in different states… The agent’s goal is to find an optimal policy, π∗, which is the policy that would result in maximum reward throughout the monitoring episodes.”, pg. 43, first paragraph of the left column and right column.)
However, Wu, as modified by Mhaisen, does not explicitly teach to learn an electronic policy data structure that maps states of an electronic transaction system, nor one or more scenarios of a monitoring system that are configured to hinder a transfer.
Borrajo teaches an electronic transaction system (see Tables 1, 2, and 3) with states to learn a policy data structure (“The learning system takes as input traces of observable behavior. A trace 𝑡𝐶 is a sequence of states and actions executed by 𝐶 in those states: 𝑡𝐶 = (𝑠0, 𝑎1, 𝑠1, 𝑎2, 𝑠2, . . . , 𝑠𝑛−1, 𝑎𝑛, 𝑠𝑛), where 𝑠𝑖 is a state and 𝑎𝑖 is an action name and its parameters. States and actions correspond to the observable predicates and actions from the viewpoint of 𝐹.”, pg. 3, right column, under “2.3 Traces of Behavior”) (While policies are not taught in this disclosure, the actions that map to states of an electronic transaction system are taught, which, in combination with the learning of a policy as disclosed in Wu, fully teaches the limitation.) and one or more scenarios of a monitoring system that are configured to hinder a transfer. (“In this paper, we focus on AML. Over time, financial institutions have been mandated by law enforcement agencies to improve their processes to detect suspicious activity and raise the corresponding Suspicious Activity Reports (SARs). A typical prevalent AML model starts by observing transactions, public media, or a referral, and generates alerts [11]. Then, alerts are investigated by humans who decide whether they need to report a SAR to law enforcement for the alert.”, pg. 2, right column, second paragraph) (AML systems are designed to detect suspicious activity and prevent fraudulent transactions, hindering the transfer of an amount of money.)
Therefore, it would have been considered obvious to one of ordinary skill in the art, prior to the current application’s filing date, to combine Wu’s Adversarial Reinforcement Learning Agent, modify it with Mhaisen, and combine it with the actions, predicates, and functions of Borrajo. One would be motivated to combine the teachings, prior to the filing date of the current application, as Borrajo’s method helps detect and classify good and bad behavior in a financial setting, as disclosed in Borrajo. (“The learning task can be defined as follows. Given: 𝑁 classes of behavior, ({good, bad} in our current application);3 and a set of labeled observed traces, 𝑇𝐶𝑖 , ∀𝐶𝑖 ∈ {g𝑜𝑜𝑑, 𝑏𝑎𝑑} Obtain: a classifier that takes as input a new (partial) trace 𝑡 (with unknown class) and outputs the predicted class… Good refers to standard customers’ behavior and bad corresponds to money laundering-related behavior.” Pg. 4, left column, under “3.1 Learning Task”)
Regarding claims 2, 9, and 16, Wu teaches selecting the action from a current probability distribution of available actions for a current state of the reinforcement learning agent (“A MDP [Markov Decision Process] is defined as a tuple (S, A, P, R, γ) where: S is the state space of the process; A is a finite set of actions; P is a Markovian transition model, where P(s, a, s′) is the probability of making a transition to state s′ when taking action a in state s; R is a reward (or cost) function, such that R(s, a) is the expected reward for taking action a in state s; γ ∈ [0, 1) is the discount factor for future rewards… For each turn, the environment receives the action a chosen by the agent...”, pg. 3, left column, under “Deep Reinforcement Learning”) wherein the current probability distribution favors a subset of the available actions that do not trigger an alert under the one or more scenarios (“Reward: a reward value [between 0 and R], where 0 denotes that the botnet flow was detected by the detection model and R is the reward for evading the model.”, pg. 3, right column, bottom bullets) (The less the attacker is detected, the greater the reward, so the agent is rewarded more for actions that do not alert the monitoring system.) executing the action to move the reinforcement learning agent into a new state (“For each turn, the environment receives the action a chosen by the agent…”, pg. 3, left column, under “Deep Reinforcement Learning”) and evaluating the new state with the one or more scenarios to determine alert states of the one or more scenarios resulting from the action (“For each turn, the environment receives the action a chosen by the agent, and feeds back the observed state s′ (after executing a) and reward r. The agent follows a policy π(a|s) based on the estimated value determined by the s and r.”, pg. 3, left column, under “Deep Reinforcement Learning”)
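For illustration only, a reward of 0 on detection and R on evasion induces, under Boltzmann (softmax) exploration, a distribution that favors the non-alerting subset of actions. The Q-values and temperature below are invented for illustration and do not come from the cited references.

```python
import math
import random

def boltzmann_probs(q_values, temperature=1.0):
    """Softmax over estimated action values; higher-reward (non-alerting)
    actions receive proportionally more probability mass."""
    exps = [math.exp(q / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical Q estimates: action 0 historically evaded detection
# (reward R = 10), while actions 1 and 2 triggered alerts (reward 0).
q = [10.0, 0.0, 0.0]
probs = boltzmann_probs(q, temperature=2.0)
action = random.choices(range(len(q)), weights=probs)[0]
```

With these values the evading action dominates the distribution, so the agent samples it far more often than the alerting actions.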
However, Wu does not explicitly teach appending a record of the action and the alert states to the episode as the step row.
Mhaisen teaches appending a record of the action and the alert states to the episode as the step row. (“In Algorithm 2, we initialize two copies of parameters (NN weights). The agent starts the training process by repeatedly generating episodes of experience (lines 3-6) and collecting experience tuples through interactions (lines 8-10). These tuples of state–action transitions represent the experience data and are stored in the replay buffer D to be used in the learning process”, pg. 44, left column, bottom paragraph; see also Algorithm 2) (Each experience tuple represents a state and action that is appended, at each time step, so as to keep a record of the actions and states of the agent.)
Therefore, it would have been considered obvious to one of ordinary skill in the art, prior to the current application’s filing date, to combine the RL agent of Wu with the appending of a record as disclosed by Mhaisen. One would be motivated to combine the two teachings, prior to the filing date of the current application, as recording the state allows the agent’s state and action history to be verified, as disclosed by Mhaisen. (“aggregated_d is recorded for verifiable auditability and checked against the agreed-upon rules to automatically execute business logic.”, pg. 41, right column, under “Smart contract”)
Regarding claims 3, 10, and 17, Wu, as modified by Mhaisen, teaches the method, system, and non-transitory computer-readable medium of claims 2, 9, and 16, respectively, as well as repeating the selecting the action, the executing the action, the evaluating the new state, and the appending the record until the task/transfer is complete. (“The agent follows a policy π(a|s) based on the estimated value determined by the s and r. The process stops when a target state is reached through a series of exploration and exploitation.”, pg. 5, left column, under “C. Relevant Parameters” (Wu)) (The combination teaches that the task/transfer is complete when the agent reaches a target state.)
Regarding claims 4, 11, and 18, Wu, as modified by Mhaisen, teaches the method, system, and non-transitory computer-readable medium of claims 1, 8, and 15, respectively, as well as configuring probability distributions of available actions for states of the reinforcement learning agent to favor actions that do not trigger an alert under the one or more scenarios. (“P(s, a, s′) is the probability of making a transition to state s′ when taking action a in state s; R is a reward (or cost) function, such that R(s, a) is the expected reward for taking action a in state s… Reward: a reward value ∈ {0, R}, where 0 denotes that the botnet flow was detected by the detection model and R is the reward for evading the model. In our experiments, we use R=10.”, pg. 3, left column, under Deep Reinforcement Learning; pg. 3, right column, first bullet (Wu)) (The less often the attacker is detected, the greater the reward, so the agent is rewarded more for actions that do not alert the monitoring system.)
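For illustration only, the cited reward structure (0 on detection, R=10 on evasion) and a Boltzmann-style action distribution that consequently favors non-alerting actions can be sketched as below. This is an illustrative sketch of the cited mechanism, not Wu's code; the temperature value and function names are assumptions.

```python
import math

def reward(alert_triggered, evasion_reward=10):
    """Wu's cited reward: r in {0, R}; 0 when the flow is detected
    (alert triggered), R (=10 in Wu's experiments) when the model
    is evaded."""
    return 0 if alert_triggered else evasion_reward

def boltzmann_probs(q_values, temperature=1.0):
    """Boltzmann (softmax) distribution over action values: actions
    whose estimated value is higher (i.e. those that historically
    evaded alerts and earned R rather than 0) receive higher
    selection probability."""
    exps = [math.exp(q / temperature) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]
```

Because non-alerting actions accumulate higher estimated value, the resulting distribution is configured to favor them, consistent with the claimed limitation.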
Regarding claims 5, 12, and 19, Wu, as modified by Mhaisen, teaches the method, system, and non-transitory computer-readable medium of claims 1, 8, and 15, respectively, as well as the instructions to analyze the steps taken in the episode to measure the strength of monitoring in the monitoring system further causing the computer to determine a number of steps in the episode. (“An episode is a finite sequence of states, actions, and rewards… During an episode, states are transitioned and reward values are emitted…”, pg. 43, under “4.2 Environment Designs”; “In our experiments, we adopt Boltzmann exploration to select next action and allow the agent to perform up to ten actions before declaring failure.”, pg. 5, left column, under “C. Relevant Parameters” (Wu)) (This limitation is taught because Wu must determine the number of actions performed in order to declare failure at ten actions, and an episode comprises the actions (state transitions and emitted reward values) performed by the agent. Wu’s Fig. 1, as seen below, reflects this, as the chart does not show average actions above 9.)
[Image: media_image1.png, 223 × 257, greyscale]
Wu’s Fig. 1
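For illustration only, determining the number of steps in an episode against Wu's ten-action cap can be sketched as below; this is the examiner's illustrative sketch, not code from Wu, and the names are assumptions.

```python
def steps_before_outcome(episode_rows, max_actions=10):
    """Illustrative sketch: measure monitoring strength as the number
    of steps taken in the episode, with failure declared at Wu's cited
    cap of ten actions."""
    n = len(episode_rows)
    return n, n >= max_actions   # (step count, failure declared?)
```

Counting the rows of the episode record directly yields the claimed "number of steps in the episode."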
Claims 6, 13, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Di Wu et al. (Herein referred to as Wu) (Evading Machine Learning Botnet Detection Models via Deep Reinforcement Learning) in view of Narma Mhaisen et al. (Herein referred to as Mhaisen) (To chain or not to chain: A reinforcement learning approach for blockchain-enabled IoT monitoring applications) and in further view of Khalid El-Awady (Adaptive Stress Testing for Adversarial Learning in a Financial Environment)
Regarding claims 6, 13, and 20, Wu, as modified by Mhaisen, teaches the method, system, and non-transitory computer-readable medium of claims 1, 8, and 15 respectively but does not teach determining a number of accounts used for transfer in the episode.
Khalid El-Awady teaches determining a number of accounts used for transfer in the episode. (“for each episode… Initialize the state, s0: agent randomly selects a new or old payment card account to use, daily transactions are set to 0, and the fraud indicator set to 0.”, pg. 6, Algorithm 1 AST Q-learning, (See Khalid El-Awady’s Algorithm below))
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to combine Wu’s adversarial reinforcement learning agent with the episodes of Mhaisen and the accounts disclosed in Khalid El-Awady. One would have been motivated to combine the teachings because an adversarial learning agent can be applied to adaptive stress testing in a financial environment, as disclosed by Khalid El-Awady. (“Adaptive Stress Testing for Adversarial Learning in a Financial Environment”, pg. 1, Title)
[Image: media_image2.png, 296 × 627, greyscale]
Khalid El-Awady’s Algorithm 1
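For illustration only, the cited initialization step of El-Awady's Algorithm 1 (randomly selecting a new or old payment-card account, with daily transactions and fraud indicator set to 0) and the determination of the number of accounts used in an episode can be sketched as below. This is an illustrative sketch, not El-Awady's code; the account-naming scheme and the 0.5 new-account probability are the examiner's assumptions.

```python
import random

def init_episode_state(known_accounts, rng):
    """Illustrative sketch of the cited initialization: the agent
    randomly selects a new or an old payment-card account; daily
    transactions and the fraud indicator start at 0."""
    if not known_accounts or rng.random() < 0.5:
        account = "acct-%d" % len(known_accounts)   # open a new account
        known_accounts.append(account)
    else:
        account = rng.choice(known_accounts)        # reuse an old account
    return {"account": account, "daily_tx": 0, "fraud": 0}

def accounts_used(step_records):
    """Determine the number of distinct accounts used in the episode."""
    return len({r["account"] for r in step_records})
```

Because each step row records which account was selected, counting the distinct accounts across the episode yields the claimed "number of accounts used for transfer in the episode."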
Claims 7 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Di Wu et al. (Herein referred to as Wu) (Evading Machine Learning Botnet Detection Models via Deep Reinforcement Learning) in view of Narma Mhaisen et al. (Herein referred to as Mhaisen) (To chain or not to chain: A reinforcement learning approach for blockchain-enabled IoT monitoring applications) and in further view of Yan et al. (Herein referred to as Yan) (U.S. Patent Application No. US 20180365696 A1)
Regarding claims 7 and 14, Wu, as modified by Mhaisen, teaches the method and system of claims 1 and 8, respectively, but does not teach determining a percentage of amount transferred to a destination account before a cutoff by one of (i) generation of an alert or (ii) reaching a cap on episode length.
Yan teaches determining a percentage of amount transferred to a destination account before a cutoff by reaching a cap on episode length. ("…the suspicious percentage detector 222 can be used to detect, e.g., suspicious remittances, however other transactions such as, e.g., cash transfers withdrawals, deposits, among others are contemplated. Therefore, the suspicious percentage detector 222 can receive remittance histories for each account holder in the account holder cluster 211, including, e.g., remittance percentages. Here, a remittance percentage is used to signify the remittance amount divided by an account balance for a given user…”, pg. 4, Paragraph 42)
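For illustration only, the claimed percentage-before-cutoff metric can be sketched as below, by analogy to Yan's remittance percentage (amount divided by balance). This is the examiner's illustrative sketch, not Yan's code; the per-step transfer list and parameter names are assumptions.

```python
def percent_transferred(transfers, target_amount, alert_step=None, cap=10):
    """Illustrative sketch: percentage of the intended amount that
    reached the destination account before a cutoff, where the cutoff
    is (i) the step at which an alert was generated (`alert_step`) or
    (ii) the cap on episode length."""
    cutoff = cap if alert_step is None else min(alert_step, cap)
    moved = sum(transfers[:cutoff])          # amount moved before cutoff
    return 100.0 * moved / target_amount
```

With per-step transfers of 10, 20, and 30 against a target of 100, the metric is 60% at the episode cap but only 30% if an alert fires after the second step.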
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to combine Wu’s adversarial reinforcement learning agent with the episodes of Mhaisen and the fraud detectors of Yan. One would have been motivated to combine the teachings because the agent could be combined with the fraud-detection systems to mitigate fraud in transactions, as disclosed by Yan. (“According to an aspect of the present principles, a method is provided for mitigating fraud in transactions.”, pg. 1, Paragraph 4)
Response to Arguments
Applicant's arguments filed on November 3rd, 2025 have been fully considered but, except as noted below, are not persuasive. The amendments overcome the previous 35 U.S.C. 101 rejections, which have been withdrawn.
The applicant argues in substance:
Argument 1: Mhaisen does not teach “sampling the policy to simulate an episode of steps taken by the reinforcement learning agent.”
The examiner respectfully disagrees. Mhaisen discloses that a policy is a strategy for an agent, specifically a mapping between states sₜ ∈ S and (a distribution of) actions. (pg. 5, under “4.2. Environment design”) These states are sampled, as disclosed on pg. 5 under “4.2.1. State transitions”, and form an episode for the agent to train on, as detailed in Algorithm 2 on pg. 6. Therefore, the rejection is sustained.
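For illustration only, "sampling the policy to simulate an episode of steps" can be sketched as below; this is an illustrative sketch of the cited mapping from states to action distributions, not Mhaisen's code, and all names are assumptions.

```python
import random

def sample_episode(policy, transition, s0, horizon, rng):
    """Illustrative sketch: the policy maps a state to a distribution
    over actions; each step draws (samples) an action from that
    distribution, applies the state transition, and appends the step,
    yielding a simulated episode."""
    episode, s = [], s0
    for _ in range(horizon):
        dist = policy(s)                     # {action: probability}
        actions = list(dist.keys())
        weights = list(dist.values())
        a = rng.choices(actions, weights=weights, k=1)[0]
        s_next = transition(s, a)
        episode.append((s, a, s_next))
        s = s_next
    return episode
```

The drawn actions and resulting state transitions together constitute the simulated episode of steps taken by the agent.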
Argument 2: The current references do not teach the newly amended limitations of claims 1, 8, and 15.
Applicant’s arguments with respect to claim(s) 1, 8, and 15 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Argument 3: The alternative limitation in claim 3, a cap on the number of steps, has been removed from claim 3. Therefore, the combination of references fails to teach each limitation of the claims.
The examiner respectfully disagrees. Wu explicitly teaches repeating the appending of a record until the task is complete, as cited above, and it would have been obvious to configure Wu’s method to terminate the appending of the record when an alert is triggered, as Wu discloses a reward function that provides 0 reward when an alert is triggered and the agent is detected. Therefore, the rejection is sustained.
Argument 4: The current references do not teach the newly amended limitations of claim 8, specifically transferring an amount from a source account to a destination account and a monitoring system configured to hinder the transfer.
Applicant’s arguments with respect to claim 8, have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Tyler E Iles whose telephone number is (571)272-5442. The examiner can normally be reached 9:00am - 5:00pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki can be reached at (571) 272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/T.E.I./ Patent Examiner, Art Unit 2122
/KAKALI CHAKI/ Supervisory Patent Examiner, Art Unit 2122