Prosecution Insights
Last updated: April 19, 2026
Application No. 17/951,870

ADVERSARIAL ATTACKS FOR IMPROVING COOPERATIVE MULTI-AGENT REINFORCEMENT LEARNING SYSTEMS

Final Rejection — §101, §103
Filed: Sep 23, 2022
Examiner: TRAN, TAN H
Art Unit: 2141
Tech Center: 2100 — Computer Architecture & Software
Assignee: International Business Machines Corporation
OA Round: 2 (Final)
Grant Probability: 60% (Moderate)
OA Rounds: 3-4
To Grant: 3y 6m
With Interview: 92%
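Assuming the with-interview figure is simply the career allow rate plus the interview lift reported below, the arithmetic checks out: 60.0% + 31.8% = 91.8%, which rounds to the 92% shown.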

Examiner Intelligence

Career Allow Rate: 60% (grants 60% of resolved cases; 184 granted / 307 resolved; +4.9% vs TC avg)
Interview Lift: +31.8% on resolved cases with interview (strong, roughly +32%)
Typical Timeline: 3y 6m avg prosecution; 60 currently pending
Career History: 367 total applications across all art units

Statute-Specific Performance

§101: 14.4% (-25.6% vs TC avg)
§103: 55.3% (+15.3% vs TC avg)
§102: 19.2% (-20.8% vs TC avg)
§112: 6.1% (-33.9% vs TC avg)
Tech Center average estimates shown for comparison. Based on career data from 307 resolved cases.

Office Action

Rejections: §101, §103
Notice of Pre-AIA or AIA Status

1. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION

2. This Office Action is sent in response to Applicant's Communication received on 12/18/2025 for application number 17/951,870.

Response to Amendments

3. The Amendment filed 12/18/2025 has been entered. Claims 1, 4, 7, 11, 13, and 16-18 have been amended. Claims 1-20 remain pending in the application.

Response to Arguments

Step 2A, Prong 1: Applicant argues, citing paragraphs [0018]-[0021] of the specification, that the claimed invention is not directed to an abstract idea and that, contrary to the Examiner's assertions, claim 1 does not recite any mathematical relationship within the context of the § 101 analysis. The Examiner respectfully disagrees: claim 1 describes mathematical concepts via optimization and model-based selection. In particular, claim 1 recites "a perturbation optimizer to generate a state perturbation, based on the dynamics model, selecting the state perturbation that transitions towards a targeted failure state." An optimizer that generates a perturbation to drive the system toward a target state is, by its plain meaning, an optimization calculation. This goes beyond merely reciting training at a high level: the claim recites an optimizer and an objective-driven selection criterion, which describes the mathematical operation at a functional level. Accordingly, the Examiner maintains that claim 1 recites an abstract idea in the form of mathematical concepts, even though it does not explicitly present equations.

Step 2A, Prong 2: Applicant argues that claim 1 is directed to an improved technical solution for performing adversarial attacks on c-MARL systems to evaluate and improve their robustness. See specification at paragraphs [0018]-[0021]. Applicant contends that the specification explains how to implement the invention in sufficient detail that one of ordinary skill in the art would recognize the claimed invention as providing a technical improvement in robustness evaluation of machine learning models. The Examiner respectfully disagrees and notes that, while the specification describes examples and asserts beneficial applications, claim 1 remains at a high level and does not recite the specific technical mechanism that achieves the robustness improvement. Under MPEP 2106.04(d), merely stating a desired result or using a computer as a tool to perform the abstract idea generally does not integrate the judicial exception into a practical application.

Claim Rejections: Applicant argues that Lin does not teach selecting the state perturbation that transitions the c-MARL system towards a targeted failure state. This argument is moot because it addresses a newly presented limitation that changes the scope of the claim; a newly found reference, Lin 2019, is applied. Applicant also argues that Lin does not teach selecting one or more agents of the c-MARL system as having enhanced vulnerability of transitioning the c-MARL system towards the targeted failure state. The Examiner respectfully disagrees and notes that Lin states "Our threat model considers a single vulnerable agent whose observation can be modified by the adversary. We assume the adversary has already predetermined this agent." This disclosure teaches both that the agent is characterized as vulnerable and that the adversary selects that agent. Further, Lin 2019's attack framework is designed to drive the cooperative system toward an undesired outcome by targeting that vulnerable agent's observation such that the agent takes adversarially selected actions that reduce team reward.

Claim Rejections - 35 USC § 101

4. 35 U.S.C. 101 reads as follows: Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.

Step 1: the claims are directed to a process, machine, and manufacture.

Step 2A, Prong 1: Claims 1, 11, and 16 recite, in part: a perturbation optimizer to generate a state perturbation, based on the dynamics model, selecting the state perturbation that transitions towards a targeted failure state (mathematical concepts: mathematical calculations/optimization); and selecting one or more agents as having enhanced vulnerability of transitioning towards the targeted failure state (mental processes: evaluation/judgment about information).

Step 2A, Prong 2: this judicial exception is not integrated into a practical application. The additional elements are: a processor set, one or more computer readable storage media, and program instructions collectively stored on the one or more computer readable storage media (mere instructions to apply the exception using a generic computer component); "by a computing system" and "processing, by the computer system" (mere instructions to apply the exception using a generic computer component); attacking, by the computing system, the c-MARL system based on the state perturbation and the selected one or more agents (mere instructions to apply the exception using a generic computer component); and evaluating, by the computing system, a level of success of the attack to improve robustness of the c-MARL system (insignificant extra-solution activity).

Step 2B: the claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception, either alone or in combination. The additional elements are the same as identified under Step 2A, Prong 2: the processor set, storage media, and program instructions; the "by a computing system" / "by the computer system" recitations; and the attacking step (each mere instructions to apply the exception using a generic computer component); and the evaluating step (insignificant extra-solution activity).

Claims 2-10, 12-15, and 17-20 provide further limitations to the abstract idea (mathematical concepts and/or mental processes) as rejected in claims 1, 11, and 16; however, they do not disclose any additional elements that would amount to a practical application or significantly more than an abstract idea (data gathering/insignificant extra-solution activity and/or generic computer component).
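For concreteness, the "perturbation optimizer" limitation at the center of the Step 2A dispute can be sketched in code. The following is a minimal, hypothetical illustration only, assuming a differentiable learned dynamics model, an L-infinity perturbation budget, and projected gradient descent; the function name, signature, and hyperparameters are assumptions, not anything taken from the application or the cited references.

```python
# Hypothetical sketch (not from the application): a perturbation optimizer
# that selects a bounded state perturbation predicted to move the system
# toward a targeted failure state, using a learned dynamics model.
import torch

def optimize_state_perturbation(dynamics_model, state, joint_actions,
                                failure_state, epsilon=0.05, steps=20, lr=0.01):
    """Projected gradient descent on delta so that the dynamics model's
    predicted next state approaches the targeted failure state."""
    delta = torch.zeros_like(state, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        predicted_next = dynamics_model(state + delta, joint_actions)
        loss = torch.norm(predicted_next - failure_state)  # distance to failure
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-epsilon, epsilon)  # enforce the attack budget
    return delta.detach()
```

Framed this way, the loop is literally an optimization calculation (the Examiner's Prong 1 position), while the applicant's Prong 2 position is that running such a loop against a live c-MARL system is a specific robustness-testing mechanism rather than mathematics in the abstract.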
Claim Rejections – 35 USC § 103

5. The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

6. Claims 1, 5-9, 11, 13-16, and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Lin et al. (On the Robustness of Cooperative Multi-Agent Reinforcement Learning, IEEE, published 2020, pages 62-68) in view of Haley et al. (U.S. Patent Application Pub. No. US 20240046204 A1) and further in view of Lin 2019 et al. (Tactics of Adversarial Attack on Deep Reinforcement Learning Agents, arXiv, published 2019, pages 1-7).

Claim 1: Lin teaches a method comprising: training a dynamics model of a cooperative multi-agent reinforcement learning (c-MARL) (i.e. cooperative Multi-Agent Reinforcement Learning (c-MARL); Section I. Introduction; page 62) environment of a c-MARL system (i.e. fig. 2, the attacker first trains a policy network with reinforcement learning to find a wrong action it should encourage the victim agent to take; pages 62-64); processing a perturbation optimizer to generate a state perturbation of the c-MARL environment, based on the dynamics model (i.e. we use gradient-based targeted adversarial example crafting to perturb the victim's observation such that it will take the action returned by the adversarial policy in the first step; Sections II, III, pages 62-65), the processing including perturbation that transitions (i.e. we use gradient-based targeted adversarial example crafting to perturb the victim's observation such that it will take the action returned by the adversarial policy in the first step; Sections II, III, pages 62-65) the c-MARL system towards a targeted failure state (i.e. fig. 2, the attacker first trains a policy network with reinforcement learning to find a wrong action it should encourage the victim agent to take; pages 62-64); selecting one or more agents of the c-MARL system (i.e. Our threat model considers a single vulnerable agent î whose observation o_t^î can be modified by the adversary. We assume the adversary has already predetermined this agent; Section II, pages 63-64) as having enhanced vulnerability of transitioning the c-MARL system towards the targeted failure state (i.e. Despite the success of the above methods in reducing R, the victim agent's negative effect upon the team performance remains limited. To help improve the robustness of c-MARL, it is important to understand the worst-case behavior of a single agent. We propose RL-based optimization to more effectively select an action for the victim agent; Section III, page 64); and attacking (i.e. By attacking a single agent, our attack method has highly negative impact on the overall team reward, reducing it from 20 to 9.4. This results in the team's winning rate to go down from 98.9% to 0%; abs, page 62) the c-MARL system based on the state perturbation and the selected one or more agents (i.e. we use gradient-based targeted adversarial example crafting to perturb the victim's observation such that it will take the action returned by the adversarial policy in the first step; Sections II, III, pages 62-65); and evaluating a level of success of the attack to improve robustness of the c-MARL system (i.e. to assess their robustness in order to develop tailored defense mechanisms that promote security and safety in c-MARL … to improve robustness of c-MARL systems, adversarial manipulations of both team agents and the environment need to be taken into account during training; Section V, pages 62, 67).

Lin does not explicitly teach a computing system; the processing including selecting the state perturbation that transitions the c-MARL system towards a targeted failure state. However, Haley teaches a computing system (i.e. figs. 1, 3a, an exemplary embodiment can include hardware, such as, one or more processors configured to read and execute software programs. Such programs (and any associated data) can be stored and/or retrieved from one or more storage devices. The hardware can also include power supplies, network devices, communications devices, and input/output devices, such devices for communicating with local and remote resources and/or other computer systems. Such embodiments can include one or more computer systems, and are optionally communicatively coupled to one or more additional computer systems that are local or remotely accessed; para. [0023]). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Lin to include the feature of Haley. One would have been motivated to make this modification because reinforcement learning environments, including cooperative multi-agent systems, are conventionally executed on computing systems.

However, Lin 2019 teaches the processing including selecting the state perturbation that transitions the c-MARL system towards a targeted failure state (i.e. The goal of the enchanting attack is to lure the deep RL agent from current state s_t at time step t to a specified target state s_g after H steps. The adversary needs to craft a series of adversarial examples s_{t+1} + δ_{t+1}, ..., s_{t+H} + δ_{t+H} for this attack; Section 3, pages 2-5). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combination of Lin and Haley to include the feature of Lin 2019. One would have been motivated to make this modification because it makes adversarial attacks on RL stealthier and more efficient.

Claim 5: Lin, Haley, and Lin 2019 teach the method of claim 1. Lin further teaches wherein the dynamics model comprises: a dynamics model of a reinforcement learning (RL) environment (i.e. Techniques from Inverse Reinforcement Learning and model-based RL can be applied here; Section V) of the c-MARL system (i.e. fig. 2, the attacker first trains a policy network with reinforcement learning to find a wrong action it should encourage the victim agent to take; pages 62-64).

Claim 6: Lin, Haley, and Lin 2019 teach the method of claim 5. Lin further teaches wherein the dynamics model is configured to generate a predicted subsequent state of the c-MARL system as a function of a current state and actions performed in the c-MARL system (i.e. s_t is the state, and s_{t+1} is determined by a transition function T in the environment; Section III).
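Claims 5 and 6, as mapped above, describe the dynamics model functionally: it predicts the subsequent state from the current state and the actions performed. A minimal sketch of such a model follows, assuming a small feed-forward network trained by supervised regression on logged transitions (supervised learning is what Huangfu is separately cited for under claim 2 below); the architecture and training loop are illustrative assumptions, not the claimed design.

```python
# Illustrative-only dynamics model for the claim 5/6 mapping: predicts the
# next environment state from the current state and the agents' joint
# actions. Architecture and hyperparameters are assumptions.
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),  # predicted subsequent state
        )

    def forward(self, state, joint_actions):
        return self.net(torch.cat([state, joint_actions], dim=-1))

def fit_dynamics(model, transitions, epochs=10, lr=1e-3):
    """Supervised regression on logged (state, joint_actions, next_state)
    tuples, i.e. learning the transition function T referenced in Lin."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    mse = nn.MSELoss()
    for _ in range(epochs):
        for s, a, s_next in transitions:
            loss = mse(model(s, a), s_next)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```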
Claim 7: Lin, Haley, and Lin 2019 teach the method of claim 5. Lin further teaches wherein the perturbation optimizer is configured to generate a state perturbation based on the predicted subsequent state of the c-MARL system (i.e. we use gradient-based targeted adversarial example crafting to perturb the victim's observation such that it will take the action returned by the adversarial policy in the first step; Sections II, III, pages 62-65) and the targeted failure state of the c-MARL system (i.e. fig. 2, the attacker first trains a policy network with reinforcement learning to find a wrong action it should encourage the victim agent to take; pages 62-64) that is in opposition to a reward function of the c-MARL environment (i.e. By attacking a single agent, our attack method has highly negative impact on the overall team reward, reducing it from 20 to 9.4. This results in the team's winning rate to go down from 98.9% to 0%; abs, page 62).

Claim 8: Lin, Haley, and Lin 2019 teach the method of claim 1. Lin further teaches wherein generating the state perturbation of the c-MARL environment (i.e. we use gradient-based targeted adversarial example crafting to perturb the victim's observation such that it will take the action returned by the adversarial policy in the first step; Sections II, III, pages 62-65) based on the dynamics model is configured for degrading a performance of a trained c-MARL policy of the c-MARL system (i.e. By attacking a single agent, our attack method has highly negative impact on the overall team reward, reducing it from 20 to 9.4. This results in the team's winning rate to go down from 98.9% to 0%; abs, page 62).

Claim 9: Lin, Haley, and Lin 2019 teach the method of claim 1. Lin further teaches wherein the c-MARL system comprises a set of agents, wherein selecting the one or more agents of the c-MARL system (i.e. Our threat model considers a single vulnerable agent î whose observation o_t^î can be modified by the adversary; Section II, pages 63-64) as having enhanced vulnerability (i.e. Despite the success of the above methods in reducing R, the victim agent's negative effect upon the team performance remains limited. To help improve the robustness of c-MARL, it is important to understand the worst-case behavior of a single agent. We propose RL-based optimization to more effectively select an action for the victim agent; Section III, page 64) comprises: identifying one or more of the agents as able to achieve greater adversarial attack performance (i.e. fig. 2, the attacker first trains a policy network with reinforcement learning to find a wrong action it should encourage the victim agent to take; pages 62-64), for the same attack resource budget (i.e. To learn an adversarial policy for the victim agent, we applied the methods from III-A. We trained an adversarial policy for the victim to select suboptimal actions that minimize the total reward. To evaluate the performance of the policy, we directly control the action of the victim based on the output of our adversarial policy. We ran each attack method for 500 games and presented the results in table I. From the results, OW and OWR have the highest negative impact on team reward and win rate, with 100% loss rate, making these two methods the most efficient; Section IV).

Claim 15: Lin, Haley, and Lin 2019 teach the computer program product of claim 11. Lin further teaches wherein the c-MARL system comprises a set of agents, wherein the program instructions are further executable to: identify one or more of the agents as having enhanced vulnerability (i.e. Our threat model considers a single vulnerable agent î whose observation o_t^î can be modified by the adversary; Section II, pages 63-64); and target the attacking on the one or more of the agents identified (i.e. fig. 2, the attacker first trains a policy network with reinforcement learning to find a wrong action it should encourage the victim agent to take; pages 62-64) as having enhanced vulnerability (i.e. By attacking a single agent, our attack method has highly negative impact on the overall team reward, reducing it from 20 to 9.4. This results in the team's winning rate to go down from 98.9% to 0%; abs, page 62).

Claims 11, 13-14, 16, and 18-20 are similar in scope to claims 1, 5-8, and 15 and are rejected under a similar rationale.
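The claim 9 selection step and the claim 1 evaluating step lend themselves to a simple worst-case search: perturb one agent's observation at a time under the same budget, keep the agent whose corruption hurts team reward most, and score the attack by the induced win-rate drop (Lin et al. report the win rate falling from 98.9% to 0% when a single agent is attacked). The sketch below is hypothetical; `rollout_reward` is an assumed callable, and nothing here is code from the application or the cited art.

```python
# Hypothetical sketch of selecting the agent with "enhanced vulnerability"
# under a fixed attack budget, then evaluating attack success.
from typing import Callable, Optional

def select_vulnerable_agent(rollout_reward: Callable[[Optional[int]], float],
                            n_agents: int):
    """rollout_reward(i) = mean team reward with only agent i's observation
    perturbed under the shared epsilon budget; rollout_reward(None) = clean."""
    baseline = rollout_reward(None)
    drops = {a: baseline - rollout_reward(a) for a in range(n_agents)}
    victim = max(drops, key=drops.get)  # worst-case single agent
    return victim, drops

def attack_success(win_rate_clean: float, win_rate_attacked: float) -> float:
    """Evaluate the attack as the induced drop in team win rate."""
    return win_rate_clean - win_rate_attacked
```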
7. Claims 2-4, 12, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Lin, Haley, Lin 2019, and further in view of Huangfu et al. (U.S. Patent Application Pub. No. US 20230179490 A1).

Claim 2: Lin, Haley, and Lin 2019 teach the method of claim 1. Lin further teaches comprising: performing learning training of the dynamics model (i.e. fig. 2, the attacker first trains a policy network with reinforcement learning to find a wrong action it should encourage the victim agent to take; pages 62-64) based on the c-MARL system (i.e. Techniques from Inverse Reinforcement Learning and model-based RL can be applied here. Another possible defense method involves formulating all agents as potential adversaries during the MARL centralized training, so that an agent can react better to adversarial actions during execution; Section V). Lin does not explicitly teach supervised learning. However, Huangfu teaches performing supervised learning training (i.e. When the AI model is used in supervised learning to improve the MRO, the training data includes a network status and a label; paras. [0201], [0377]). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combination of Lin, Haley, and Lin 2019 to include the feature of Huangfu. One would have been motivated to make this modification because supervised learning was a well-known technique for improving model accuracy and stability.

Claim 3: Lin, Haley, Lin 2019, and Huangfu teach the method of claim 2. Lin further teaches wherein attacking the c-MARL system (i.e. Our threat model considers a single vulnerable agent î whose observation o_t^î can be modified by the adversary; Section II, pages 63-64) comprises injecting the state perturbation into a state input to a multi-agent system of the c-MARL system (i.e. we use gradient-based targeted adversarial example crafting to perturb the victim's observation such that it will take the action returned by the adversarial policy in the first step; Sections II, III, pages 62-65).

Claim 4: Lin, Haley, Lin 2019, and Huangfu teach the method of claim 3. Lin further teaches wherein processing the perturbation optimization formulation (i.e. we use gradient-based targeted adversarial example crafting to perturb the victim's observation such that it will take the action returned by the adversarial policy in the first step; Sections II, III, pages 62-65) further comprises optimizing the state perturbation to transition the c-MARL system to a targeted failure state based on predictions generated by the dynamics model (i.e. fig. 2, the attacker first trains a policy network with reinforcement learning to find a wrong action it should encourage the victim agent to take; pages 62-64), wherein actions taken are in opposition to a reward function of the c-MARL environment (i.e. By attacking a single agent, our attack method has highly negative impact on the overall team reward, reducing it from 20 to 9.4. This results in the team's winning rate to go down from 98.9% to 0%; abs, page 62).

Claims 12 and 17 are similar in scope to claims 2-4 and are rejected under a similar rationale.

8. Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Lin, Haley, Lin 2019, and further in view of Deisenroth et al. (PILCO: A Model-Based and Data-Efficient Approach to Policy Search; published 2011, pages 1-8).

Claim 10: Lin, Haley, and Lin 2019 teach the method of claim 1. Lin further teaches wherein the attack system is configured to detect vulnerability of agents (i.e. Our threat model considers a single vulnerable agent î whose observation o_t^î can be modified by the adversary; Section II, pages 63-64) in the c-MARL system to adversarial attacks in an action space (i.e. fig. 2, the attacker first trains a policy network with reinforcement learning to find a wrong action it should encourage the victim agent to take; pages 62-64). Lin does not explicitly teach a continuous action space. However, Deisenroth teaches a continuous action space (i.e. Pilco achieves unprecedented data efficiency in continuous state-action domains and is directly applicable to physical systems, e.g., robots; pages 1-2). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combination of Lin, Haley, and Lin 2019 to include the feature of Deisenroth. One would have been motivated to make this modification because continuous action spaces were a known alternative to discrete action spaces in reinforcement learning, and applying them would have predictably improved the realism and applicability of the system to physical multi-agent environments.

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Yang et al. (Pub. No. US 20200160168 A1): cooperative multi-goal, multi-agent, multi-stage (CM3) reinforcement learning may include training a first agent using a first policy gradient and a first critic using a first loss function to learn goals in a single-agent environment using a Markov decision process, training a number of agents based on the first policy gradient and a second policy gradient and a second critic based on the first loss function and a second loss function to learn cooperation between the agents in a multi-agent environment using a Markov game to instantiate a second agent neural network, each of the agents instantiated with the first agent neural network in a pre-trained fashion, and generating a CM3 network policy based on the first agent neural network and the second agent neural network. The CM3 network policy may be implemented in a CM3 based autonomous vehicle to facilitate autonomous driving.
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action.

It is noted that any citation to specific pages, columns, lines, or figures in the prior art references and any interpretation of the references should not be considered to be limiting in any way. A reference is relevant for all it contains and may be relied upon for all that it would have reasonably suggested to one having ordinary skill in the art. In re Heck, 699 F.2d 1331, 1332-33, 216 U.S.P.Q. 1038, 1039 (Fed. Cir. 1983) (quoting In re Lemelson, 397 F.2d 1006, 1009, 158 U.S.P.Q. 275, 277 (C.C.P.A. 1968)).

Any inquiry concerning this communication or earlier communications from the examiner should be directed to TAN TRAN whose telephone number is (303)297-4266. The examiner can normally be reached Monday - Thursday, 8:00 am - 5:00 pm MT.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Matt Ell, can be reached on 571-270-3264. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/TAN H TRAN/Primary Examiner, Art Unit 2141

Prosecution Timeline

Sep 23, 2022: Application Filed
Oct 18, 2023: Response after Non-Final Action
Nov 05, 2025: Non-Final Rejection — §101, §103
Dec 12, 2025: Applicant Interview (Telephonic)
Dec 12, 2025: Examiner Interview Summary
Dec 18, 2025: Response Filed
Feb 12, 2026: Final Rejection — §101, §103
Apr 15, 2026: Applicant Interview (Telephonic)
Apr 16, 2026: Examiner Interview Summary

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12594668: BRAIN-LIKE DECISION-MAKING AND MOTION CONTROL SYSTEM (granted Apr 07, 2026; 2y 5m to grant)
Patent 12579420: Analog Hardware Realization of Trained Neural Networks (granted Mar 17, 2026; 2y 5m to grant)
Patent 12579421: Analog Hardware Realization of Trained Neural Networks (granted Mar 17, 2026; 2y 5m to grant)
Patent 12572850: METHOD FOR IMPLEMENTING MODEL UPDATE AND DEVICE THEREOF (granted Mar 10, 2026; 2y 5m to grant)
Patent 12572326: DIGITAL ASSISTANT FOR MOVING AND COPYING GRAPHICAL ELEMENTS (granted Mar 10, 2026; 2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 60%
With Interview: 92% (+31.8%)
Median Time to Grant: 3y 6m
PTA Risk: Moderate
Based on 307 resolved cases by this examiner. Grant probability derived from career allow rate.
