Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
1. This is a first non-final Office Action on the merits for application 17/970,830. Claims 1-20 are pending and have been examined.
Claim Rejections - 35 USC § 101
2. 35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to a judicial exception (i.e., a law of nature, a natural phenomenon, or an abstract idea) without significantly more.
Claim 18 is drawn to a method (i.e., a process), and claims 1 and 12 are drawn to a system (i.e., a machine/manufacture). As such, claims 1, 12, and 18 are each drawn to one of the statutory categories of invention.
Claims 1-20 are directed to reinforcement-learning-based agents that control distributed applications, and the environments in which they run, based on policy and prediction data and received traces. Specifically, claims 1, 12, and 18 recite a first policy component and a first prediction component, implemented by stored computer instructions that, when executed, control the first controller to receive state information and rewards from the controlled environment, use the received state information to select actions to apply to the controlled environment by lookahead planning using the first policy component and the first prediction component, use the received state information and rewards to generate traces, receive policy-update information, and update the policy component and prediction component with the received policy-update information; and a second reinforcement-learning-based controller having a second policy component and a second prediction component, implemented by stored computer instructions, that control the first controller to receive traces, use the received traces to generate one or more policy-component and prediction-component training data sets, and use the one or more policy-component and prediction-component training data sets to train the second policy component and second prediction component. These limitations are grouped within the Methods of Organizing Human Activity grouping of abstract ideas, being similar to the concepts of fundamental economic principles or practices (including hedging, insurance, mitigating risk) and commercial or legal interactions (including agreements in the form of contracts; legal obligations; advertising, marketing or sales activities or behaviors; business relations), under prong one of step 2A of the Alice/Mayo test (See 2019 Revised Patent Subject Matter Eligibility Guidance, 84 Fed. Reg. 50, 52, 54 (January 7, 2019)). Accordingly, the claims recite an abstract idea (See pages 7, 10, Alice Corporation Pty. Ltd. v. CLS Bank International, et al., US Supreme Court, No. 13-298, June 19, 2014; 2019 Revised Patent Subject Matter Eligibility Guidance, 84 Fed. Reg. 50, 53-54 (January 7, 2019)).
The claim limitations fall under Methods of Organizing Human Activity and are grouped as follows:
a first policy component and a first prediction component, implemented by stored computer instructions that, when executed, control the first controller to receive state information and rewards from the controlled environment, use the received state information to select actions to apply to the controlled environment by lookahead planning using the first policy component and the first prediction component, use the received state information and rewards to generate traces, receive policy-update information, and update the policy component and prediction component with the received policy-update information; and a second reinforcement-learning-based controller having a second policy component and a second prediction component implemented by stored computer instructions, which is similar to the concept of advertising, marketing or sales activities or behaviors, and business relations;
control the first controller to receive traces, use the received traces to generate one or more policy-component and prediction-component training data sets, and use the one or more policy-component and prediction-component training data sets to train the second policy component and second prediction component, which is similar to the concept of advertising, marketing or sales activities or behaviors, and business relations. An illustrative sketch of the recited controller behavior follows this grouping.
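For purposes of illustration only, the following minimal Python sketch shows the recited first-controller behavior: receiving state and reward, selecting actions by shallow lookahead planning with a policy and prediction component, recording traces, and accepting policy updates. It is not the applicant's implementation; all names (DeployedController, Trace, env_step) are hypothetical.

    # Illustrative sketch only -- not the applicant's implementation; all names
    # (DeployedController, Trace, env_step, ...) are hypothetical.
    from dataclasses import dataclass, field

    @dataclass
    class Trace:
        state: tuple
        action: str
        reward: float
        next_state: tuple

    @dataclass
    class DeployedController:
        policy: dict = field(default_factory=dict)      # state -> preferred action
        prediction: dict = field(default_factory=dict)  # (state, action) -> (next_state, reward)
        traces: list = field(default_factory=list)

        def select_action(self, state, actions, depth=2):
            # Lookahead planning: score each action a few steps ahead with the
            # prediction component and pick the best.
            def score(s, a, d):
                nxt, r = self.prediction.get((s, a), (s, 0.0))
                if d == 0:
                    return r
                return r + max(score(nxt, b, d - 1) for b in actions)
            return max(actions, key=lambda a: score(state, a, depth))

        def step(self, state, reward, actions, env_step):
            action = self.select_action(state, actions)
            next_state = env_step(state, action)
            # Generate a trace from the received state information and reward.
            self.traces.append(Trace(state, action, reward, next_state))
            return next_state

        def apply_update(self, policy_update, prediction_update):
            # Update the policy and prediction components with received
            # policy-update information.
            self.policy.update(policy_update)
            self.prediction.update(prediction_update)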
This judicial exception is not integrated into a practical application because, when analyzed under prong two of step 2A of the Alice/Mayo test (See 2019 Revised Patent Subject Matter Eligibility Guidance, 84 Fed. Reg. 50, 54-55 (January 7, 2019)), the additional elements of the claims, such as the controller, management system, computer system, computer, data-storage devices, memories, and processors, merely use a computer as a tool to perform an abstract idea and/or generally link the use of a judicial exception to a particular technological environment. Specifically, the controller, management system, computer system, computer, data-storage devices, memories, and processors perform the steps or functions of a first policy component and a first prediction component, implemented by stored computer instructions that, when executed, control the first controller to receive state information and rewards from the controlled environment, use the received state information to select actions to apply to the controlled environment by lookahead planning using the first policy component and the first prediction component, use the received state information and rewards to generate traces, receive policy-update information, and update the policy component and prediction component with the received policy-update information; and a second reinforcement-learning-based controller having a second policy component and a second prediction component, implemented by stored computer instructions, that control the first controller to receive traces, use the received traces to generate one or more policy-component and prediction-component training data sets, and use the one or more policy-component and prediction-component training data sets to train the second policy component and second prediction component. The use of a processor/computer as a tool to implement the abstract idea and/or generally linking the use of the abstract idea to a particular technological environment does not integrate the abstract idea into a practical application because it requires no more than a computer performing functions that correspond to acts required to carry out the abstract idea. The additional elements do not involve improvements to the functioning of a computer or to any other technology or technical field (MPEP 2106.05(a)); the claims do not apply or use the abstract idea to effect a particular treatment or prophylaxis for a disease or medical condition (Vanda Memo); the claims do not apply the abstract idea with, or by use of, a particular machine (MPEP 2106.05(b)); the claims do not effect a transformation or reduction of a particular article to a different state or thing (MPEP 2106.05(c)); and the claims do not apply or use the abstract idea in some other meaningful way beyond generally linking the use of the abstract idea to a particular technological environment, such that the claim as a whole is more than a drafting effort designed to monopolize the exception (MPEP 2106.05(e) and Vanda Memo). Therefore, the claims do not, for example, purport to improve the functioning of a computer, nor do they effect an improvement in any other technology or technical field. Accordingly, the additional elements do not impose any meaningful limits on practicing the abstract idea, and the claims are directed to an abstract idea.
The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception because, when analyzed under step 2B of the Alice/Mayo test (See 2019 Revised Patent Subject Matter Eligibility Guidance, 84 Fed. Reg. 50, 52, 56 (January 7, 2019)), the additional elements of using a controller, management system, computer system, computer, data-storage devices, memories, and processors to perform the steps amount to no more than using a computer or processor to automate and/or implement the abstract idea of reinforcement-learning-based agents that control distributed applications, and the environments in which they run, based on policy and prediction data and received traces. As discussed above, taking the claim elements separately, the controller, management system, computer system, computer, data-storage devices, memories, and processors perform the steps or functions of a first policy component and a first prediction component, implemented by stored computer instructions that, when executed, control the first controller to receive state information and rewards from the controlled environment, use the received state information to select actions to apply to the controlled environment by lookahead planning using the first policy component and the first prediction component, use the received state information and rewards to generate traces, receive policy-update information, and update the policy component and prediction component with the received policy-update information; and a second reinforcement-learning-based controller having a second policy component and a second prediction component, implemented by stored computer instructions, that control the first controller to receive traces, use the received traces to generate one or more policy-component and prediction-component training data sets, and use the one or more policy-component and prediction-component training data sets to train the second policy component and second prediction component. These functions correspond to the actions required to perform the abstract idea. Viewed as a whole, the combination of elements recited in the claims merely recites the concept of reinforcement-learning-based agents that control distributed applications, and the environments in which they run, based on policy and prediction data and received traces. Therefore, the use of these additional elements does no more than employ the computer as a tool to automate and/or implement the abstract idea. The use of a computer or processor to merely automate and/or implement the abstract idea cannot provide significantly more than the abstract idea itself (MPEP 2106.05(f) & (h)). Therefore, the claims are not patent eligible.
Dependent claims 2-11, 13-17, and 19-20 further describe the abstract idea of reinforcement-learning-based agents that control distributed applications, and the environments in which they run, based on policy and prediction data and received traces. Claims 2-11, 13-17, and 19-20 do not include additional elements that are sufficient to amount to significantly more than the judicial exception because, when analyzed under step 2B of the Alice/Mayo test (See 2019 Revised Patent Subject Matter Eligibility Guidance, 84 Fed. Reg. 50, 52, 56 (January 7, 2019)), the additional elements of using a controller, management system, computer system, computer, data-storage devices, memories, and processors to perform the steps amount to no more than using a computer or processor to automate and/or implement the abstract idea. As discussed above, taking the claim elements separately, the controller, management system, computer system, computer, data-storage devices, memories, and processors perform the steps or functions of: wherein the first and second policy components included in the first and second controllers are policy neural networks that each receives a state vector representing state information for the controlled environment at a specific point in time and that each outputs a probability-distribution vector representing the probabilities for selection of each of multiple actions given that the controlled environment occupies the state represented by the input state vector; wherein the first and second prediction components included in the first and second controllers each includes: one or more neural networks that each receives a state vector, representing a state of the controlled environment at a specific point in time, and an action vector, and that each outputs a predicted vector or value returned by the controlled environment to the management-system agent; and a neural network that receives a state vector representing a state of the controlled environment and that outputs an estimated value of the state; wherein each of the first and second prediction components includes: a Q neural network that generates a predicted reward, from an input state vector and action vector, that will be returned by the controlled environment when the action corresponding to the input action vector is executed by the controlled environment while in a state represented by the input state vector; and a T neural network that generates a predicted next state, from an input state vector and action vector, to which the controlled environment will transition following execution of an action corresponding to the input action vector while the controlled environment is in a state represented by the input state vector; wherein the management-system agent uses the received state information to select actions to apply to the controlled environment by lookahead planning, using the first policy component and the first prediction component, by traversing a logical planning tree. These functions correspond to the actions required to perform the abstract idea. Viewed as a whole, the combination of elements recited in the claims merely recites the concept of reinforcement-learning-based agents that control distributed applications, and the environments in which they run, based on policy and prediction data and received traces.
Therefore, the use of these additional elements does no more than employ the computer as a tool to automate and/or implement the abstract idea. The use of a computer or processor to merely automate and/or implement the abstract idea cannot provide significantly more than the abstract idea itself (MPEP 2106.05(f) & (h)). Therefore, the claims are not patent eligible.
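As a purely illustrative aid (not claim language; the tiny linear layers below are stand-ins, not the claimed architectures), the dependent-claim network roles can be sketched as follows: a policy network mapping a state vector to an action probability distribution, a Q network predicting reward from a state-action pair, a T network predicting the next state, a value network estimating state value, and lookahead planning that traverses a small planning tree using them.

    # Illustrative stand-ins for the claimed network roles; not the claimed architectures.
    import numpy as np

    rng = np.random.default_rng(0)
    STATE_DIM, N_ACTIONS = 4, 3

    W_pi = rng.normal(size=(N_ACTIONS, STATE_DIM))             # policy network weights
    W_q = rng.normal(size=(STATE_DIM + N_ACTIONS,))            # Q network: (state, action) -> reward
    W_t = rng.normal(size=(STATE_DIM, STATE_DIM + N_ACTIONS))  # T network: (state, action) -> next state
    W_v = rng.normal(size=(STATE_DIM,))                        # value network: state -> value

    def policy(s):
        # State vector in, probability-distribution vector over actions out.
        logits = W_pi @ s
        e = np.exp(logits - logits.max())
        return e / e.sum()

    def one_hot(a):
        v = np.zeros(N_ACTIONS)
        v[a] = 1.0
        return v

    def predicted_reward(s, a):
        return float(W_q @ np.concatenate([s, one_hot(a)]))

    def predicted_next_state(s, a):
        return W_t @ np.concatenate([s, one_hot(a)])

    def value(s):
        return float(W_v @ s)

    def lookahead(s, depth):
        # Traverse a small logical planning tree, scoring branches with Q, T, and V.
        if depth == 0:
            return value(s)
        return max(predicted_reward(s, a) + lookahead(predicted_next_state(s, a), depth - 1)
                   for a in range(N_ACTIONS))

    s0 = rng.normal(size=STATE_DIM)
    planned = max(range(N_ACTIONS), key=lambda a: predicted_reward(s0, a)
                  + lookahead(predicted_next_state(s0, a), 1))
    print("action probabilities:", policy(s0), "planned action:", planned)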
Prior Art
3. In reference to independent claims 1, 12, and 18, the Office is unaware of any reference that teaches, individually or in any reasonable combination with other references, the combination of limitations found in the claims of:
“a first controller, having a first policy component and a first prediction component, implemented by computer instructions, stored in one or more of one or more memories and one or more data-storage devices within a first computer system, that, when executed by one or more processors of the first computer system, control the first controller to receive state information and rewards from the controlled environment, use the received state information to select actions to apply to the controlled environment by lookahead planning, using the first policy component and the first prediction component, use the received state information and rewards to generate traces, receive policy-update information, and update the policy component and prediction component with the received policy-update information; and a second reinforcement-learning-based controller having a second policy component and a second prediction component, implemented by computer instructions, stored in one or more of one or more memories and one or more data-storage devices within a second computer system, that, when executed by one or more processors of the second computer system, control the first controller to receive traces, use the received traces to generate one or more policy-component and prediction-component training data sets, and use the one or more policy-component and prediction-component training data sets to train the second policy component and second prediction component.”
Examiner note: none of the references, alone or in combination, teaches the combination of limitations of claims 1, 12, and 18, and no reference was found that teaches that combination, especially the central idea of the present disclosure, which is directed to reinforcement-learning-based management-system agents that control distributed applications and the infrastructure environments in which they run. Management-system agents are initially trained in simulated environments and specialized training environments before being deployed to live, target distributed computer systems, where they operate in a conservative mode in which they do not explore the control-state space or attempt to learn better policies and value functions, but instead produce traces that are collected and stored for subsequent use. Each deployed management-system agent is associated with a twin training agent that uses the collected traces produced by the deployed management-system agent for optimizing its policy and value functions. To further ensure safe operational control of the environment, the management-system agents employ lookahead planning, action budgets, and action constraints to forestall issuance, by management-system controllers, of potentially deleterious actions.
The examiner's stated reason is the failure to find art that reads on the limitations listed below or that teaches the concept of a first controller, having a first policy component and a first prediction component, implemented by computer instructions, stored in one or more of one or more memories and one or more data-storage devices within a first computer system, that, when executed by one or more processors of the first computer system, control the first controller to receive state information and rewards from the controlled environment, use the received state information to select actions to apply to the controlled environment by lookahead planning, using the first policy component and the first prediction component, use the received state information and rewards to generate traces, receive policy-update information, and update the policy component and prediction component with the received policy-update information; and a second reinforcement-learning-based controller having a second policy component and a second prediction component, implemented by computer instructions, stored in one or more of one or more memories and one or more data-storage devices within a second computer system, that, when executed by one or more processors of the second computer system, control the first controller to receive traces, use the received traces to generate one or more policy-component and prediction-component training data sets, and use the one or more policy-component and prediction-component training data sets to train the second policy component and second prediction component; nor the concept of reinforcement-learning-based management-system agents that control distributed applications and the infrastructure environments in which they run. Management-system agents are initially trained in simulated environments and specialized training environments before being deployed to live, target distributed computer systems, where they operate in a conservative mode in which they do not explore the control-state space or attempt to learn better policies and value functions, but instead produce traces that are collected and stored for subsequent use. Each deployed management-system agent is associated with a twin training agent that uses the collected traces produced by the deployed management-system agent for optimizing its policy and value functions. To further ensure safe operational control of the environment, the management-system agents employ lookahead planning, action budgets, and action constraints to forestall issuance, by management-system controllers, of potentially deleterious actions.
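Purely as an illustrative aid (hypothetical names; not the applicant's implementation), the twin arrangement described above, in which the deployed agent only records traces and its twin converts them into training sets, might be sketched as:

    # Illustrative only: traces are (state, action, reward, next_state) tuples
    # recorded by the deployed agent operating without exploration.
    from collections import defaultdict

    def build_training_sets(traces):
        policy_set = [(s, a) for (s, a, r, ns) in traces]                 # policy-component targets
        prediction_set = [((s, a), (r, ns)) for (s, a, r, ns) in traces]  # prediction-component targets
        return policy_set, prediction_set

    def train_tabular_policy(policy_set):
        # Toy "training" for the twin's policy component: per-state action frequencies.
        counts = defaultdict(lambda: defaultdict(int))
        for s, a in policy_set:
            counts[s][a] += 1
        return {s: max(acts, key=acts.get) for s, acts in counts.items()}

    traces = [((0,), "scale_up", 1.0, (1,)), ((1,), "noop", 0.5, (1,))]
    policy_set, prediction_set = build_training_sets(traces)
    print(train_tabular_policy(policy_set))   # {(0,): 'scale_up', (1,): 'noop'}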
The closest references found are similar but do not teach all of the limitations of the claims:
CN115280322A: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for selecting actions to be performed by an agent interacting with an environment in order to cause the agent to perform a task. One of the methods includes: receiving a current observation characterizing a current state of the environment; performing a plurality of planning iterations to generate plan data indicating respective values, for performing the task, of the actions in an action set that the agent can perform starting from the current environment state, wherein performing each planning iteration comprises selecting, based on outputs generated by a dynamics model and a prediction model, a sequence of actions to be performed by the agent starting from the current environment state; and, based on the plan data and in response to the current observation, selecting from the action set an action to be performed by the agent.
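A minimal sketch of the planning loop summarized above, assuming a learned dynamics model and a prediction (value) model supplied as callables; the random rollout policy here is a simplification, not the reference's actual search:

    import random

    def plan(state, actions, dynamics, predict, iterations=50, depth=3):
        # Accumulate per-action value statistics over repeated planning iterations.
        totals = {a: 0.0 for a in actions}
        visits = {a: 0 for a in actions}
        for _ in range(iterations):
            first = random.choice(actions)      # the action sequence starts here
            s, ret, discount, a = state, 0.0, 1.0, first
            for _ in range(depth):
                s, r = dynamics(s, a)           # learned model, not the real environment
                ret += discount * r
                discount *= 0.99
                a = random.choice(actions)
            ret += discount * predict(s)        # bootstrap with the prediction model
            totals[first] += ret
            visits[first] += 1
        return max(actions, key=lambda a: totals[a] / max(visits[a], 1))

    # Toy usage: drive a scalar state toward zero.
    best = plan(0.0, [0, 1],
                dynamics=lambda s, a: (s + a - 0.5, -abs(s)),
                predict=lambda s: -abs(s))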
CN113552867A: The embodiments claim a motion-trajectory planning method for a wheeled mobile device (such as an autonomous vehicle). The wheeled mobile device establishes a mapping between driving-behavior indices and motion trajectories based on demonstration data. After the mapping is established, when the wheeled mobile device is driving on a real-world road and receives current surrounding-environment information from its sensors, it determines the most suitable target driving-behavior index according to a preset method (such as reinforcement learning). Because the target driving-behavior index is determined from the surrounding-environment information, a driving behavior suited to the surroundings can be selected; combining this with the mapping, a real-time motion trajectory corresponding to the target driving-behavior index can be planned. Because the mapping is established from demonstration data, the generated real-time trajectory resembles human driving behavior, enabling the handling of complex interactive scenes.
CN112470123A: The invention claims methods, systems, and apparatus for generating an action selection policy (ASP) for an execution device. A method comprises: obtaining the ASP in a current iteration; obtaining a first reward corresponding to each action in a current state; calculating a first reward of the current state based on each action's first reward and the ASP; calculating a regret value for each action based on the difference between the action's first reward and the first reward of the current state; calculating an incremental ASP based on each action's regret value in the current iteration; calculating a second reward of the current state based on the incremental ASP; determining the ASP for the next iteration based on the second reward of the current state; and controlling the actions of the execution device according to the ASP.
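The regret-based update summarized above resembles CFR-style regret matching; the following is a minimal, hedged sketch of one such iteration (made-up action rewards, not the reference's algorithm verbatim):

    # Hedged sketch of one regret-matching iteration with invented rewards.
    def regret_matching_step(action_rewards, strategy):
        state_value = sum(strategy[a] * r for a, r in action_rewards.items())
        regrets = {a: r - state_value for a, r in action_rewards.items()}
        positive = {a: max(g, 0.0) for a, g in regrets.items()}
        norm = sum(positive.values())
        n = len(action_rewards)
        # Next-iteration strategy is proportional to positive regret (uniform if none).
        return ({a: g / norm for a, g in positive.items()} if norm > 0
                else {a: 1.0 / n for a in action_rewards})

    strategy = {"raise": 1 / 3, "call": 1 / 3, "fold": 1 / 3}
    rewards = {"raise": 2.0, "call": 1.0, "fold": 0.0}
    print(regret_matching_step(rewards, strategy))   # mass shifts toward 'raise'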
US20200244707A1: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training a policy neural network having a plurality of policy parameters and used to select actions to be performed by an agent to control the agent to perform a particular task while interacting with one or more other agents in an environment. In one aspect, the method includes: maintaining data specifying a pool of candidate action selection policies; maintaining data specifying respective matchmaking policy; and training the policy neural network using a reinforcement learning technique to update the policy parameters. The policy parameters define policies to be used in controlling the agent to perform the particular task.
US20190332922A1: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training a policy neural network used to select actions to be performed by a reinforcement learning agent interacting with an environment. In one aspect, a method includes obtaining path data defining a path through the environment traversed by the agent. A consistency error is determined for the path from a combined reward, first and last soft-max state values, and a path likelihood. A value update for the current values of the policy neural network parameters is determined from at least the consistency error. The value update is used to adjust the current values of the policy neural network parameters.
US20190259051A1: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for prediction of an outcome related to an environment. In one aspect, a system comprises a state representation neural network that is configured to: receive an observation characterizing a state of an environment being interacted with by an agent and process the observation to generate an internal state representation of the environment state; a prediction neural network that is configured to receive a current internal state representation of a current environment state and process the current internal state representation to generate a predicted subsequent state representation of a subsequent state of the environment and a predicted reward for the subsequent state; and a value prediction neural network that is configured to receive a current internal state representation of a current environment state and process the current internal state representation to generate a value prediction.
US20180032863A1: Methods, systems and apparatus, including computer programs encoded on computer storage media, for training a value neural network that is configured to receive an observation characterizing a state of an environment being interacted with by an agent and to process the observation in accordance with parameters of the value neural network to generate a value score. One of the systems performs operations that include training a supervised learning policy neural network; initializing initial values of parameters of a reinforcement learning policy neural network having a same architecture as the supervised learning policy network to the trained values of the parameters of the supervised learning policy neural network; training the reinforcement learning policy neural network on second training data; and training the value neural network to generate a value score for the state of the environment that represents a predicted long-term reward resulting from the environment being in the state.
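A compact sketch of the three-stage pipeline summarized above, with a logistic-regression "policy" and a least-squares "value" standing in for the reference's neural networks (assumptions for illustration, not the reference's implementation):

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(256, 8))               # observations of environment states
    y = (X[:, 0] > 0).astype(float)             # stand-in expert/outcome labels

    # (1) Train a supervised-learning policy (logistic regression by gradient descent).
    w_sl = np.zeros(8)
    for _ in range(200):
        p = 1.0 / (1.0 + np.exp(-X @ w_sl))
        w_sl -= 0.1 * X.T @ (p - y) / len(y)

    # (2) Initialize the RL policy (same "architecture") from the supervised weights.
    w_rl = w_sl.copy()
    # ... policy-gradient updates on self-play data would follow here ...

    # (3) Train a value function to predict long-term reward from a state.
    w_v = np.linalg.lstsq(X, y, rcond=None)[0]  # least-squares stand-in for value training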
US20170140270A1: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for asynchronous deep reinforcement learning. One of the systems includes a plurality of workers, wherein each worker is configured to operate independently of each other worker, and wherein each worker is associated with a respective actor that interacts with a respective replica of the environment during the training of the deep neural network.
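The asynchronous-worker pattern summarized above can be sketched as below, with a toy bandit standing in for each worker's environment replica and a locked running mean standing in for shared network parameters (both assumptions):

    import random
    import threading

    shared = {"estimate": 0.0, "n": 0}   # stand-in for shared network parameters
    lock = threading.Lock()

    def worker(env_replica, steps=1000):
        # Each worker operates independently against its own environment replica.
        for _ in range(steps):
            reward = env_replica()
            with lock:  # apply this worker's update to the shared parameters
                shared["n"] += 1
                shared["estimate"] += (reward - shared["estimate"]) / shared["n"]

    threads = [threading.Thread(target=worker, args=(lambda: random.gauss(1.0, 0.1),))
               for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(shared["estimate"])   # converges near 1.0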
NPL Reference
4. The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. The NPL “Distributed Systems” describes “Distributed systems form a rapidly changing field of computer science. Since the previous edition of this book, exciting new topics have emerged such as peer-to-peer computing and sensor networks, while others have become much more mature, like Web services and Web applications in general. Changes such as these required that we revised our original text to bring it up-to-date. This second edition reflects a major revision in comparison to the previous one. We have added a separate chapter on architectures reflecting the progress that has been made on organizing distributed systems. Another major difference is that there is now much more material on decentralized systems, in particular peer-to-peer computing. Not only do we discuss the basic techniques, we also pay attention to their applications, such as file sharing, information dissemination, content-delivery networks, and publish/subscribe systems. Next to these two major subjects, new subjects are discussed throughout the book. For example, we have added material on sensor networks, virtualization, server clusters, and Grid computing. Special attention is paid to self-management of distributed system...”.
Pertinent Art
5. The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Reference CN117010476A teaches a similar invention, describing: a multi-agent autonomous decision method based on deep reinforcement learning, applied in the technical field of multi-agent deep reinforcement learning and aimed at the problem that existing technology cannot fully use the experience data generated in the decision process. The invention collects the experience generated by interaction between an agent and the environment, optimizes the environment's immediate reward with a reward-reshaping module, and caches it in a local experience replay pool; it then merges the historical experience data from the local experience replay pool into a global experience replay pool (PT-Buffer), maintained with a two-level priority experience queue; training samples are drawn from the PT-Buffer via a probability sum tree to train the target policy network and the global Critic network model; finally, the updated autonomous decision model directly realizes the autonomous decisions of multiple agents from the input local environment observations, so as to complete the associated multi-agent task.
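The "probability sum tree" sampling mentioned above is consistent with proportional prioritized experience replay; the following is a minimal sum-tree sketch under that assumption (names hypothetical, not the reference's code):

    import random

    # Leaves hold priorities; internal nodes hold subtree sums, so sampling
    # proportional to priority is a walk down by cumulative sum.
    class SumTree:
        def __init__(self, capacity):
            self.capacity = capacity
            self.tree = [0.0] * (2 * capacity)

        def update(self, index, priority):
            i = index + self.capacity
            self.tree[i] = priority
            i //= 2
            while i >= 1:
                self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]
                i //= 2

        def sample(self):
            r = random.uniform(0.0, self.tree[1])   # tree[1] is the total priority
            i = 1
            while i < self.capacity:
                i *= 2                              # descend to left child
                if r > self.tree[i]:
                    r -= self.tree[i]
                    i += 1                          # go right instead
            return i - self.capacity

    tree = SumTree(4)
    for idx, pri in enumerate([1.0, 3.0, 0.5, 0.5]):
        tree.update(idx, pri)
    # Index 1 is drawn ~60% of the time, proportional to its priority.
    print(tree.sample())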
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to TAREK ELCHANTI whose telephone number is (571) 272-9638. The examiner can normally be reached on a flex schedule, Mon-Thur 7:00-7:00 and Fri 7:00-4:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Waseem Ashraf can be reached on (571) 270-3948. The fax phone number for the organization where this application or proceeding is assigned is (571) 273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/TAREK ELCHANTI/Primary Examiner, Art Unit 3621B