Last updated: May 29, 2026
Application No. 17/566,888
AUTOMATED GENERATION OF AGENT CONFIGURATIONS FOR REINFORCEMENT LEARNING

Non-Final OA §101 §103
Filed: Dec 31, 2021
Examiner: SHALU, ZELALEM W
Art Unit: 2145
Tech Center: 2100 (Computer Architecture & Software)
Assignee: Microsoft Technology Licensing, LLC
OA Round: 3 (Non-Final)
Interview Optional: +20.3% interview lift. An interview has already been conducted in this application's prosecution history. This examiner has a 29% grant rate with a +20.3% interview lift. Since an interview has already been tried, a written response with narrowed claims, based on precedent claim evolution patterns, is recommended.
Based on 110 resolved cases, 2023–2026
Examiner Intelligence

SHALU, ZELALEM W
Career Allowance Rate: 29% (32 granted / 110 resolved), -25.9% vs TC avg
Interview Lift: +20.3% (resolved cases with interview)
Avg Prosecution: 3y 6m (21 currently pending)
Total Applications: 145 across all art units
Statute-Specific Performance

§101: 0.9% (-39.1% vs TC avg)
§103: 86.6% (+46.6% vs TC avg)
§102: 4.1% (-35.9% vs TC avg)
§112: 8.2% (-31.8% vs TC avg)
Based on career data from 110 resolved cases
Office Action

§101 §103
DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

This action is in response to the Application filed on 01/21/2026. Claims 1 and 4-22 are pending in the case.

Applicant Response
In Applicant’s response dated 01/21/2026, Applicant amended Claims 1, 4-6, 9-12, 16, 17, 19 and 20, cancelled claims 2 and 3 and argued against all objections and rejections previously set forth in the Office Action dated 11/06/2025.

Information Disclosure Statement
As required by MPEP 609(c), the Applicant’s submission of the Information Disclosure Statement(s) filed on 02/01/2026 is acknowledged by the examiner, and the cited references have been considered in the examination of the claims now pending.


Continued Examination under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 01/21/2026 has been entered.


Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1 and 4-22 are rejected under 35 U.S.C. 101 because the claimed invention is directed towards an abstract idea without significantly more.

Step 1
	According to the first part of the analysis, in the instant case, the claims fall within one of the four statutory categories (i.e., process, machine, manufacture, or composition of matter).

Regarding Claims 1, 11 and 19,
At step 2A, Prong 1, Does the claim recite a judicial exception?
Claim 1 recites the steps of:
obtaining an event log of events representing reactions of an environment to first actions taken by an agent … (This step involves receiving and logging data and is understood to be a data analysis process that can be performed with pen and paper, i.e., evaluation.)
… determining respective probabilities that, given the context associated with the events in the event log, different alternative reward functions would have also caused the agent to take the first actions represented in the event log; (This step involves mathematical modeling and is understood to be a recitation of a mental process or mathematical concept, i.e., math/evaluation.)
based at least on the respective probabilities, predicting performance of the different alternative reward functions for an evaluation metric; (This step involves mathematical modeling and is understood to be a recitation of a mental process or mathematical concept, i.e., math/evaluation.)
based at least on the predicted performance of the different alternative reward functions for the evaluation metric, selecting a second reward function from the different alternative reward functions; (This step involves predicting probability based on user data and is understood to be a recitation of a mental process or mathematical correlation, i.e., math/evaluation.)
configuring the agent according to the second reward function, the second reward function causing the agent to adapt internal parameters of the agent according to the second reward function and take second actions in the environment based at least on the second reward function. (This step involves a mathematical process and is understood to be a recitation of a mental process, i.e., math/evaluation.)
The claim recites a judicial exception, namely a mathematical concept applied in the field of machine learning. A person can collect data and evaluate options using metrics or criteria based on the log data, which falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claims recite an abstract idea.

Step 2A prong 2: Does the claim recite additional elements? Do those additional elements, individually and in combination, integrate the judicial exception into a practical application?
Further, the claim does not recite any additional element which could integrate this abstract idea into a practical application, because the additional elements recited consist of:
reactions of an environment to actions taken by an agent, i.e., the agent is a software policy in reinforcement learning, and using an ML model to make decisions is not a meaningful technical improvement; the agent is used as a tool to perform the abstract idea step of generating an output (see MPEP 2106.05(f)), and
“A system comprising: a processor; and a storage medium storing instructions which, when executed by the processor, cause the system to” (claim 11), which is a generic computer component on which to implement the abstract idea (see MPEP 2106.05(f));
“A computer-readable storage medium storing instructions which, when executed by a computing device, cause the computing device to perform acts comprising” (claim 19), which is a generic computer component on which to implement the abstract idea (see MPEP 2106.05(f)).
The additional elements as disclosed above, alone or in combination, do not integrate the judicial exception into a practical application, as they are generic computer functions in combination with limitations that generally link the use of the judicial exception to a particular technological environment or field of use and that are implemented to perform the abstract idea disclosed above. Thus, the claim is directed towards the abstract idea.

Step 2B: Do the additional elements, considered individually and in combination, amount to significantly more than the judicial exception?
	No, as shown above with respect to integration of the abstract idea into a practical application, the additional elements of:
reactions of an environment to actions taken by an agent, i.e., the agent is a software policy in reinforcement learning, and using an ML model to make decisions is not a meaningful technical improvement; the agent is used as a tool to perform the abstract idea step of generating an output (see MPEP 2106.05(f)), and
“A system comprising: a processor; and a storage medium storing instructions which, when executed by the processor, cause the system to” (claim 11), which is a generic computer component on which to implement the abstract idea (see MPEP 2106.05(f));
“A computer-readable storage medium storing instructions which, when executed by a computing device, cause the computing device to perform acts comprising” (claim 19), which is a generic computer component on which to implement the abstract idea (see MPEP 2106.05(f)).
The additional elements as disclosed above, alone or in combination, do not integrate the judicial exception into a practical application, as they are generic computer functions in combination with limitations that generally link the use of the judicial exception to a particular technological environment or field of use and that are implemented to perform the abstract idea disclosed above. Thus, the claim is directed towards the abstract idea.
Thus, the claims are not patent eligible. Mere instructions to apply an exception using generic computer components cannot provide an inventive concept. Neither can insignificant extra-solution activity. All of these additional elements as generically claimed are thus considered well-understood, routine, and conventional. Therefore, these limitations, taken alone or in combination, do not integrate the abstract idea into a practical application or recite significantly more than the abstract idea.
Thus, these independent claims are not patent eligible.
	The dependent claims respectively recite a judicial exception in limitations of:
“wherein predicting the performance of the different alternative agent configurations comprises: determining, from the event log, predicted aggregate values of the evaluation metric for the plurality of alternative reward functions.” (claim 4),
“wherein determining the predicted aggregate values of the evaluation metric comprises: for each particular event in the event log: determining a value of the particular event for the evaluation metric, the value being determined based on a particular reaction of the environment to a particular action taken in a particular context by the agent; and weighting the value of the particular event to obtain weighted values of the evaluation metric for the different alternative reward functions, the weighting being based on corresponding probabilities that the agent would have given to the particular action with the different alternative reward functions relative to a probability that the agent actually gave to the particular action according to the first reward function; and aggregating the weighted values of each particular event for each of the different alternative agent configurations, the aggregating resulting in the predicted aggregate values of the evaluation metric.” (claim 5),
“determining the value of the particular event based on a function for the evaluation metric.” (claim 6),
“wherein the function maps the actions and the context to the values of the evaluation metric.” (claim 7),
“populating a data structure with predicted aggregate values of a plurality of evaluation metrics for the different alternative reward functions.” (claim 8),
“wherein the data structure comprises a table with rows representing different agent configurations associated with the different alternative reward functions and columns representing different evaluation metrics.” (claim 9),
“outputting a graphical representation of the data structure; and identifying the second reward function based at least on user input directed to the graphical representation of the data structure.” (claim 10),
“wherein the instructions which, when executed by the processor, cause the system to: adapt the internal parameters of the agent by using the selected reward function to evaluate reactions of an environment to the further actions taken by the agent based on context describing the environment.” (claim 12),
“wherein the agent comprises a linear model that determines a probability density function of expected rewards for different actions based on the selected reward function.” (claim 13),
“wherein the agent randomly samples from the probability density function and, in at least some instances, chooses an action that does not have the highest expected reward.” (claim 14),
“the agent comprising a contextual bandit.” (claim 15),
“wherein the actions comprise recommending electronic items, the reactions indicate whether users selected the recommended electronic items, and the context comprises information about the users.” (claim 16),
“changing a size of a playout buffer that stores voice over internet protocol packets for video calls.” (claim 17),
“wherein the reactions indicate whether a playout buffer became empty during the video calls and the context indicates network jitter during the video calls.” (claim 18),
“receiving, from the application, a reaction of the environment to the second action; determining a reward value for the second action based at least on the reaction and the second reward function; and updating internal parameters of the agent based at least on the reward value.” (claim 20),
“replaying the events in the event log using the different alternative reward functions to determine the predicted action probabilities for the different alternative reward functions; weighting the events in the event log based at least on a ratio between the predicted action probabilities associated with the different alternative reward functions and log agent probabilities determined by the agent for the first actions according to the first reward function; and predicting the performance of the different alternative reward functions based at least on the weighting.” (claim 21),
“wherein the predicted action probabilities are determined and the second reward function is selected prior to deploying the agent with the second reward function.” (claim 22).
	These additional limitations (in claims 4-10, 12-19 and 20-22) also constitute concepts performed in the human mind which fall within the “Mental Processes” groupings of abstract ideas.
	This judicial exception is not integrated into a practical application. The additional elements “computer readable medium comprising: computer program code” (in claims 4-10, 12-19 and 20-22) all amount to no more than adding insignificant extra-solution activity or specifications related to data gathering, data input, or data transmittal. These additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The dependent claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional elements of “non-transitory computer readable medium comprising: computer program code” are again insignificant extra-solution activity steps that cannot provide an inventive concept. All of these additional elements as generically claimed are considered well-understood, routine, and conventional.
Therefore, these limitations, taken alone or in combination, do not integrate the abstract idea into a practical application or recite significantly more than the abstract idea. Thus, all of the dependent claims are also not patent eligible.
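
For readers unfamiliar with the technique recited in claims 5 and 21, the weighting and aggregating steps resemble an inverse-propensity-style off-policy estimate. The following is a minimal sketch under that assumption and is not taken from the application or the cited references; the names LoggedEvent, ips_estimate and select_reward_function, and the idea of representing each alternative reward function by a callable returning an action probability, are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass
class LoggedEvent:
    context: Sequence[float]  # features describing the environment
    action: int               # action the deployed agent actually took
    logged_prob: float        # probability the deployed agent gave that action
    metric_value: float       # evaluation-metric value of the observed reaction


# Hypothetical: probability that an agent using a candidate reward function
# would have taken `action` in `context`.
ActionProb = Callable[[Sequence[float], int], float]


def ips_estimate(events: List[LoggedEvent], candidate_prob: ActionProb) -> float:
    """Weight each event's metric value by the probability ratio, then average."""
    if not events:
        raise ValueError("event log is empty")
    total = 0.0
    for e in events:
        if e.logged_prob <= 0.0:
            raise ValueError("logged probability must be positive")
        weight = candidate_prob(e.context, e.action) / e.logged_prob
        total += weight * e.metric_value
    return total / len(events)


def select_reward_function(
    events: List[LoggedEvent], candidates: Dict[str, ActionProb]
) -> Tuple[str, Dict[str, float]]:
    """Return the candidate with the highest predicted metric, plus all scores."""
    scores = {name: ips_estimate(events, prob) for name, prob in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores
```

In this sketch the selection happens entirely from the log, before any candidate is deployed, which is the ordering claim 22 describes.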
	

Examiner Comments
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1 and 4-22 are rejected under 35 U.S.C. 103 as being unpatentable over Munawar (US 20190385091 A1, 2019-12-19) in view of Faust (Pub. No.: US 20210325894 A1, Pub. Date: 2021-10-21) and in further view of TERRA (Pub. No.: US 20240119300 A1, Pub. Date: 2024-04-11).

Regarding independent Claim 1, 
Munawar teaches a computer-implemented method comprising:
obtaining an event log of events representing reactions of an environment to first actions taken by an agent (see Munawar: Fig.3, [0004], “obtaining, from an environment, a given experience that includes an action, a state and a reward (i.e. event log of events). The method further includes storing the given experience in an experience buffer responsive to a value of the reward included in the given experience exceeding a first threshold experience into an event buffer (i.e. event log).”), the agent having selected the first actions based at least on context associated with the events (see Munawar: Fig.3, [0041], “Every experience {q, a, r} is inserted into the experience buffer 381, except the experiences where the agent dies (e.g., falling in the hole). If a similar event (same q) exist in the event buffer 382 with different action or reward, then this experience is also inserted into the event buffer 382.” … “extract a local state q from the environment 385 to obtain {q, s, r}.”), and a first reward function employed by the agent (see Munawar: Fig.3, [0041], “Every experience {q, a, r} is inserted into the experience buffer 381, except the experiences where the agent dies (e.g., falling in the hole). If a similar event (same q) exist in the event buffer 382 with different action or reward, then this experience is also inserted into the event buffer 382.”).

Munawar does not explicitly teach the method wherein:
determining respective probabilities that, given the context associated with the events in the event log, different alternative reward functions would have also caused the agent to take the first actions represented in the event log;
based at least on the respective probabilities, predicting performance of the different alternative reward functions for an evaluation metric;
based at least on the predicted performance of the different alternative reward functions for the evaluation metric, selecting a second reward function from the different alternative reward functions; and 
configuring the agent according to the second reward function, the second reward function causing the agent to adapt internal parameters of the agent according to the second reward function and take second actions in the environment based at least on the second reward function.

However, Faust teaches the system wherein:
determining respective probabilities that, given the context associated with the events in the event log, different alternative reward functions would have also caused the agent to take the first actions represented in the event log (see Faust: Fig.1, [0009], “determining a given set of reward weights to use in generating parameterized atomic rewards to be processed by a critic network during training of the trained policy network. Determining the given set of reward weights includes: selecting a given actor network and a given critic network; selecting a plurality of sets of reward weights, including the given set of reward weights; performing a plurality of reward-shaping reinforcement learning trials, each using a respective instance of the given actor network and the given critic network, and using a respective one of the sets of reward weights, where performing each of the reward-shaping reinforcement learning trials.” … [0039], “This can be formalized as learning a policy that maximizes the probability of reaching the true objective, G,”);
based at least on the respective probabilities, predicting performance of the different alternative reward functions for an evaluation metric (see Faust: Fig.1, [0011], “evaluation metrics each indicate performance of a respective one of the instances of the given actor network with respect to only a task objective reward for a robot navigation task. In some of those implementations, the robot navigation task is one of a point-to-point navigation task and a path following navigation task.” … [0039], “This can be formalized as learning a policy that maximizes the probability of reaching the true objective, G,”);
based at least on the predicted performance of the different alternative reward functions for the evaluation metric, selecting a second reward function from the different alternative reward functions (see Faust: Fig.1, [0018], “selecting the plurality of sets of reward weights includes selecting an initial subset of the sets of reward weights, and selecting an additional subset of the initial subset after performing an initial subset of the reward-shaping reinforcement learning trials using the initial subset of the sets of reward weights.”).
Because both Munawar and Faust are in the same or a similar field of endeavor of inverse reinforcement learning using agents, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention, to modify the teaching of Munawar to include a system that determines respective probabilities based at least on different alternative reward functions of the different alternative agent configurations, as taught by Faust. One would be motivated to make such a combination in order to improve the adaptability and scalability of the inverse reinforcement learning model.

Munawar and Faust do not teach the system wherein:
configuring the agent according to the second reward function, the second reward function causing the agent to adapt internal parameters of the agent according to the second reward function and take second actions in the environment based at least on the second reward function.

However, TERRA teaches the system wherein:
configuring the agent according to the selected agent configuration (see TERRA: Fig.2, [0034], “configuring a reinforcement learning agent to perform an efficient reinforcement learning procedure, wherein the reinforcement learning agent comprises a model trained using a machine learning process (i.e. selected agent configuration) to determine actions to be performed by the reinforcement learning agent.”), the second reward function causing the agent to adapt internal parameters of the agent according to the second reward function (see TERRA: Fig.2, [0035], “The reinforcement learning agent receives a reward based on whether the action changes the system in compliance with the objective (e.g. towards the preferred state), or against the objective (e.g. further away from the preferred state). The reinforcement learning agent therefore adjusts parameters in the system with the goal of maximizing the rewards received.”) and take further actions in the environment based at least on the selected reward function (see TERRA: Fig.4, [0095], “the RL agent performs step 204 described above and uses the NN to determine an action to perform in the communications network, based on the values of the sensor data (i.e. the set of features). The action is selected based on the policy that is applied to it and the RL agent performs the determined action. The action affects the environment.”).
Because Munawar, Faust and TERRA are in the same or a similar field of endeavor of reinforcement learning using agents, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention, to modify the teaching of Munawar to include a system that configures the agent according to the selected agent configuration and adapts internal parameters of the agent according to the selected reward function, as taught by TERRA. After modification of Munawar, the reinforcement learning exploration that exploits past experiences for critical events can incorporate TERRA's teaching of configuring a reinforcement learning agent to perform an efficient reinforcement learning procedure. One would be motivated to make such a combination in order to improve the adaptability and scalability of RL model training (see TERRA [0003]).
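
As an illustration of the configuring limitation discussed above (an agent whose internal parameters adapt according to whichever reward function it is configured with), a minimal sketch follows. It is an assumption-laden toy and does not reproduce the agent of the application, Munawar, Faust, or TERRA; the class BanditAgent, its methods, the learning rate, and the dictionary-shaped reaction are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical: a reward function maps an observed reaction to a scalar reward.
RewardFn = Callable[[Dict[str, float]], float]


class BanditAgent:
    """Toy agent keeping one value estimate per action (its internal parameters)."""

    def __init__(self, n_actions: int, reward_fn: RewardFn, lr: float = 0.1):
        self.values: List[float] = [0.0] * n_actions
        self.reward_fn = reward_fn
        self.lr = lr

    def configure(self, reward_fn: RewardFn) -> None:
        """Swap in a newly selected reward function."""
        self.reward_fn = reward_fn

    def act(self) -> int:
        """Greedy choice over current value estimates (no exploration here)."""
        return max(range(len(self.values)), key=lambda a: self.values[a])

    def update(self, action: int, reaction: Dict[str, float]) -> float:
        """Score the reaction with the configured reward function and adapt."""
        reward = self.reward_fn(reaction)
        self.values[action] += self.lr * (reward - self.values[action])
        return reward
```

After `configure` is called, every later `update` scores reactions with the new reward function, so the same reaction can move the value estimates differently than it did before.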


Regarding Claim 4,     
Munawar, Faust and TERRA teach all the limitations of Claim 1. Munawar further teaches the system wherein:
predicting the performance of the different alternative agent configurations (see Munawar: Fig.3, [0047], “At block 325, insert the similar events {q, a, r} where the agent has died into the event buffer 382. In this way, the system can use the information in a future scenario to help the agent or another agent survive the similar events.” i.e. predicting performance) comprises: determining, from the event log, predicted aggregate values of the evaluation metric for the plurality of alternative reward functions (see Munawar: Fig.3, [0045], “For a given experience (denoted by {local state, action, reward}, that is, {q, a, r}), determine if the agent has died. If so, then proceed to block 320. Otherwise, proceed to block 330. The condition of the agent dying can be considered a first threshold on the reward for the given experience, where if the agent has died, then the first threshold is considered to be exceeded.”).

Regarding Claim 5, 
Munawar, Faust and TERRA teach all the limitations of Claim 4. Munawar further teaches the method wherein:

determining the predicted aggregate values of the evaluation metric (see Munawar: Fig.1, [0032], “The exploitation is performed in a usual manner. During exploration, if the current state is found in the event buffer, then the actions in the event buffer are sampled with a fixed probability (ρ). However, after a given number of steps (β), the use of the event buffer stops but the random exploration continues. Therefore, during training, the next action is determined either by using the learned model (exploitation), or random exploration, or by sampling the actions for similar events in the event buffer.”), comprises: for each particular first event in the event log: 
determining a value of the particular event for the evaluation metric, the value being determined based on a particular reaction of the environment to a particular action taken in a particular context by the agent (see Munawar: Fig.3, [0045], “For a given experience (denoted by {local state, action, reward}, that is, {q, a, r}), determine if the agent has died. If so, then proceed to block 320. Otherwise, proceed to block 330. The condition of the agent dying can be considered a first threshold on the reward for the given experience, where if the agent has died, then the first threshold is considered to be exceeded.”);
weighting the value of the particular event to obtain weighted values of the evaluation metric for the different alternative reward functions, the weighting being based on corresponding probabilities that the agent would have given to the particular action with the different alternative reward functions relative to a probability that the agent actually gave to the particular action according to the first reward function (see Munawar: Fig.3, [0042], “a probability of the sampling event buffer during exploration, (β) denotes when to stop using the event buffer (in this case, it is the maximum number of steps), ϵ.sub.min denotes a minimum probability of exploration, ϵ.sub.max denotes a maximum probability of exploration, N denotes a size of the experience buffer, and M denotes a size of the event buffer.”); and
aggregating the weighted values of each particular event for each of the different alternative agent configurations, the aggregating resulting in the predicted aggregate values of the evaluation metric (see TERRA: Fig.4, [0100], “The rewards modelling module performs step 206 whereby the feature importance information is used to adjust the reward given to the RL agent for the action. In this embodiment a scaling is calculated based on whether the feature is a correct feature with which to have determined the action and the normal reward modelling is then multiplied by the scaling value.”).
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention, to modify the teaching of Munawar to include a system that aggregates the weighted values of each particular event for each alternative agent configuration to obtain the predicted aggregate values of the evaluation metric, as taught by TERRA. One would be motivated to make such a combination in order to improve model optimization by automating reinforcement learning optimization with less human intervention.

Regarding Claim 6, 
Munawar, Faust and TERRA teach all the limitations of Claim 5. Munawar further teaches the system wherein:
determining the value of the particular event based on a function for the evaluation metric (see Munawar: Fig.3, [0042], “provide next action {a} to the environment 385. Thus, during training, the next action {a} is determined by using (i) the learned model (exploitation), or (ii) random exploration, or (iii) by sampling the actions for similar events in the event buffer 382.”).

Regarding Claim 7,  
Munawar, Faust and TERRA teach all the limitations of Claim 6. Munawar further teaches the system wherein:
the function maps the actions and the context to the values of the evaluation metric (see Munawar: Fig.4, [0059], “during inference, store all experiences and their respective frequencies of occurrence (with reward).”).

Regarding Claim 8,    
Munawar, Faust and TERRA teach all the limitations of Claim 4. Munawar further teaches the system wherein:
populating a data structure with predicted aggregate values of a plurality of evaluation metrics for the different alternative reward functions (see Munawar: Fig.10, [0106], “workloads and functions which may be provided from this layer include: mapping and navigation 1091; software development and lifecycle management 1092; virtual classroom education delivery 1093; data analytics processing 1094; transaction processing 1095; and improved reinforcement learning exploration by exploiting past experiences for critical events 1096.”).

Regarding Claim 9,
Munawar, Faust and TERRA teach all the limitations of Claim 8. Munawar further teaches the system wherein:
the data structure comprises a table with rows representing the different alternative agent configurations associated with the different alternative reward functions and columns representing different evaluation metrics (see Munawar: Fig.10, [0103], “Hardware and software layer 1060 includes hardware and software components. Examples of hardware components include: mainframes 1061; RISC (Reduced Instruction Set Computer) architecture-based servers 1062; servers 1063; blade servers 1064; storage devices 1065; and networks and networking components 1066. In some embodiments, software components include network application server software 1067 and database software 1068.”). Examiner notes that a tabular format with rows and columns is a standard data structure used to store data.
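
To make the data structure of claims 8 and 9 concrete, the following is a minimal sketch of a table with one row per agent configuration and one column per evaluation metric. It is illustrative only and is not drawn from the application or from Munawar; the function names build_metric_table and best_config and the nested-dictionary input shape are assumptions.

```python
from typing import Dict, List, Optional, Tuple

Row = List[object]


def build_metric_table(
    predictions: Dict[str, Dict[str, float]]
) -> Tuple[List[str], List[Row]]:
    """Build (header, rows): rows are configurations, columns are metrics.

    `predictions` maps configuration name -> {metric name -> predicted value}.
    Metrics missing for a configuration are stored as None.
    """
    metrics = sorted({m for row in predictions.values() for m in row})
    header = ["configuration"] + metrics
    rows: List[Row] = []
    for cfg in sorted(predictions):
        values: List[Optional[float]] = [predictions[cfg].get(m) for m in metrics]
        rows.append([cfg] + values)
    return header, rows


def best_config(header: List[str], rows: List[Row], metric: str) -> str:
    """Return the configuration with the highest value in the given metric column."""
    col = header.index(metric)
    scored = [r for r in rows if r[col] is not None]
    return max(scored, key=lambda r: r[col])[0]
```

A table like this could back the graphical representation recited in claim 10, with the final choice left to user input rather than to `best_config`.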

Regarding Claim 10,    
Munawar, Faust and TERRA teach all the limitations of Claim 1. Munawar further teaches the system wherein:
outputting a graphical representation of the data structure (see Munawar: Fig.4, [0061], “generate and display a visualization by plotting one or more experiences using the respective rewards corresponding to the experiences as weights for visualization.”); and
identifying the second reward function based at least on user input directed to the graphical representation of the data structure (see Munawar: Fig.11, [0105], “Metering and Pricing 1082 provide cost tracking as resources are utilized within the cloud computing environment, and billing or invoicing for consumption of these resources. In one example, these resources may include application software licenses. Security provides identity verification for cloud consumers and tasks, as well as protection for data and other resources. User portal 1083 provides access to the cloud computing environment for consumers and system administrators.”).

Regarding independent Claim 11,  
Claim 11 is a system claim that has similar/same claim limitations as Claim 1 and is rejected under the same rationale.

Regarding Claim 12, 
Munawar and TERRA teach all the limitations of Claim 11. Munawar further teaches the system wherein:
adapt the internal parameters of the agent by using the selected reward function to evaluate reactions of the environment to the further actions taken by the agent based on context describing the environment (see Munawar: Fig.3, [0053], “sample the replay buffer 383 and update model parameters for the model 384.”)

Regarding Claim 13,    
Munawar, Faust and TERRA teach all the limitations of Claim 12. Munawar further teaches the system wherein:
the agent comprises a linear model that determines a probability density function of expected rewards for different actions based on the selected reward function (see Munawar: Fig.1, [0031], “The exploitation is performed in a usual manner. During exploration, if the current state is found in the event buffer, then the actions in the event buffer are sampled with a fixed probability (ρ).”)

Regarding Claim 14,
Munawar, Faust and TERRA teach all the limitations of Claim 13. Munawar further teaches the system wherein:
the agent randomly samples from the probability density function and, in at least some instances, chooses an action that does not have the highest expected reward (see Munawar: Fig.1, [0031], “(ρ) denotes a probability of the sampling event buffer during exploration, (β) denotes when to stop using the event buffer (in this case, it is the maximum number of steps, ϵ.sub.min denotes a minimum probability of exploration, ϵ.sub.max denotes a maximum probability of exploration, N denotes a size of the experience buffer, and M denotes a size of the event buffer.”)
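
For illustration only, and not taken from the application, Munawar, Faust or TERRA: the sampling behavior recited in Claims 13 and 14 can be sketched as below. The sketch assumes each action's expected reward is described by a Gaussian distribution and omits the linear model that would produce those distributions. The function name, means, and standard deviations are hypothetical.

```python
import random


def sample_action(means, stds, rng):
    """Draw one reward sample per action and return the index of the best draw.

    Because the choice is based on random draws rather than on the means,
    an action without the highest expected reward is sometimes selected.
    """
    draws = [rng.gauss(m, s) for m, s in zip(means, stds)]
    return max(range(len(draws)), key=draws.__getitem__)


# Hypothetical expected rewards and uncertainties for three actions.
means = [1.0, 0.8, 0.2]
stds = [0.5, 0.5, 0.5]

rng = random.Random(0)  # fixed seed so the run is repeatable
counts = [0, 0, 0]
for _ in range(2000):
    counts[sample_action(means, stds, rng)] += 1
```

In this sketch the action with the highest mean is chosen most often, while the other actions are still chosen in a share of the draws, which corresponds to the "in at least some instances" language of Claim 14.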

Regarding Claim 15,     
Munawar, Faust and TERRA teach all the limitations of Claim 14. Munawar further teaches the system wherein:
the agent comprising a contextual bandit (see Munawar: Fig.3, [0038], “an exploration stage 392, and a learning loop 393. The populating buffers stage 391 involves blocks 305 through 345 described below. The exploration stage 392 involves blocks 340 through 350 described below. The learning loop 393 involves block 355 described below.”)

Regarding Claim 16,   
Munawar, Faust and TERRA teach all the limitations of Claim 12. Munawar further teaches the system wherein:
the actions comprise recommending electronic items, the reactions indicate whether users selected the recommended electronic items, and the context comprises information about the users (see Munawar: Fig.3, [0082], “present invention can be applied to robotic applications (e.g., warehouse robots, manipulators, etc.), dialog applications (e.g., customer service, etc.), plant or data center optimization, gaming applications, and so forth.”)

Regarding Claim 17,     
Munawar, Faust and TERRA teach all the limitations of Claim 12. Munawar further teaches the system wherein:
changing a size of a playout buffer that stores voice over internet protocol packets for video calls (see Munawar: Fig.1, [0031], “(ρ) denotes a probability of the sampling event buffer during exploration, (β) denotes when to stop using the event buffer (in this case, it is the maximum number of steps, ϵ.sub.min denotes a minimum probability of exploration, ϵ.sub.max denotes a maximum probability of exploration, N denotes a size of the experience buffer, and M denotes a size of the event buffer.”)

Regarding Claim 18,    
Munawar, Faust and TERRA teach all the limitations of Claim 17. TERRA further teaches the system wherein:
the reactions indicate whether a playout buffer became empty during the video calls and the context indicates network jitter during the video calls (see Munawar: Fig.1, [0031], “(ρ) denotes a probability of the sampling event buffer during exploration, (β) denotes when to stop using the event buffer (in this case, it is the maximum number of steps, ϵ.sub.min denotes a minimum probability of exploration, ϵ.sub.max denotes a maximum probability of exploration, N denotes a size of the experience buffer, and M denotes a size of the event buffer.”)

Regarding independent Claim 19,
Claim 19 is a computer-readable storage medium claim that has similar/same claim limitations as Claim 1 and is rejected under the same rationale.

Regarding independent Claim 20,
Munawar, Faust and TERRA teach all the limitations of Claim 19. Munawar further teaches the system wherein:
receiving, from the application, a reaction of the environment to the particular action (see Munawar: Fig.3, [0037], “The method/system 300 involves an experience buffer 381, an event buffer 382, a replay buffer 383, a model 384, and a target environment (hereinafter “environment” in short) 385. The replay buffer 383 is a commonly used construct in reinforcement learning algorithms. The replay buffer 383 stores <state, action, next state, reward and episode end flag> for all the actions.”);
determining a reward value for the particular action based at least on the reaction and the selected reward function (see Munawar: Fig.3, [0050], “determine if exploration is to be performed. If so, then proceed to block 345. Otherwise, proceed to block 350.”); and
updating internal parameters of the agent based at least on the reward value (see Munawar: Fig.3, [0056], “sample the replay buffer 383 and update model parameters for the model 384.”)

Regarding Claim 21,     
Munawar, Faust and TERRA teach all the limitations of Claim 19. Munawar further teaches the system wherein:
replaying the events in the event log using the different alternative reward functions to determine the predicted action probabilities for the different alternative reward functions (see Munawar: Fig.3, [0037], “The replay buffer 383 is a commonly used construct in reinforcement learning algorithms. The replay buffer 383 stores <state, action, next state, reward and episode end flag> for all the actions.”)
weighting the events in the event log based at least on a ratio between the predicted action probabilities associated with the different alternative reward functions and log agent probabilities determined by the agent for the first actions according to the first reward function (see Munawar: Fig.4, [0061], “generate and display a visualization by plotting one or more experiences using the respective rewards corresponding to the experiences as weights for visualization.”); and
predicting the performance of the different alternative reward functions based at least on the weighting (see Munawar: Fig.3, [0055], “provide next action {a} to the environment 385. Thus, during training, the next action {a} is determined by using (i) the learned model (exploitation), or (ii) random exploration, or (iii) by sampling the actions for similar events in the event buffer 382.”)
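
For illustration only, and not taken from the application or the cited references: weighting logged events by a ratio of probabilities, as recited in Claim 21, resembles the standard inverse propensity scoring estimator used for off-policy evaluation. The sketch below assumes that resemblance. The event layout, the `target_prob` and `reward_fn` callables, and all values are hypothetical.

```python
def ips_estimate(events, target_prob, reward_fn):
    """Estimate average reward under an alternative configuration from logged events.

    events: list of (context, action, logged_prob, reaction) tuples, where
        logged_prob is the probability the deployed agent assigned to the action.
    target_prob: callable giving the alternative configuration's probability
        of taking `action` in `context` (hypothetical).
    reward_fn: callable mapping a logged reaction to a reward value (hypothetical).
    """
    total = 0.0
    for context, action, logged_prob, reaction in events:
        # Ratio of predicted probability to logged probability.
        weight = target_prob(context, action) / logged_prob
        total += weight * reward_fn(reaction)
    return total / len(events)


# Hypothetical log: the deployed agent chose "a" or "b" uniformly at random.
events = [
    ("u1", "a", 0.5, 1),
    ("u2", "b", 0.5, 0),
    ("u3", "a", 0.5, 1),
    ("u4", "b", 0.5, 1),
]

# Hypothetical alternative configuration that prefers action "a".
alt_probs = {"a": 0.8, "b": 0.2}
estimate = ips_estimate(
    events,
    target_prob=lambda context, action: alt_probs[action],
    reward_fn=lambda reaction: float(reaction),
)
```

Events whose actions the alternative configuration would favor receive weights above 1, and disfavored events receive weights below 1, so the logged data is reweighted without deploying the alternative configuration.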

Regarding Claim 22,     
Munawar, Faust and TERRA teach all the limitations of Claim 21. Munawar further teaches the system wherein:
the predicted action probabilities are determined and the second reward function is selected prior to deploying the agent with the second reward function (see Munawar: Fig.3, [0032], “The exploitation is performed in a usual manner. During exploration, if the current state is found in the event buffer, then the actions in the event buffer are sampled with a fixed probability (ρ). However, after a given number of steps (β), the use of the event buffer stops but the random exploration continues. Therefore, during training, the next action is determined either by using the learned model (exploitation), or random exploration, or by sampling the actions for similar events in the event buffer.”)

Response to Arguments
Claim Rejections - 35 U.S.C. § 101
The 35 U.S.C. 101 rejection for being directed to non-statutory subject matter has been updated based on Applicant's amendments. Therefore, the 35 U.S.C. 101 rejection is sustained.
The 35 U.S.C. 101 rejection for being directed to a signal per se, which is non-statutory subject matter, has been withdrawn based on Applicant's argument.
Examiner notes that para. [0098] of Applicant's disclosure states: “the term "computer-readable media" can include signals. In contrast, the term "computer-readable storage media" excludes signals.” Accordingly, under the BRI, the claimed storage medium is limited to non-transitory media and does not include signals per se.

Claim Rejections - 35 U.S.C. § 103
Applicant’s arguments with respect to claim amendments have been considered but are moot considering the new combination of references being used in the current rejection. The new combination of references was necessitated by Applicant’s claim amendments. Therefore, the claims are rejected under the new combination of references as indicated above.

Conclusion
	The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
PGPUB NUMBER: US 11790303 B2
INVENTOR-INFORMATION: Minter; Vishal Sean
Title: Analyzing Agent Data and Automatically Delivering Actions
Description: This invention relates generally to computing devices and, more particularly to a system to monitor and process employee data resulting in actions delivered directly to employees or to a person in management.

PGPUB NUMBER: US 11710072 B1
INVENTOR-INFORMATION: Mahasseni; Behrooz
Title: Inverse Reinforcement Learning For User-specific Behaviors
Description: The present disclosure generally relates to tailoring virtual agent (VA) behaviors for a specific user and, in particular, to systems, methods, and methods for inverse reinforcement learning (IRL) for tailoring virtual agent behaviors to a specific user.


Any inquiry concerning this communication or earlier communications from the examiner should be directed to ZELALEM W SHALU whose telephone number is (571) 272-3003. The examiner can normally be reached M-F, 8:00 am to 5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Cesar Paula can be reached at (571) 272-4128. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/Zelalem Shalu/Examiner, Art Unit 2145        


/CESAR B PAULA/Supervisory Patent Examiner, Art Unit 2145
Prosecution Timeline

Sep 02, 2025: Applicant Interview (Telephonic)
Oct 06, 2025: Response Filed
Nov 06, 2025: Final Rejection mailed (§101, §103)
Dec 08, 2025: Applicant Interview (Telephonic)
Dec 09, 2025: Examiner Interview Summary
Jan 21, 2026: Request for Continued Examination
Jan 28, 2026: Response after Non-Final Action
Mar 30, 2026: Non-Final Rejection mailed (§101, §103) (current)
Precedent Cases

Applications granted by this same examiner with similar technology

17/481,160 | Patent 12619879 | TRAINING NEURAL NETWORKS USING LEARNED OPTIMIZERS | 4y 7m to grant | Granted May 05, 2026
17/827,588 | Patent 12477016 | AUTOMATION OF VISUAL INDICATORS FOR DISTINGUISHING ACTIVE SPEAKERS OF USERS DISPLAYED AS THREE-DIMENSIONAL REPRESENTATIONS | 3y 5m to grant | Granted Nov 18, 2025
17/808,093 | Patent 12468969 | METHODS FOR CORRELATED HISTOGRAM CLUSTERING FOR MACHINE LEARNING | 3y 4m to grant | Granted Nov 11, 2025
15/770,665 | Patent 12419611 | PATIENT MONITOR, PHYSIOLOGICAL INFORMATION MEASUREMENT SYSTEM, PROGRAM TO BE USED IN PATIENT MONITOR, AND NON-TRANSITORY COMPUTER READABLE MEDIUM IN WHICH PROGRAM TO BE USED IN PATIENT MONITOR IS STORED | 7y 5m to grant | Granted Sep 23, 2025
17/344,053 | Patent 12153783 | User Interfaces and Methods for Generating a New Artifact Based on Existing Artifacts | 3y 5m to grant | Granted Nov 26, 2024
Based on the 5 most recent grants.
Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 29%
With Interview (+20.3%): 49%
Median Time to Grant: 3y 6m (~0m remaining)
PTA Risk: High
Based on 110 resolved cases by this examiner. Grant probability derived from career allowance rate.