Prosecution Insights
Last updated: April 19, 2026
Application No. 18/345,709

APPARATUS AND METHOD FOR EXPLORING OPTIMIZED TREATMENT PATHWAY THROUGH MODEL-BASED REINFORCEMENT LEARNING BASED ON SIMILAR EPISODE SAMPLING

Final Rejection — §101, §103
Filed: Jun 30, 2023
Examiner: KOLOSOWSKI-GAGER, KATHERINE
Art Unit: 3687
Tech Center: 3600 — Transportation & Electronic Commerce
Assignee: ELECTRONICS AND TELECOMMUNICATIONS RESEARCH INSTITUTE
OA Round: 2 (Final)
Grant Probability: 26% (At Risk)
Expected OA Rounds: 3-4
Time to Grant: 4y 3m
Grant Probability With Interview: 60%

Examiner Intelligence

Career Allow Rate: 26% (95 granted / 358 resolved; -25.5% vs TC avg)
Interview Lift: +33.6% for resolved cases with interview
Typical Timeline: 4y 3m average prosecution; 54 applications currently pending
Career History: 412 total applications across all art units

Statute-Specific Performance

§101: 35.0% (-5.0% vs TC avg)
§103: 33.9% (-6.1% vs TC avg)
§102: 14.5% (-25.5% vs TC avg)
§112: 12.5% (-27.5% vs TC avg)
Tech Center averages are estimates. Based on career data from 358 resolved cases.

Office Action

Rejections: §101, §103
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION

This action is in reference to the communication as filed by Applicant on 2 SEPT 2025. Amendments to claims 1, 3-5, 7, 9-11 have been entered and considered. Claims 1-11 are present and have been examined.

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows: Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-11 are rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter. As explained below, the claims are directed to an abstract idea without significantly more.

Step One: Is the claim directed to a process, machine, manufacture, or composition of matter? YES. With respect to claims 1-11, the independent claims 1 and 11 recite an apparatus and a system, i.e., an article of manufacture and a machine, each of which is a statutory category of invention.

Step 2A – Prong One: Is the claim directed to a law of nature, a natural phenomenon (product of nature), or an abstract idea? YES. With respect to claims 1-11, the independent claims (claims 1, 7, 11) are directed, in part, to:

wherein the treatment pathway exploring predict an optimized treatment method and optimized timing of treatment capable of maximizing the expected value of the reward of the target patient and provide the patient state prediction with the current state of the target patient and the treatment method to obtain the next state of the target patient and the reward; and

wherein the state value evaluation model learns a function Q for predicting the expected value of the reward, wherein the state value evaluation module is configured to learn the function Q such that an expected value of the Q function using the virtual EMR episode is induced to be low, and an expected value of the function Q using the extracted EMR episode is induced to be high; and

wherein parameters of a model for state value evaluation are updated such that the function Q is reinforced in a direction of selection by minimizing a difference between the expected value of the function Q using the virtual EMR episode and the expected value of the function Q using the extracted EMR episode.

These claim elements are considered to be abstract ideas because they are directed to mental processes, in that the claims ensconce concepts performed in the human mind, including observation, evaluation, judgment, and opinion functions. Predicting a treatment method, receiving a current state and treatment method…, receiving an EMR, calculating a similarity between the current/target patient…, predicting an expected value of a reward…, predicting an optimized treatment method…, and generating a new EMR based on that information are all examples of concepts performed in the human mind as identified above. If a claim limitation, under its broadest reasonable interpretation, covers a concept performed in the human mind, then it falls into the "mental processes" category. The claims are further directed to mathematical concepts, i.e., mathematical relationships, formulas, equations, and/or calculations.
Calculating a similarity between a group of variables, arguably predicting/generating reward values based on variables, and using the learned function Q to determine expected values and then compare/set those expected values are each examples of mathematical concepts as identified above. If a claim limitation, under its broadest reasonable interpretation, covers mathematical relationships, formulas, equations, and/or calculations, then it falls into the "mathematical concepts" category. Accordingly, the claim recites an abstract idea.

Step 2A – Prong Two: Does the claim recite additional elements that integrate the judicial exception into a practical application? NO. This judicial exception is not integrated into a practical application. In particular, the claims recite additional elements: claims 1 and 11 both recite a plurality of "modules" performing the claimed steps as identified above. Claim 11 recites a "treatment pathway exploring device" as well as a "patient state prediction device." Each of the claims indicates that the medical records are "electronic," and some of the EMR episodes or related elements are "virtual." In the interest of compact prosecution, Examiner notes that claims 1, 7, and 11 each at least nominally recite sending and receiving data.

The modules in claims 1 and 11, as well as the devices in claim 11, are recited at a high level of generality and as such amount to no more than adding the words "apply it" to the judicial exception, mere instructions to implement the abstract idea on a computer, mere use of the computer as a tool to perform the abstract idea (see MPEP 2106.05(f)), or a general link between the use of the judicial exception and a particular technological field of use/computing environment (see MPEP 2106.05(h)). Similarly, Examiner finds that the use of the electronic records and virtual episodes is also merely a general link between the use of the judicial exceptions and a particular technological environment or field of use (see MPEP 2106.05(h)). Examiner also notes that the sending and receiving of data is found to be merely insignificant extra-solution activity added to the identified judicial exceptions (see MPEP 2106.05(g)). Examiner finds these elements do not constitute an improvement to the functioning of the computer or any other technology or technical field as claimed (see MPEP 2106.05(a)), nor any other application or use of the judicial exception in some meaningful way beyond a general link between the use of the judicial exception and a particular technological environment (see MPEP 2106.05(e)). Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.

Step 2B: Does the claim recite additional elements that amount to significantly more than the judicial exception? NO. The independent claims are additionally directed to the following claim elements: claims 1 and 11 both recite a plurality of "modules" performing the claimed steps as identified above. Claim 11 recites a "treatment pathway exploring device" as well as a "patient state prediction device." Each of the claims indicates that the medical records are "electronic," and some of the EMR episodes or related elements are "virtual." In the interest of compact prosecution, Examiner notes that claims 1, 7, and 11 each at least nominally recite sending and receiving data.
When considered individually, the "modules," the "devices," the electronic records and virtual EMR episodes, as well as the sending/receiving of data, contribute only generic recitations of technical elements to the claims. It is readily apparent, for example, that the claim is not directed to any specific improvements of these elements. Examiner looks to Applicant's specification:

[0022] In the detailed description, components described with reference to the terms "unit", "module", "block", "-er or -or", etc. and function blocks illustrated in drawings will be implemented with software, hardware, or a combination thereof. Illustratively, the software may be a machine code, firmware, an embedded code, and application software. For example, the hardware may include an electrical circuit, an electronic circuit, a processor, a computer, an integrated circuit, integrated circuit cores, a pressure sensor, an inertial sensor, a microelectromechanical system (MEMS), a passive element, or a combination thereof.

[0034] FIG. 3 conceptually illustrates an operation of a treatment pathway exploring device 110 of FIG. 2. As described above with reference to FIG. 2, the operation of the treatment pathway exploring device 110 may be performed by an interaction between an episode sampling module 111, a state value evaluation module 112, a treatment method learning module 113, a virtual episode generation module 114, and a patient state prediction device 120.

[0025] FIG. 1 is a block diagram illustrating a configuration of a system 100 for exploring an optimized treatment pathway according to an embodiment of the present disclosure. Referring to FIG. 1, the system 100 may include a treatment pathway exploring device 110 and a patient state prediction device 120. The treatment pathway exploring device 110 may receive an electronic medical record (EMR) of patients from an EMR database (DB) 10. The EMR may be records about examination and treatment of all patients who visit a medical institution, which are stored in the form of a time series over time. The treatment pathway exploring device 110 may convert such an EMR into a time series episode which is in the form of a patient state S, a treatment method A, a reward R, a time T, and a next state S′ of a patient. The EMR converted into the episode is referred to as an EMR episode.

[0027] …Functions of the treatment pathway exploring device 110 and the patient state prediction device 120 may be implemented using hardware, including combination logic, which executes instructions stored in any type of memory (e.g., a flash memory, such as a NAND flash memory or a low-latency NAND flash memory, a persistent memory (PMEM), such as a cross-grid non-volatile memory, a memory with bulk resistance variation, a phase change memory (PCM), or the like, or a combination thereof), sequential logic, one or more timers, counters, registers, state machines, one or more complex programmable logic devices (CPLDs), a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), a central processing unit (CPU), such as complex instruction set computer (CISC) processors such as x86 processors and/or a reduced instruction set computer (RISC) such as ARM processors, a graphics processing unit (GPU), a neural processing unit (NPU), a tensor processing unit (TPU), an accelerated processing unit (APU), or the like, or a combination thereof, software, or a combination thereof.
These passages, as well as others, make it clear that the invention is not directed to a technical improvement. When the claims are considered individually and as a whole, the additional elements noted above appear to merely apply the abstract concept to a technical environment in a very general sense, i.e., a generic computer receives information from another generic computer, processes the information, and then sends information back. The most significant elements of the claims, that is, the elements that really outline the inventive elements of the claims, are set forth in the elements identified as an abstract idea. The fact that the generic computing devices are facilitating the abstract concept is not enough to confer statutory subject matter eligibility. Examiner finds no specific paragraphs to reference regarding the electronic medical records, virtual episodes, and/or the general sending and receiving elements, and therefore further concludes these are again at best examples of a general link between the use of the judicial exception and a particular technological environment or field of use (MPEP 2106.05(h)), and/or insignificant extra-solution activity as discussed above.

As per dependent claims 2-6 and 8-10: Dependent claims 2, 3, 4, 5, 6, 8, 9, and 10 are not directed to any additional abstract ideas and are also not directed to any additional non-abstract claim elements beyond those identified above. Rather, these claims offer further descriptive limitations of elements found in the independent claims and addressed above, such as the types or outcomes of a mental process, as well as the mathematical relationships/calculations used. While these descriptive elements may provide further helpful context for the claimed invention, they do not serve to confer subject matter eligibility on the invention, since their individual and combined significance still does not outweigh the abstract concepts at the core of the claimed invention.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
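For orientation before the art rejections: the specification excerpt quoted in the Step 2B analysis ([0025]) describes converting an EMR into time-series episodes of the form (patient state S, treatment method A, reward R, time T, next state S′). A minimal sketch of that structure, with every name hypothetical rather than taken from the application:

    # Hypothetical rendering of the episode structure described in the
    # specification excerpt [0025]: an EMR converted into a time series of
    # (state S, treatment A, reward R, time T, next state S') steps.
    from dataclasses import dataclass

    @dataclass
    class EMREpisodeStep:
        state: list[float]       # patient state S at time T
        action: int              # treatment method A
        reward: float            # reward R observed for taking A in S
        time: float              # time T of the record
        next_state: list[float]  # next patient state S'

    # An EMR episode is then an ordered sequence of such steps.
    episode = [
        EMREpisodeStep(state=[0.7, 1.2], action=2, reward=0.5,
                       time=0.0, next_state=[0.8, 1.0]),
    ]

An EMR episode is then simply an ordered sequence of such steps, which is the shape that the similarity sampling and Q-learning mappings below operate on.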
Claims 1, 2, 4-8, 10, and 11 are rejected under 35 U.S.C. 103 as being unpatentable over Wang (US 20200373017 A1) in view of Dalli et al. (US 20220147876 A1, hereinafter Dalli).

In reference to claim 1: Wang teaches:

An apparatus for exploring an optimized treatment pathway of a target patient, the apparatus comprising a processor and code configured to implement the following:

an episode sampling module configured to receive a virtual electronic medical record (EMR) episode, calculate a similarity between a first current state of the target patient, the target patient corresponding to the received virtual EMR episode, and a second current state of a patient, the patient corresponding to each of a plurality of EMR episodes, extract an EMR episode in which the calculated similarity is highest among the plurality of EMR episodes, and output a pair of the virtual EMR episode and the extracted EMR episode (at least [fig 1 and related text], including [046] "That input will generally relate to a single health-related episode, and could be, for example, a clinical record, or medical claims records. While in some cases, records from multiple patients or doctors may be input from a remote computing device, in other instances, the input received from a remote computing device consists of health-related information of a single patient." At [048-049] "the second step is modeling a decision-making process as a probabilistic dynamic state-transition process comprising a series of states and actions, and generate a kernel function that captures the similarity between any two states, which is an approximate divergence function between their future distributions of pathways (it can be viewed as a restricted diffusion distance customized to the specific probabilistic process and computed from data), based on sequential clinical records…sequential clinical records will contain records from multiple patients (preferably at least 100 patients, more preferably at least 1,000 patients who went through identical or highly similar treatment episodes"; at [0187-0188, fig 7B and related text] "state-action" pairs are collected/extracted);

a state value evaluation module configured to predict an expected value of a reward when performing a specific treatment method for the first current state of the target patient, based on the pair of the virtual EMR episode and the extracted EMR episode (at least [fig 1 and related text including 052] "The method also includes determining 120 an empirical transition model and an empirical reward function based on the sequential clinical records and providing statistical confidence bounds for the estimated transition model and its predictions. Said differently, the optimized state feature/kernel functions are used to determine the empirical transition model and the reward function, in either parametric and nonparametric representations. From there, the method involves generating 125 a sequence of converging solutions towards an optimal value function and an optimal decision policy that maps the optimal decision for each possible state. Then, the method includes obtaining 130 one or more policies based on at least one of the converging solutions, the one or more policies can be used to prescribe actions sequentially along a series of states. In some embodiments, the one or more policies are based on an insurance-related factor. In some embodiments, the one or more policies includes a policy having the lowest overall cost or a policy where each claim has a cost below a predetermined threshold.");

a treatment method learning module configured to predict an optimized treatment method and optimized timing of treatment capable of maximizing the expected value of the reward of the target patient and provide an external prediction model with the first current state of the target patient and the optimized treatment method to obtain a next state and reward of the target patient (at least [047] "It is envisioned that there are typically two types of input. A first type of input is a large amount of data (e.g., a database or data set) for training the machine learning algorithm. Given such input, the algorithm will compute the state feature/kernel function and the optimized low-order empirical transition model. The second type of input is records (clinical records and/or medical claims) about a new patient. In this part, the algorithm will generate predictions and optimal policies for this single patient." At [069-071] "A reinforcement learning-based approach was developed to learn the optimal policy from episodic claims. Reinforcement learning is a branch of artificial intelligence. It is well-suited for the sequential, uncertain and variable nature of clinical decision making. By modeling the knee replacement episode as a Markov decision process, a reinforcement learning pipeline to find the optimal treatment decision at every possible state, that takes its long-term effect into account, was developed." See also [fig 8 and related text]); and

a virtual episode generation module configured to generate a new virtual EMR episode based on the optimized treatment method, the optimized timing of treatment, the next state, and the reward of the target patient (at least [0176] "In this environment, each game frame o.sub.t is a 210×160×3 image. In each interactive step the agent takes in the last 16 frames and preprocesses them to be the input state s.sub.t=ϕ({o.sub.t−i}.sub.i=0.sup.15). The state s.sub.t is an 84×84×4 rescaled, grey-scale image, and is the input to the neural network Q(s.sub.t,⋅;θ). The first convolution layer in the network has 32 filters of size 8 stride 4, the second layer has 64 layers of size 4 stride 2, the final convolution layer has 64 filters of size 3 stride 1, and is followed by a fully-connected hidden layer of 512 units. The output is another fully-connected layer with six units that correspond to the six action values {Q(s.sub.t, a.sub.i;θ)}.sub.i=1.sup.6. The agent selects an action based on these state-action values, repeats the selected action four times, observes four subsequent frames {o.sub.t+i}.sub.i=1.sup.4 and receives an accumulated reward r.sub.t." At [054] "In some embodiments, the method also includes using either one or more optimized policies obtained in the previous step or any user-specified policy to compute and predict the posterior probability distribution of two or more most likely clinical pathways that a patient would go through conditioned on the patient's current records, and/or using either one or more optimized policies obtained in the previous step or any user-specified policy to compute and predict the overall financial cost or other specified outcome and the estimated value's statistical confidence region." At [096] iterative updates, and at [fig 1 and related text] the iterative updating process of parsing/adding records to the EMR).
Wang as cited teaches all of the elements above, but does not specifically teach Q-learning. Dalli, however, does teach:

wherein the state value evaluation model learns a function Q for predicting the expected value of the reward (at least [013-019] "Q-Learning is a traditional and widely used RL method, first introduced in 1989 by Watkins. It is typically used as the baseline for benchmarking RL methods. The goal of Q-Learning is to train an agent which is capable of interacting with the environment such that some function Q is approximated from a stream of data points identified by <s, a, r, s′>; where s represents the state, which is an observation from the environment, a represents the action, r represents the reward from the environment which is a measure of how good the action is, and s′ represents the next state when transitioning in the environment. The Q function is equal to the expected value of the sum of future rewards with some policy Π.");

wherein the state value evaluation module is configured to learn the function Q such that an expected value of the Q function using the virtual episode is induced to be low and an expected value of the function Q using the extracted episode is induced to be high (at least [022] "Gradient descent may be applied to minimize the loss function. Double Q-Learning, in general, may utilize two value functions that are learned by assigning experiences randomly to update one of the two value functions, resulting in two sets of weights. During each update, one set of weights is used to determine the greedy policy and the other to determine its value. Other variants of Deep QL, known as Double Deep Q-Network (DQN), includes using two neural networks to perform the Bellman iteration, one for generating the prediction and another one for generating the target. It is further contemplated that the weights of the second network are replaced with the weights of the first network to perform greedy evaluation of the current policy. This helps in reducing bias which may be introduced by the inaccuracies of the Q network." That is, double Q-learning allows for two sets of tables to be set to a more accurate overall bias in the calculation; see also [016-019] for discussion of standard Q-learning and "optimistic" initial conditions, i.e., higher expected values); and

wherein the parameters of a model for state value evaluation are updated such that the function Q is reinforced in a direction of selection by minimizing a difference between the expected value of the function Q using the virtual episode and the expected value of the function Q using the extracted episode (at least [fig 11 and related text], sample from simulator model 1810 / sample from environment 1880, i.e., virtual and extracted episodes; at [fig 13 and related text] "The difference between the estimated fare amount 6030 and the actual fare amount 6040 is calculated in a difference estimation process 6020, which then outputs the fare estimation difference 6010. The fare estimation difference 6010 is used as environmental learning input to the XRL agent, and may be used in a predictive coding manner to improve the estimation accuracy for subsequent trips.").

Dalli and Wang are analogous, as both references disclose a means of modeling and comparing virtual and actual event tables in order to maximize a reward function.
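The Q-function limitations mapped above describe, in effect, a conservative offline-RL-style objective: the expected value of Q over virtual (model-generated) episodes is induced low while the expected value over extracted (real) EMR episodes is induced high, by minimizing the gap between them. A minimal PyTorch sketch of that gap term follows; the network, batch shapes, and every identifier are illustrative assumptions, not taken from Wang, Dalli, or the application:

    # Illustrative sketch only: push expected Q on virtual episodes down and
    # expected Q on extracted (real) EMR episodes up via a single gap term.
    import torch
    import torch.nn as nn

    class QNetwork(nn.Module):
        """Maps a patient-state vector to Q(s, a) for each discrete treatment a."""
        def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, n_actions),
            )

        def forward(self, state: torch.Tensor) -> torch.Tensor:
            return self.net(state)  # shape: (batch, n_actions)

    def gap_loss(q_net, virtual_s, virtual_a, real_s, real_a):
        # Minimizing E[Q(virtual)] - E[Q(extracted)] lowers the former and
        # raises the latter at once, matching the claim language.
        q_virtual = q_net(virtual_s).gather(1, virtual_a.unsqueeze(1)).mean()
        q_real = q_net(real_s).gather(1, real_a.unsqueeze(1)).mean()
        return q_virtual - q_real

    q_net = QNetwork(state_dim=8, n_actions=4)
    opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)

    # Toy batches standing in for virtual and extracted EMR episode steps.
    vs, va = torch.randn(32, 8), torch.randint(0, 4, (32,))
    rs, ra = torch.randn(32, 8), torch.randint(0, 4, (32,))

    loss = gap_loss(q_net, vs, va, rs, ra)
    opt.zero_grad()
    loss.backward()
    opt.step()

In practice a term like this would typically be added to an ordinary temporal-difference loss; the sketch isolates only the virtual-versus-extracted comparison that the claim recites.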
One of ordinary skill would have found the combination of the Q-learning as taught by Dalli to be obvious to include in the reinforcement learning technique of Wang, as Dalli teaches specifically: "The objective of an RL agent is to select actions to maximize total expected future reward, as defined by the optimality metric on the reward stream…A popular strategy for an agent is to maximize the (discounted) future rewards. Discounting is a mathematical adjustment which caters to environmental stochasticity. RL may be model-based, value-based, policy-based, actor-critic-based, or model-free." That is, RL, and specifically Q-learning, is a "traditional and widely used RL method" (see 003, 016), and as such one would have found the inclusion of Q-learning to be a combination of known methods yielding predictable results.

In reference to claims 2, 8: Wang teaches: wherein the episode sampling module calculates the similarity using any one of a mean square error (MSE) similarity or a cosine similarity (at least [0133-0145] KME modeling to determine similarity).

In reference to claim 4: Wang further teaches: wherein the treatment method learning module includes a real-time treatment method recommendation network configured to receive the current state of the target patient and output the treatment method (at least [076] "In the case of knee replacement, each claim is modeled as a time step. The knee replacement process may last for an indefinite number of time steps. 'Recovery' is modeled as an absorbing terminal state, at which no future transition or cost will be generated. According to the data set, all episodes terminate at the recovery state. In a given episode, ideally speaking, a state s.sub.t is a collection of claims up to the time step t, and an action a.sub.t is picked from all possible prescriptions. At time t, a physician examines the current state s.sub.t∈S of a patient, chooses an action a.sub.t∈A according to his/her own expertise and then the patient moves to the next state s.sub.t+1 according to some probability P(s.sub.t, a.sub.t, s.sub.t+1). Each claim may generate a cost C(s.sub.t, a.sub.t)." See also [0106, 0111]).

In reference to claim 5: Wang further teaches: wherein the treatment method learning module provides a patient state prediction device with the current state and the treatment method with respect to a plurality of time points to obtain a next state of the target patient, the next state corresponding to each of the plurality of time points, predicts a time point when there is a maximum value of the expected value of the reward calculated based on each of the obtained next states and the treatment method as the optimized timing of treatment, and updates the real-time treatment method recommendation network based on the maximum value of the expected value of the reward and predicts a treatment method being output by inputting the current state of the target patient to the updated network as the optimized treatment method (at least [076] as quoted above; at [0106] "as a simple example, one could forecast the pathway from day 25 conditioned on the event that the current diagnosis belongs to the category of osteoarthritis. One graphical representation of this forecasted pathway can be seen in FIG. 3B. Other examples include forecasts of pathways from day 50 or 75, conditioned on the event that the current diagnosis belongs to the category of osteoarthritis, or forecasts of pathways from day 50, conditioned on the event that the current diagnosis belongs to the category of other non-traumatic joint disorders. One can see that the model is able to predict the conditional distributions of future diagnosis and treatments given any particular time and state of the treatment. These forecasts can make physicians better informed while making clinical decisions. In some embodiments, the model also predicts a financial cost for a current treatment, a financial risk for one or more future treatments, number of future hospital visits, hospitalization duration, health condition at the end of episode, other user-specified metrics, or a combination thereof... The method may optionally also includes determining 150 a treatment option for a patient based on the displayed one or more policies and/or pathways. For example, a physician for a sports team could use the method to determine how to help a player on the sports team recover fastest, even if it costs slightly more money than another policy." At [0123] "Preferably, the one or more processors are further configured to predict various outcomes, including a financial cost for a current treatment, a financial risk for one or more future treatments, number of future hospital visits, hospitalization duration, health condition at the end of the episode, other user-specified outcomes or a combination thereof.").

In reference to claim 6: Wang further teaches: wherein the real-time treatment method recommendation network is updated to select a treatment method for maximizing the expected value of the reward among a plurality of treatment methods (at least [0123] as quoted above; see also [0106, 0111]).
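Claims 2 and 8, addressed above, recite calculating similarity using either a mean square error (MSE) similarity or a cosine similarity, with the episode whose similarity is highest being extracted. A short illustrative sketch of those two measures; the state vectors and function names are hypothetical:

    # Illustrative only: the MSE and cosine similarity measures recited in
    # claims 2 and 8, computed over fixed-length patient-state vectors.
    import numpy as np

    def mse_similarity(a: np.ndarray, b: np.ndarray) -> float:
        # Lower MSE means more similar; negate so "higher is more similar".
        return -float(np.mean((a - b) ** 2))

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    target_state = np.array([0.7, 1.2, 0.3])
    candidate_states = [np.array([0.6, 1.1, 0.4]), np.array([2.0, 0.1, 0.9])]

    # "Extract an EMR episode in which the calculated similarity is highest."
    best = max(candidate_states, key=lambda s: cosine_similarity(target_state, s))

Negating the MSE makes both measures point the same way, so "highest similarity" selects an episode under either convention.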
In reference to claim 7: Wang teaches:

A method for exploring an optimized treatment pathway of a target patient, the method comprising:

calculating a similarity between a first current state of the target patient, the target patient corresponding to a received virtual electronic medical record (EMR) episode, and a second current state of a patient, the patient corresponding to each of a plurality of EMR episodes, extracting an EMR episode in which the calculated similarity is highest among the plurality of EMR episodes, and outputting a pair of the virtual EMR episode and the extracted EMR episode (at least [fig 1 and related text], including [046] and [048-049] as quoted for claim 1 above; at [0187-0188, fig 7B and related text] "state-action" pairs are collected/extracted);

predicting an expected value of a reward when performing a specific treatment method for the first current state of the target patient, based on the pair of the virtual EMR episode and the extracted EMR episode (at least [fig 1 and related text including 052] as quoted for claim 1 above);

predicting an optimized treatment method and an optimized timing of treatment capable of maximizing the expected value of the reward of the target patient (at least [047] and [069-071] as quoted for claim 1 above; see also [fig 8 and related text]);

providing an external prediction model with the current state of the target patient and the treatment method to obtain a next state and reward of the target patient (at least [047] and [069-071] as quoted for claim 1 above; see also [fig 8 and related text]); and

generating a new virtual EMR episode based on the treatment method, the timing of treatment, the next state, and the reward (at least [0176], [054], [096] iterative updates, and [fig 1 and related text] the iterative updating process of parsing/adding records to the EMR, all as quoted for claim 1 above).

Wang as cited teaches all of the elements above, but does not specifically teach Q-learning. Dalli, however, does teach the corresponding Q-learning limitations of claim 7 (wherein the state value evaluation model learns a function Q for predicting the expected value of the reward; wherein the state value evaluation module is configured to learn the function Q such that an expected value of the Q function using the virtual episode is induced to be low and an expected value of the function Q using the extracted episode is induced to be high; and wherein the parameters of a model for state value evaluation are updated such that the function Q is reinforced in a direction of selection by minimizing a difference between the expected value of the function Q using the virtual episode and the expected value of the function Q using the extracted episode), on the same citations set forth for claim 1 above (at least Dalli [013-019], [016-019], [022], [fig 11 and related text], and [fig 13 and related text]). Dalli and Wang are analogous for the reasons given above, and the motivation to combine is the same as stated for claim 1: Q-learning is a "traditional and widely used RL method" (see 003, 016), and as such one would have found its inclusion to be a combination of known methods yielding predictable results.

In reference to claim 10: Wang further teaches:

wherein the predicting of the optimized treatment method and the optimized timing of treatment includes: inputting the current state of the target patient to a real-time treatment method recommendation network and outputting the treatment method (at least [076] as quoted for claim 4 above; see also [0106, 0111]);

providing a patient state prediction device with the current state and the treatment method with respect to a plurality of time points to obtain a next state of the target patient, the next state corresponding to each of the plurality of time points (at least [076], [0106], and [0123] as quoted for claim 5 above);

predicting a time point when there is a maximum value of the expected value of the reward calculated based on each of the obtained next states and the treatment method as the optimized timing of treatment (at least [076, 0106, 0123] as cited above);

updating the real-time treatment method recommendation network based on the maximum value of the expected value of the reward (at least [0123] as quoted above; see also [0106, 0111]); and

predicting a treatment method being output by inputting the current state of the target patient to the updated network as the optimized treatment method (at least [076, 0123, 0106]).
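For reference, the baseline Q-learning that Dalli is cited for learns Q from a stream of <s, a, r, s′> tuples, which is exactly the shape of the EMR episode steps discussed above. A bare-bones tabular sketch, assuming small integer state and action spaces purely for illustration:

    # A bare-bones tabular Q-learning update of the kind Dalli describes as
    # the "traditional and widely used" baseline: learn Q from <s, a, r, s'>
    # tuples. States and actions are small integers here, hypothetically.
    from collections import defaultdict

    ALPHA, GAMMA, N_ACTIONS = 0.1, 0.99, 4
    Q = defaultdict(float)  # Q[(state, action)] -> expected future reward

    def q_update(s: int, a: int, r: float, s_next: int) -> None:
        best_next = max(Q[(s_next, a2)] for a2 in range(N_ACTIONS))
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

    # One transition drawn from an episode stream.
    q_update(s=0, a=1, r=0.5, s_next=2)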
In reference to claim 11: Wang teaches:

A system for exploring an optimized treatment pathway, the system comprising a processor and code configured to implement the following: a treatment pathway exploring device configured to predict a treatment method for a target patient based on an electronic medical record (EMR) episode; and a patient state prediction device configured to receive a current state of the target patient and the treatment method and output a next state of the target patient and a reward, wherein the treatment pathway exploring device includes:

an episode sampling module configured to receive a virtual EMR episode, calculate a similarity between the first current state of the target patient, the target patient corresponding to the received virtual EMR episode, and the second current state of a patient, the patient corresponding to each of a plurality of EMR episodes, extract an EMR episode in which the calculated similarity is highest among the plurality of EMR episodes, and output a pair of the virtual EMR episode and the extracted EMR episode (at least [fig 1 and related text], including [046] and [048-049] as quoted for claim 1 above; at [0187-0188, fig 7B and related text] "state-action" pairs are collected/extracted);

a state value evaluation module configured to predict an expected value of a reward when performing a specific treatment method for the first current state of the target patient, based on the pair of the virtual EMR episode and the extracted EMR episode (at least [fig 1 and related text including 052] "The method also includes determining 120 an empirical transition model and an empirical reward function based on the sequential clinical records and providing statistical confidence bounds for the estimated transition model and its predictions…")

Prosecution Timeline

Jun 30, 2023
Application Filed
Feb 03, 2025
Applicant Interview (Telephonic)
Feb 03, 2025
Examiner Interview Summary
Mar 21, 2025
Applicant Interview (Telephonic)
Mar 22, 2025
Examiner Interview Summary
Mar 28, 2025
Response after Non-Final Action
May 31, 2025
Non-Final Rejection — §101, §103
Sep 02, 2025
Response Filed
Dec 16, 2025
Final Rejection — §101, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12499467
PREDICTING THE EFFECTIVENESS OF A MARKETING CAMPAIGN PRIOR TO DEPLOYMENT
Granted Dec 16, 2025 (2y 5m to grant)
Patent 12462273
SYSTEM AND METHOD FOR USING DEVICE DISCOVERY TO PROVIDE ADVERTISING SERVICES
Granted Nov 04, 2025 (2y 5m to grant)
Patent 12462938
MACHINE-LEARNING MODEL FOR GENERATING HEMOPHILIA PERTINENT PREDICTIONS USING SENSOR DATA
Granted Nov 04, 2025 (2y 5m to grant)
Patent 12444507
BAYESIAN CAUSAL INFERENCE MODELS FOR HEALTHCARE TREATMENT USING REAL WORLD PATIENT DATA
Granted Oct 14, 2025 (2y 5m to grant)
Patent 12437315
SYSTEMS AND METHODS FOR DYNAMICALLY DETERMINING EVENT CONTENT ITEMS
Granted Oct 07, 2025 (2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 26%
With Interview: 60% (+33.6%)
Median Time to Grant: 4y 3m
PTA Risk: Moderate
Based on 358 resolved cases by this examiner. Grant probability derived from career allow rate.
