Prosecution Insights
Last updated: May 29, 2026
Application No. 17/808,181

UNDERSTANDING REINFORCEMENT LEARNING POLICIES BY IDENTIFYING STRATEGIC STATES

Non-Final OA §101 §103
Filed: Jun 22, 2022
Examiner: BEAN, GRIFFIN TANNER
Art Unit: 2121
Tech Center: 2100, Computer Architecture & Software
Assignee: International Business Machines Corporation
OA Round: 2 (Non-Final)
Grant Probability: 23% (At Risk)
OA Rounds: 2-3
Est. Remaining: 5m
With Interview: 42%

Examiner Intelligence

Grants only 23% of cases
Career Allowance Rate: 23% (6 granted / 26 resolved; -31.9% vs TC avg)
Strong +19% interview lift
Interview Lift: +19.0% (resolved cases with interview)
Typical timeline
Avg Prosecution: 4y 5m (21 currently pending)
Career history
Total Applications: 65 (across all art units)

Statute-Specific Performance

§101: 9.7% (-30.3% vs TC avg)
§103: 83.5% (+43.5% vs TC avg)
§102: 5.7% (-34.3% vs TC avg)
§112: 1.1% (-38.9% vs TC avg)
Tech Center averages are estimates. Based on career data from 26 resolved cases.

Office Action

§101 §103
DETAILED ACTION

This Action is responsive to claims filed 08/22/2025.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement

The information disclosure statement (IDS) submitted on 06/17/2025 was filed after the mailing date of the Non-Final Office Action on 05/22/2025. The submission is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Status of the Claims

Claims 1, 8, and 15 have been amended. Claims 1-20 are pending.

Response to Arguments

Applicant's arguments, see Pages 10-12, filed 08/22/2025, regarding the 35 U.S.C. 101 Rejection of claims 1-20 have been fully considered but they are not persuasive. The Applicant argues the newly amended limitations of “generating…” and “creating…” integrate the interpretable abstract idea mental process step(s) into a practical application. The Examiner respectfully disagrees with the Applicant.

The “generating…” step is practically performed within the human mind or with the aid of pen and paper (a human mind could imagine said visualization or draw it with pen and paper). The “creating…” step, under the broadest reasonable interpretation, is also practically performed within the human mind or with the aid of pen and paper (a human mind is equipped with the ability to write out an HTML document with pen and paper, for example). Forgoing this interpretation in the context of the claim, the claim language does not link the generated HTML files to the visualization (“with” could mean “of the visualization” or “alongside the visualization”). The limitation only recites generic computer components performing generic computing activity, amounting to instructions to apply the preceding abstract idea mental process steps.
The “creating…” limitation, therefore, is interpreted as mere post-solution/WURC activity, and therefore cannot integrate the abstract idea into a practical application or be significantly more. See the updated 35 U.S.C. 101 Rejection below.

Applicant’s arguments, see Pages 12-15, filed 08/22/2025, regarding the 35 U.S.C. 103 Rejection(s) of claims 1-20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Claim Rejections - 35 USC § 101

The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.

Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to a judicial exception (i.e., a law of nature, a natural phenomenon, or an abstract idea) without significantly more; and because the claims as a whole, considering all claim elements both individually and in combination, do not amount to significantly more than the abstract idea, see Alice Corporation Pty. Ltd. v. CLS Bank International, et al, 573 U.S. (2014). In determining whether the claims are subject matter eligible, the Examiner applies the 2019 USPTO Patent Eligibility Guidelines. (2019 Revised Patent Subject Matter Eligibility Guidance, 84 Fed. Reg. 50, Jan. 7, 2019.)

Step 1: Claims 1-7 recite a computer-implemented method, which falls under the statutory category of a process. Claims 8-14 recite a computer program product, which falls under the statutory category of a manufacture. Claims 15-20 recite a computer system, which falls under the statutory category of a machine.

Step 2A – Prong 1: Claim 1 recites an abstract idea, law of nature, or natural phenomenon.
The limitations of “computing, by one or more computer processors, a maximum likelihood path matrix comprising a respective shortest path between each state in a set of states associated with a model trained with a deep reinforcement learning policy;”, “identifying, by one or more computer processors, strategic states that lie on a plurality of out-paths from a respective meta-state while favoring states that are further away from each other;”, “generating, by one or more computer processors, explanations for the deep reinforcement learning policy based one or more identified meta-states for each state in the set of states and corresponding selected strategic states utilizing the computed maximum likelihood path matrix;”, and “generating, by one or more computer processors, a visualization of clustered strategic states according to policy dynamics by the maximum likelihood path matrix;” under the broadest reasonable interpretation, cover a mental process including an observation, evaluation, judgment or opinion that could be performed in the human mind or with the aid of pencil and paper. Computing a matrix based on shortest paths is practically performed within the human mind or with the aid of pen and paper. Identifying states that lie on out-paths is practically performed within the human mind or with the aid of pen and paper. Generating explanations for policies is practically performed within the human mind or with the aid of pen and paper. Generating a visualization is practically performed within the human mind or with the aid of pen and paper.

Step 2A – Prong 2: The additional elements of claim 1 do not integrate the abstract idea into a practical application. The additional elements “A computer-implemented method”, “computer processors”, “matrix”, and “hypertext markup language files” recited in the claim are recognized as generic computer components recited at a high level of generality.
Although they have and execute instructions to perform the abstract idea itself, this also does not serve to integrate the abstract idea into a practical application as it merely amounts to instructions to "apply it." (See MPEP 2106.04(d)(2) indicating mere instructions to apply an abstract idea does not amount to integrating the abstract idea into a practical application). The additional elements of “a set of states”, “a model trained with a deep reinforcement learning policy”, “explanations for the deep reinforcement learning policy”, “identified meta-states for each state in the set of states”, and “strategic states” are recognized as non-generic computer components, but are recited at a high level of generality and are found to generally link the abstract idea to a particular technological environment or field of use (See MPEP 2106.05(h)). The additional elements recited in the limitation “creating, by one or more computer processors, one or more hypertext markup language files with the visualization.” are found to be mere post-extra-solution activity and do not integrate the abstract idea mental process steps into a practical application (See MPEP 2106.05(g)(iii) first list).

Step 2B: The only limitation on the performance of the described method is a limitation reciting “A computer-implemented method”, “computer processors”, “matrix”, and “hypertext markup language files”. These elements are insufficient to transform a judicial exception to a patentable invention because the recited elements are considered insignificant extra-solution activity (generic computer system, processing resources, links the judicial exception to a particular, respective, technological environment).
The claim thus recites computing components only at a high level of generality such that it amounts to no more than mere instructions to apply the exception using generic computer components; mere instructions to apply an exception using a generic computer component cannot provide an inventive concept (see MPEP 2106.05(f)). The additional elements of “a set of states”, “a model trained with a deep reinforcement learning policy”, “explanations for the deep reinforcement learning policy”, “identified meta-states for each state in the set of states”, and “strategic states” are recognized as non-generic computer components, but are recited at a high level of generality and are found to generally link the abstract idea to a particular technological environment or field of use (See MPEP 2106.05(h)). The additional elements recited in the limitation “creating, by one or more computer processors, one or more hypertext markup language files with the visualization.” are found to be mere well-understood, routine, or conventional activity and do not amount to significantly more (See MPEP 2106.05(d)(II)(iv) third list).

Taken alone or in ordered combination, these additional elements do not amount to significantly more than the above-identified abstract idea. There is no indication that the combination of elements improves the functioning of a computer or improves any other technology. Their collective functions merely provide conventional computer implementation.

For the reasons above, claim 1 is rejected as being directed to non-patentable subject matter under §101. This rejection applies equally to independent claims 8 and 15.

Claim 8 recites similar limitations to claim 1, with the exception of “A computer program product comprising: one or more computer readable storage media and program instructions stored on the one or more computer readable storage media, the stored program instructions comprising: program instructions to…” (generic computer components).
Claim 15 recites similar limitations to claim 1, with the exception of “A computer system comprising: one or more computer processors; one or more computer readable storage media; and program instructions stored on the computer readable storage media for execution by at least one of the one or more processors, the stored program instructions comprising…” (generic computer components).

Dependent Claims:

Claim 2 (claims 9 and 16) recites abstract idea mental process steps “identifying the one or more meta-states”, “computing, by one or more computer processors, an eigen representation of each state from eigen decomposition of matrix;”, “randomly assigning, by one or more computer processors, each state to a meta-state;”, and “computing, by one or more computer processors, a centroid for each assigned state and meta-state.” The additional elements “eigen representation of each state from eigen decomposition of matrix;” and “a centroid” are recognized as non-generic computer components, but are recited at a high level of generality and are found to generally link the abstract idea to a particular technological environment or field of use (See MPEP 2106.05(h)).

Claim 3 (claims 10 and 17) recites “optimizing, by one or more computer processors, the one or more identified meta-states until convergence.” This limitation has been evaluated under Steps 2A – Prong 2 and 2B and found to be instructions to apply said abstract idea mental process steps (See MPEP 2106.05(f)).
Claim 4 (claims 11 and 18) recites abstract idea mental process steps “identified by aggregation based on locality of the states determined by reinforcement learning policy dynamics.”

Claim 5 (claims 12 and 19) recites abstract idea mental process steps “selecting one or more identified strategic states for each identified meta-state employs a greedy selection algorithm.”

Claim 6 (claims 13 and 20) recites abstract idea mental process steps “identifying, by one or more computer processors, one or more bottleneck states that go to different highly rewarding parts of a state space from a particular meta-state while balancing a selection of bottleneck states to be diverse.”

Claim 7 (claim 14) recites abstract idea mental process steps “generating, by one or more computer processors, a visualization of the identified meta-states and strategic states according to deep reinforcement learning policy dynamics.”

Claim Rejections - 35 USC § 103

The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.
Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claim(s) 1-20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Wang, Mengdi (US 2020/0373017 A1), hereinafter Wang; Dalli et al. (US 2022/0147876 A1), hereinafter Dalli; and Chen et al. (Programing by Demonstration: Coping with Suboptimal Teaching Actions, 2003), hereinafter Chen.

In regards to claim 1: The present invention claims: “A computer-implemented method comprising: computing, by one or more computer processors, a maximum likelihood path matrix comprising a respective shortest path between each state in a set of states associated with a model trained with a deep reinforcement learning policy;” Wang teaches “The disclosed system and method can be used to allow health care professionals to determine a treatment pathway in the health care field, based on the RL algorithm's optimal policies.” ([0044], mapping to “associated with a model trained with a deep reinforcement learning policy;”); “Optionally, the policies that are graphically displayed will include at least one policy having the lowest overall cost or a policy where the cost of each claim is below a predetermined threshold.” ([0012], mapping the reduction of cost between states/actions to “comprising a respective shortest path between each state in a set of states”); and “The first step is to compute the empirical transition frequency matrix…from the data set. All the claims are ordered according to claim id in each episode. Every two consecutive claims gives a state-transition pair (s, s'). The entry Fss' is computed to be the frequency for (s, s') to appear in the entire data set.
After obtaining F, one can normalize it into an empirical transition matrix P, such that each row is nonnegative and sums to 1.” ([0087], mapping to “a maximum likelihood path matrix”);

“…based one or more identified meta-states for each state in the set of states and corresponding selected strategic states utilizing the computed maximum likelihood path matrix.” Wang teaches “The optimal partition of state space…is computed such that…In this way, one can aggregate states into clusters and find one best action for all states belonging to the same cluster.” ([0088], mapping clusters to “meta-states”, and a one best action for all states in a cluster to a “strategic state”).

“generating, by one or more computer processors, a visualization of clustered strategic states according to policy dynamics by the maximum likelihood path matrix;” Wang teaches “The devices are also configured to graphically or numerically display the one or more policies relating to the health-related episode.” ([0010]) ([0012], [0111], etc. for more about displaying the policies generated, mapping the display of policies or treatment path(s) to a visualization of related or clustered strategic states based on the selected path).

“creating, by one or more computer processors, one or more hypertext markup language files with the visualization.” See above where Wang teaches the policies being displayed by devices, and [0126]-[0127] particularly for further description on the nature of the devices and displaying of the policies. The Examiner submits a person of ordinary skill in the art at the time of Wang’s filing would have known such visualizations would be displayed by the reading/execution of HTML files, and a cursory search indicates it would have been reasonable to use such HTML files given the above method is implemented on network-communicable devices running network-communicable applications.
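As an illustration only, the construction described in the Wang [0087] passage quoted above (count consecutive state pairs, then row-normalize) might be sketched as follows. The function name `empirical_transition_matrix`, the integer state encoding, and the choice to leave rows with no observed outgoing transitions as all zeros are assumptions made for this sketch; they do not come from Wang or from the application.

```python
import numpy as np

def empirical_transition_matrix(episodes, n_states):
    """Count consecutive (s, s') pairs, then row-normalize the counts.

    `episodes` is a list of state-index sequences (hypothetical encoding).
    Returns the frequency matrix F and the row-normalized matrix P.
    """
    F = np.zeros((n_states, n_states))
    for episode in episodes:
        # Every two consecutive entries give one state-transition pair (s, s').
        for s, s_next in zip(episode, episode[1:]):
            F[s, s_next] += 1

    row_sums = F.sum(axis=1, keepdims=True)
    # Assumption: rows with no observed transitions are left as zeros
    # rather than being assigned a uniform distribution.
    P = np.divide(F, row_sums, out=np.zeros_like(F), where=row_sums > 0)
    return F, P

# Toy data: three short episodes over states 0, 1, 2.
F, P = empirical_transition_matrix([[0, 1, 2], [0, 1, 1], [0, 2]], n_states=3)
```

Every row of `P` that has at least one observed transition is nonnegative and sums to 1, matching the property the quoted passage describes.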
While Wang teaches calculating a shortest path to a given state, Wang fails to explicitly teach “identifying, by one or more computer processors, strategic states that lie on a plurality of out-paths from a respective meta-state while favoring states that are further away from each other;” However, Chen, in a similar field of endeavor of identifying a shortest path to complete a task teaches “The second requirement for accurate parameter estimation is a good range of data, i.e., that the demonstrator traces out paths over a wide range on the C-surface. It turned out that this was the reason for less accurate estimates in our case. There were two reasons why a path of limited range may be traced out on a c∗m in the demonstration. The first was because the c∗m was only briefly visited, e.g., in the spindle-assembly task—c∗3, c∗4. For example, c∗4 was briefly visited; existing in only two states (20 and 22) in demonstration 1. This has resulted in the less accurate estimates shown for c∗4 in the table. What is required for this type of c∗m is a larger demonstration set so that more paths on distinct parts of the C-surface become available. That is, the demonstration must contain sufficient information about a region in C-space if our method is to determine an accurate representation of the region.” (Page 307, mapping the more paths on a C-surface to identifying out-paths further apart). Further explanation of the C-space/surface can be found on Page 302, right column and Page 303.

Chen demonstrates in the above citation the need to explore further or a wider range of states to improve the accuracy of a state-based model. It would have been obvious to one of ordinary skill in the art at the time of the Applicant’s filing, knowing the demonstrable improvements to state representation accuracy highlighted by Chen, that identifying and/or exploring out-paths over a wide range of states would be beneficial in a system such as Wang’s.
While Wang teaches “The devices are also configured to graphically or numerically display the one or more policies relating to the health-related episode.” ([0010]) ([0012], [0111], etc. for more about displaying the policies generated), Wang fails to explicitly teach: “generating, by one or more computer processors, explanations for the deep reinforcement learning policy…” However, Dalli, in a similar field of reinforcement learning, teaches “XRL introduces the concept of explanations as part of the RL agent model and optionally the world/environment model. An exemplary XRL agent may incorporate explanations as part of its state space and/or its action space, giving it a richer exploratory space that combines agent generated and learnt explanations with environmentally derived and learnt explanations.” ([0042]) ([0044]-[0046] for more indications of an RL system using or generating explanations).

Dalli highlights the usefulness in RL systems being used for finding an optimal policy based on the reward ([0002]) and uses generated explanations in doing so ([0039]). Wang also utilizes reinforcement learning to optimize the cost to a patient in their medical treatment ([0012]). It would have been obvious to one of ordinary skill in the art at the time of the applicant’s filing to combine the system of Wang with at least one of the embodiments of Dalli to improve the optimization results of a reinforcement learning system.

In regards to claim 2: The present invention claims: “The computer-implemented method of claim 1, wherein identifying the one or more meta-states for each state in the set of states, comprises: computing, by one or more computer processors, an eigen representation of each state from eigen decomposition of matrix;” Wang teaches “Eq. (3) is known as the diffusion distance. It measures the similarity between future paths of two states. One is motivated by the diffusion map approach for dimension reduction.
Diffusion map refers to the leading eigenfunctions of the transfer operator of a reversible dynamical system. One can generalize it to nonreversible processes and feature spaces.” ([0156], mapping to “computing, by one or more computer processors, an eigen representation of each state”).

“randomly assigning, by one or more computer processors, each state to a meta-state; and computing, by one or more computer processors, a centroid for each assigned state and meta-state.” Wang teaches “This still yields a large state space. Further dimension reduction is desirable. The physicians were split randomly into multiple groups 1, ... , J and it was made sure that these groups have identical average cost per episode.” ([0083]) and “The contours are based on diffusion distances to the centroid in each cluster.” ([0173]).

In regards to claim 3: The present invention claims: “The computer-implemented method of claim 2, further comprising: optimizing, by one or more computer processors, the one or more identified meta-states until convergence.” Wang teaches “The one or more processors then generate a sequence of converging solutions towards an optimal value function and one or more optimal decision policy that maps the optimal decision for each possible state, before obtaining one or more policies based on at least one of the converging solutions, the one or more policies can be used to prescribe actions sequentially along a series of states.” ([0008]).
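For orientation, the steps recited in claims 2 and 3 as quoted above (an eigen representation of each state, random assignment of states to meta-states, centroid computation, and optimization until convergence) resemble a k-means-style loop over spectral coordinates. The sketch below is one possible reading under stated assumptions, not the applicant's or Wang's implementation: the function names are hypothetical, the input matrix is symmetrized so that `numpy.linalg.eigh` applies, and an empty meta-state is re-seeded from a randomly chosen state.

```python
import numpy as np

def eigen_representation(M, k):
    """Embed each state using the k leading eigenpairs of a matrix.

    Assumption: M is symmetrized first; the claim language does not
    say which matrix is decomposed or how.
    """
    S = (M + M.T) / 2.0
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(-np.abs(vals))[:k]
    return vecs[:, order] * vals[order]

def cluster_meta_states(X, n_meta, seed=0, max_iter=100):
    """Randomly assign states to meta-states, then alternate centroid
    computation and reassignment until the labels stop changing."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(n_meta, size=X.shape[0])
    for _ in range(max_iter):
        centroids = np.stack([
            X[labels == m].mean(axis=0) if np.any(labels == m)
            else X[rng.integers(X.shape[0])]  # re-seed an empty meta-state
            for m in range(n_meta)
        ])
        # Squared Euclidean distance from every state to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids

# Toy matrix with two obvious groups of three states each.
M = np.kron(np.eye(2), np.ones((3, 3)))
X = eigen_representation(M, k=2)
labels, centroids = cluster_meta_states(X, n_meta=2)
```

The loop stops as soon as one reassignment pass leaves every label unchanged, which is the sense of "until convergence" assumed here.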
In regards to claim 4: The present invention claims: “The computer-implemented method of claim 1, wherein the strategic states are identified by aggregation based on locality of the states determined by reinforcement learning policy dynamics.” Wang teaches “First, it models a decision making process as a dynamic state-transition process comprising a series of states and actions, and generate a kernel function that captures the similarity between any two states, based on sequential clinical records, where each record relates to one of a plurality of episodes.” ([0008]) and “Optionally, the one or more processors are configured with automatic feature generation and feature selection based on state aggregation learning and state embedding learning, and tensor decomposition-based state-action representation learning.” ([0016]). See also Fig. 7B.

In regards to claim 5: The present invention claims: “The computer-implemented method of claim 1, wherein selecting one or more identified strategic states for each identified meta-state employs a greedy selection algorithm.” While Wang fails to explicitly teach “a greedy selection algorithm,” Dalli teaches multiple embodiments of reinforcement learning using greedy algorithms in the optimization of a policy ([0022]-[0024]).
In regards to claim 6: The present invention claims: “The computer-implemented method of claim 1, further comprising: identifying, by one or more computer processors, one or more bottleneck states that go to different highly rewarding parts of a state space from a particular meta-state while balancing a selection of bottleneck states to be diverse.” Wang teaches “The optimal partition of state space…is computed such that…In this way, one can aggregate states into clusters and find one best action for all states belonging to the same cluster.” ([0088], mapping said best actions in each cluster to the strategic states or “bottleneck states” as per the instant application’s description of “bottleneck states” in [0039]).

In regards to claim 7: The present invention claims: “The computer-implemented method of claim 1, further comprising: generating, by one or more computer processors, a visualization of the identified meta-states and strategic states according to deep reinforcement learning policy dynamics.” Wang teaches “The devices are also configured to graphically or numerically display the one or more policies relating to the health-related episode.” ([0010]) ([0012], [0111], etc. for more about displaying the policies generated, mapping the display of policies or treatment path to a “visualization of the identified meta-states and strategic states”).

In regards to claims 8-14: Claims 8-14 recite similar limitations to claims 1-7, with the exception of “A computer program product comprising: one or more computer readable storage media and program instructions stored on the one or more computer readable storage media, the stored program instructions comprising:” of claim 8, therefore both sets of claims are similarly rejected.
In regards to claims 15-20: Claims 15-20 recite similar limitations to claims 1-6, with the exception of “A computer system comprising: one or more computer processors; one or more computer readable storage media; and program instructions stored on the computer readable storage media for execution by at least one of the one or more processors, the stored program instructions comprising:” of claim 15, therefore both sets of claims are similarly rejected.

Conclusion

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to GRIFFIN T BEAN whose telephone number is (703)756-1473. The examiner can normally be reached M - F 7:30 - 4:30.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li Zhen can be reached at (571) 272-3768. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/GRIFFIN TANNER BEAN/
Examiner, Art Unit 2121

/Li B. Zhen/
Supervisory Patent Examiner, Art Unit 2121

Prosecution Timeline

Aug 14, 2025: Examiner Interview Summary
Aug 22, 2025: Response Filed
Nov 18, 2025: Final Rejection mailed (§101, §103)
Jan 05, 2026: Interview Requested
Jan 20, 2026: Applicant Interview (Telephonic)
Jan 20, 2026: Response after Non-Final Action
Jan 20, 2026: Examiner Interview Summary
May 26, 2026: Response after Non-Final Action

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12424302: ACCELERATED MOLECULAR DYNAMICS SIMULATION METHOD ON A QUANTUM-CLASSICAL HYBRID COMPUTING SYSTEM
4y 7m to grant; granted Sep 23, 2025
Patent 12314861: SYSTEMS AND METHODS FOR SEMI-SUPERVISED LEARNING WITH CONTRASTIVE GRAPH REGULARIZATION
4y 4m to grant; granted May 27, 2025
Patent 12261947: LEARNING SYSTEM, LEARNING METHOD, AND COMPUTER PROGRAM PRODUCT
4y 1m to grant; granted Mar 25, 2025
Study what changed to get past this examiner. Based on 3 most recent grants.

Prosecution Projections

Expected OA Rounds: 2-3
Grant Probability: 23%
With Interview: 42% (+19.0%)
Median Time to Grant: 4y 5m (~5m remaining)
PTA Risk: Moderate
Based on 26 resolved cases by this examiner. Grant probability derived from career allowance rate.
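The headline figures are consistent with simple arithmetic on the counts reported on this page. The sketch below reproduces them, assuming the interview lift is applied as additive percentage points to the rounded allowance rate; the page does not state its exact formula, so that reading is an inference from the displayed numbers.

```python
# Figures reported on this page.
granted = 6
resolved = 26
interview_lift_pts = 19.0  # "+19.0%" interview lift, read as percentage points

# Career allowance rate, which the page uses as the grant probability.
allowance_rate = granted / resolved
allowance_pct = round(allowance_rate * 100)

# Assumption: the "With Interview" figure adds the lift to the rounded rate.
with_interview_pct = allowance_pct + interview_lift_pts

print(f"Allowance rate: {allowance_pct}%")
print(f"With interview: {with_interview_pct:.0f}%")
```

With only 26 resolved cases behind the rate, both percentages carry wide uncertainty and are best read as rough indicators.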
