Prosecution Insights
Last updated: April 19, 2026
Application No. 18/168,774

REINFORCEMENT LEARNING FOR OPTIMIZING CROSS-CHANNEL COMMUNICATIONS

Non-Final OA: §101, §103
Filed: Feb 14, 2023
Examiner: SMITH, KEVIN LEE
Art Unit: 2122
Tech Center: 2100 (Computer Architecture & Software)
Assignee: Optum Inc.
OA Round: 1 (Non-Final)
Grant Probability: 37% (At Risk)
Projected OA Rounds: 1-2
Projected Time to Grant: 4y 8m
Grant Probability with Interview: 55%

Examiner Intelligence

Career Allow Rate: 37% (grants only 37% of cases; 49 granted / 134 resolved; -18.4% vs TC avg)
Interview Lift: strong +18.0% allow rate in resolved cases with an interview vs without
Typical Timeline: 4y 8m average prosecution; 45 applications currently pending
Career History: 179 total applications across all art units
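As a quick sanity check, the examiner cards above are internally consistent: 49 grants out of 134 resolved cases gives the 37% career allow rate, and adding the +18-point interview lift to that baseline reproduces the 55% "With Interview" estimate. A minimal sketch of that arithmetic (variable names are illustrative, not from any dashboard API):

```python
# Career allow rate: 49 granted out of 134 resolved cases.
granted, resolved = 49, 134
allow_rate = round(100 * granted / resolved, 1)  # 36.6, displayed as ~37%

# Interview lift: +18.0 points over the no-interview baseline of 37%,
# matching the 55% "With Interview" figure on the card above.
baseline, lift = 37.0, 18.0
with_interview = baseline + lift

print(allow_rate, with_interview)
```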

Statute-Specific Performance

§101: 30.7% (-9.3% vs TC avg)
§103: 36.4% (-3.6% vs TC avg)
§102: 10.1% (-29.9% vs TC avg)
§112: 17.3% (-22.7% vs TC avg)

Tech Center averages are estimates. Based on career data from 134 resolved cases.
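One detail worth noting in the table above: subtracting each statute's delta from its rate implies the same Tech Center average for all four statutes. A small sketch of that arithmetic (the `rows` mapping simply transcribes the table; the variable names are illustrative):

```python
# Recover the implied Tech Center average from each statute row:
# implied average = examiner rate - delta vs TC average.
rows = {
    "101": (30.7, -9.3),
    "103": (36.4, -3.6),
    "102": (10.1, -29.9),
    "112": (17.3, -22.7),
}
implied_tc_avg = {s: round(rate - delta, 1) for s, (rate, delta) in rows.items()}
print(implied_tc_avg)  # each statute implies a 40.0% TC average
```

Every row implies a TC average of 40.0%, which suggests the dashboard is comparing each statute against a single TC-wide baseline rather than per-statute averages.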

Office Action

Rejections: §101, §103
DETAILED ACTION

1. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

2. This communication is in response to the Applicant’s submission filed 14 February 2023, where: Claims 1-20 are pending. Claims 1-20 are rejected.

Information Disclosure Statement

3. An information disclosure statement was submitted on 30 May 2023. The submission complies with the provisions of 37 CFR 1.97. Accordingly, the Examiner considered the information disclosure statement.

Claim Rejections - 35 U.S.C. § 101

4. 35 U.S.C. § 101 reads as follows: Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

5. Claims 1-20 are rejected under 35 U.S.C. § 101 because the claimed invention is directed to an abstract idea without significantly more. Claim 1 recites a computer-implemented method, which is a process, and thus one of the statutory categories of patentable subject matter. (35 U.S.C. § 101). However, under Step 2A Prong One, the claim recites the limitations of “[(c)] transforming, by the one or more processors and using a state encoder machine learning model, the historical event sequence data into one or more sequence embeddings comprising fixed-length vectors,” and “[(d)] generating, by the one or more processors and using a predictive software agent machine learning model, a prediction output comprising one or more optimal agent actions comprising at least a best agent action based on the one or more sequence embeddings and the one or more Boolean flags.” These activities of “[(c)] transforming . . . the historical event sequence data into one or more sequence embeddings” and “[(d)] generating . . . 
a prediction output” are limitations that can practically be performed in the human mind, including, for example, observations, evaluations, judgments, and opinions, and accordingly, are mental processes, (MPEP § 2106.04(a)(2) sub III), which is one of the groupings of abstract ideas. (MPEP § 2106.04(a)(2)). The claim also recites more details or specifics to the abstract idea of “[(d)] generating . . . a prediction output,” where “[(d.1)](i) the best agent action comprises a highest-scoring action,” and “[(d.1)](ii) the highest-scoring action is determined from an action distribution based on an action reward value associated with the highest-scoring action,” and accordingly, are merely more specific to the abstract idea.

The claim also recites the limitations of “[(d.1)](iii) the predictive software agent machine learning model is trained over one or more training intervals,” which is not a judicial exception; however, the limitation continues reciting limitations that provide further specifics or details of “[(d.1)(iii)](b) generating an action reward value for each of the one or more training agent interactions by applying a reward function to the one or more training agent interactions,” “[(d.1)(iii)](c) for each of one or more target client computing entities, combining selected ones of the one or more training agent interactions associated with the entity into a historical episode,” “[(d.1)(iii)](e) generating a plurality of historical episode combinations, each of the plurality of historical episode combinations comprising N historical episodes and at least one most recent historical episode selected from the plurality of historical episodes,” and “[(d.1)(iii)](f) determining one or more of the plurality of historical episode combinations associated with optimal model parameters.” The activities of “[(d.1)(iii)](b) generating an action reward value,” “[(d.1)(iii)](c) combining selected ones of the one or more training agent interactions,” “[(d.1)(iii)](e) 
generating a plurality of historical episode combinations,” and “[(d.1)(iii)](f) determining combinations associated with optimal model parameters” are limitations that can practically be performed in the human mind, including, for example, observations, evaluations, judgments, and opinions, and accordingly, are mental processes, (MPEP § 2106.04(a)(2) sub III), which is one of the groupings of abstract ideas. (MPEP § 2106.04(a)(2)). Accordingly, claim 1 recites an abstract idea.

Under Step 2A Prong Two, the claim as a whole is not integrated into a practical application, because the additional elements recited in the claim beyond the identified judicial exception include “one or more processors” and a “historical database,” which are generic computer components used to implement the abstract idea, (MPEP § 2106.05(f)), that do not serve to integrate the abstract idea into a practical application. Also, the claim recites “a state encoder machine learning model,” and a “predictive software agent machine learning model,” which are recited at a high level of generality, and accordingly are generic computer components used to implement the abstract idea, (MPEP § 2106.05(f)), that do not serve to integrate the abstract idea into a practical application. The claim also recites “[(a)] receiving, by one or more processors, an agent action query comprising an entity identifier and one or more Boolean flags,” “[(b)] receiving, by the one or more processors, historical event sequence data associated with the entity identifier,” and “[(e)] initiating, by the one or more processors, the performance of the one or more optimal agent actions.” The activities of “[(a), (b)] receiving” and “[(e)] initiating” are pre-processing and post-processing insignificant extra-solution activities of data gathering and data output, respectively, (MPEP § 2106.05(g)), that do not serve to integrate the abstract idea into a practical application.
Further, the plain meaning of “[(e)] initiating, by the one or more processors, the performance of the one or more optimal agent actions” includes outputting data, which is not inconsistent with the Applicant’s disclosure, (MPEP § 2111; see, e.g., Specification ¶ 0081), and accordingly, under a broadest reasonable interpretation, is directed to the post-processing insignificant extra-solution activity of data output, (MPEP § 2106.05(g)), that does not serve to integrate the abstract idea into a practical application. Also, the claim recites more details or specifics to the additional element of “[(b)] receiving”: “[(b.1)] wherein the historical event sequence data comprises state representation data, agent action data, and agent interaction data,” and accordingly, is merely more specific to the additional element. Also, the claim recites “[(d.1)](iii) the predictive software agent machine learning model is trained over one or more training intervals,” which is the use of a generic computer component (predictive software agent machine learning model) in the ordinary and customary manner to implement the abstract idea, (MPEP § 2106.05(f)), that does not serve to integrate the abstract idea into a practical application. The limitation further recites the additional elements of “[(d.1)(iii)](a) retrieving training agent interaction data comprising one or more training agent interactions associated with respective one or more training agent actions” and “[(d.1)(iii)](d) storing the historical episode to a historical database comprising a plurality of historical episodes,” which are pre-processing insignificant extra-solution activities of data gathering, (MPEP § 2106.05(g)), that do not serve to integrate the abstract idea into a practical application.
The claim also recites more details or specifics of the additional element of “[(d.1)(iii)](a) retrieving training agent interaction data,” where “the training agent interaction data recorded over a selected period of time for a respective training interval,” which merely provides more specifics to the additional element. Therefore, claim 1 is directed to the abstract idea.

Finally, under Step 2B, the additional elements, taken alone or in combination, do not represent significantly more than the abstract idea itself. The additional elements recited in the claim beyond the identified judicial exception include “one or more processors” and a “historical database,” which are generic computer components used to implement the abstract idea, (MPEP § 2106.05(f)), that do not amount to significantly more than the abstract idea. Also, the claim recites “a state encoder machine learning model,” and a “predictive software agent machine learning model,” which are recited at a high level of generality, and accordingly are generic computer components used to implement the abstract idea, (MPEP § 2106.05(f)), that do not amount to significantly more than the abstract idea. The claim also recites “[(a)] receiving, by one or more processors, an agent action query comprising an entity identifier and one or more Boolean flags,” and “[(b)] receiving, by the one or more processors, historical event sequence data associated with the entity identifier,” which are well-understood, routine, and conventional activities of receiving or transmitting data over a network, (MPEP § 2106.05(d) sub II.i), that do not amount to significantly more than the abstract idea.
Also, the claim recites “[(e)] initiating, by the one or more processors, the performance of the one or more optimal agent actions.” The activity of “[(e)] initiating” is a well-understood, routine, and conventional activity of transmitting data over a network, (MPEP § 2106.05(d) sub II.i), because the broadest reasonable interpretation of “[(e)] initiating the performance” includes data output in the form of a report, a script, a display, etc., which is not inconsistent with the Applicant’s disclosure, (MPEP § 2111; see, e.g., Specification ¶ 0081), and does not amount to significantly more than the abstract idea. Also, the claim recites more details or specifics to the additional element of “[(b)] receiving”: “[(b.1)] wherein the historical event sequence data comprises state representation data, agent action data, and agent interaction data,” and accordingly, is merely more specific to the additional element. Also, the claim recites “[(d.1)](iii) the predictive software agent machine learning model is trained over one or more training intervals,” which is the use of a generic computer component (predictive software agent machine learning model) in the ordinary and customary manner to implement the abstract idea, (MPEP § 2106.05(f)), that does not amount to significantly more than the abstract idea. The limitation further recites the additional elements of “[(d.1)(iii)](a) retrieving training agent interaction data comprising one or more training agent interactions associated with respective one or more training agent actions” and “[(d.1)(iii)](d) storing the historical episode to a historical database comprising a plurality of historical episodes,” which are well-understood, routine, and conventional activities of retrieving and storing information in memory, (MPEP § 2106.05(d) sub II.iv), that do not amount to significantly more than the abstract idea.
The claim also recites more details or specifics of the additional element of “[(d.1)(iii)](a) retrieving training agent interaction data,” where “the training agent interaction data recorded over a selected period of time for a respective training interval,” which merely provides more specifics to the additional element. Therefore, claim 1 is subject-matter ineligible.

Claim 9 recites a computing apparatus, which is a product, and thus one of the statutory categories of patentable subject matter. (35 U.S.C. § 101). However, under Step 2A Prong One, the claim recites the limitations of “[(c)] transform, using a state encoder machine learning model, the historical event sequence data into one or more sequence embeddings comprising fixed-length vectors,” and “[(d)] generate, using a predictive software agent machine learning model, a prediction output comprising one or more optimal agent actions comprising at least a best agent action based on the one or more sequence embeddings and the one or more Boolean flags.” These activities of “[(c)] transform . . . the historical event sequence data into one or more sequence embeddings” and “[(d)] generate . . . a prediction output” are limitations that can practically be performed in the human mind, including, for example, observations, evaluations, judgments, and opinions, and accordingly, are mental processes, (MPEP § 2106.04(a)(2) sub III), which is one of the groupings of abstract ideas. (MPEP § 2106.04(a)(2)). The claim also recites more details or specifics to the abstract idea of “[(d)] generate . . . a prediction output,” where “[(d.1)](i) the best agent action comprises a highest-scoring action,” and “[(d.1)](ii) the highest-scoring action is determined from an action distribution based on an action reward value associated with the highest-scoring action,” and accordingly, are merely more specific to the abstract idea.
The claim also recites the limitations of “[(d.1)](iii) the predictive software agent machine learning model is trained over one or more training intervals,” which is not a judicial exception; however, the limitation continues reciting limitations that provide further specifics or details of “[(d.1)(iii)](b) generate an action reward value for each of the one or more training agent interactions by applying a reward function to the one or more training agent interactions,” “[(d.1)(iii)](c) for each of one or more target client computing entities, combining selected ones of the one or more training agent interactions associated with the entity into a historical episode,” “[(d.1)(iii)](e) generating a plurality of historical episode combinations, each of the plurality of historical episode combinations comprising N historical episodes and at least one most recent historical episode selected from the plurality of historical episodes,” and “[(d.1)(iii)](f) determining one or more of the plurality of historical episode combinations associated with optimal model parameters.” The activities of “[(d.1)(iii)](b) generating an action reward value,” “[(d.1)(iii)](c) combining selected ones of the one or more training agent interactions,” “[(d.1)(iii)](e) generating a plurality of historical episode combinations,” and “[(d.1)(iii)](f) determining combinations associated with optimal model parameters” are limitations that can practically be performed in the human mind, including, for example, observations, evaluations, judgments, and opinions, and accordingly, are mental processes, (MPEP § 2106.04(a)(2) sub III), which is one of the groupings of abstract ideas. (MPEP § 2106.04(a)(2)). Accordingly, claim 9 recites an abstract idea.
Under Step 2A Prong Two, the claim as a whole is not integrated into a practical application, because the additional elements recited in the claim beyond the identified judicial exception include a “computing apparatus comprising memory and one or more processors” and a “historical database,” which are generic computer components used to implement the abstract idea, (MPEP § 2106.05(f)), that do not serve to integrate the abstract idea into a practical application. Also, the claim recites “a state encoder machine learning model,” and a “predictive software agent machine learning model,” which are recited at a high level of generality, and accordingly are generic computer components used to implement the abstract idea, (MPEP § 2106.05(f)), that do not serve to integrate the abstract idea into a practical application. The claim also recites “[(a)] receive an agent action query comprising an entity identifier and one or more Boolean flags,” “[(b)] receive historical event sequence data associated with the entity identifier,” and “[(e)] initiate the performance of the one or more optimal agent actions.” The activities of “[(a), (b)] receiving” and “[(e)] initiating” are pre-processing and post-processing insignificant extra-solution activities of mere data gathering and data output, respectively, (MPEP § 2106.05(g)), that do not serve to integrate the abstract idea into a practical application. Further, the plain meaning of “[(e)] initiate the performance of the one or more optimal agent actions” includes outputting data, which is not inconsistent with the Applicant’s disclosure, (MPEP § 2111; see, e.g., Specification ¶ 0081), and accordingly, under a broadest reasonable interpretation, is directed to the post-processing insignificant extra-solution activity of data output, (MPEP § 2106.05(g)), that does not serve to integrate the abstract idea into a practical application.
Also, the claim recites more details or specifics to the additional element of “[(b)] receive”: “[(b.1)] wherein the historical event sequence data comprises state representation data, agent action data, and agent interaction data,” and accordingly, is merely more specific to the additional element. Also, the claim recites “[(d.1)](iii) the predictive software agent machine learning model is trained over one or more training intervals,” which is the use of a generic computer component (predictive software agent machine learning model) in the ordinary and customary manner to implement the abstract idea, (MPEP § 2106.05(f)), that does not serve to integrate the abstract idea into a practical application. The limitation further recites the additional elements of “[(d.1)(iii)](a) retrieving training agent interaction data comprising one or more training agent interactions associated with respective one or more training agent actions” and “[(d.1)(iii)](d) storing the historical episode to a historical database comprising a plurality of historical episodes,” which are pre-processing insignificant extra-solution activities of mere data gathering, (MPEP § 2106.05(g)), that do not serve to integrate the abstract idea into a practical application. The claim also recites more details or specifics of the additional element of “[(d.1)(iii)](a) retrieving training agent interaction data,” where “the training agent interaction data recorded over a selected period of time for a respective training interval,” which merely provides more specifics to the additional element. Therefore, claim 9 is directed to the abstract idea.

Finally, under Step 2B, the additional elements, taken alone or in combination, do not represent significantly more than the abstract idea itself.
The additional elements recited in the claim beyond the identified judicial exception include a “computing apparatus comprising memory and one or more processors” and a “historical database,” which are generic computer components used to implement the abstract idea, (MPEP § 2106.05(f)), that do not amount to significantly more than the abstract idea. Also, the claim recites “a state encoder machine learning model,” and a “predictive software agent machine learning model,” which are recited at a high level of generality, and accordingly are generic computer components used to implement the abstract idea, (MPEP § 2106.05(f)), that do not amount to significantly more than the abstract idea. The claim also recites “[(a)] receive an agent action query comprising an entity identifier and one or more Boolean flags,” and “[(b)] receive historical event sequence data associated with the entity identifier,” which are well-understood, routine, and conventional activities of receiving or transmitting data over a network, (MPEP § 2106.05(d) sub II.i), that do not amount to significantly more than the abstract idea. Also, the claim recites “[(e)] initiate the performance of the one or more optimal agent actions.” The activity of “[(e)] initiating” is a well-understood, routine, and conventional activity of transmitting data over a network, (MPEP § 2106.05(d) sub II.i), because the broadest reasonable interpretation of “[(e)] initiating the performance” includes data output in the form of a report, a script, a display, etc., which is not inconsistent with the Applicant’s disclosure, (MPEP § 2111; see, e.g., Specification ¶ 0081), and does not amount to significantly more than the abstract idea.
Also, the claim recites more details or specifics to the additional element of “[(b)] receive”: “[(b.1)] wherein the historical event sequence data comprises state representation data, agent action data, and agent interaction data,” and accordingly, is merely more specific to the additional element. Also, the claim recites “[(d.1)](iii) the predictive software agent machine learning model is trained over one or more training intervals,” which is the use of a generic computer component (predictive software agent machine learning model) in the ordinary and customary manner to implement the abstract idea, (MPEP § 2106.05(f)), that does not amount to significantly more than the abstract idea. The limitation further recites the additional elements of “[(d.1)(iii)](a) retrieving training agent interaction data comprising one or more training agent interactions associated with respective one or more training agent actions” and “[(d.1)(iii)](d) storing the historical episode to a historical database comprising a plurality of historical episodes,” which are well-understood, routine, and conventional activities of retrieving and storing information in memory, (MPEP § 2106.05(d) sub II.iv), that do not amount to significantly more than the abstract idea. The claim also recites more details or specifics of the additional element of “[(d.1)(iii)](a) retrieving training agent interaction data,” where “the training agent interaction data recorded over a selected period of time for a respective training interval,” which merely provides more specifics to the additional element. Therefore, claim 9 is subject-matter ineligible.

Claim 16 recites one or more non-transitory computer-readable storage media, which is a product, and thus one of the statutory categories of patentable subject matter. (35 U.S.C. § 101).
However, under Step 2A Prong One, the claim recites the limitations of “[(c)] transform, using a state encoder machine learning model, the historical event sequence data into one or more sequence embeddings comprising fixed-length vectors,” and “[(d)] generate, using a predictive software agent machine learning model, a prediction output comprising one or more optimal agent actions comprising at least a best agent action based on the one or more sequence embeddings and the one or more Boolean flags.” These activities of “[(c)] transform . . . the historical event sequence data into one or more sequence embeddings” and “[(d)] generate . . . a prediction output” are limitations that can practically be performed in the human mind, including, for example, observations, evaluations, judgments, and opinions, and accordingly, are mental processes, (MPEP § 2106.04(a)(2) sub III), which is one of the groupings of abstract ideas. (MPEP § 2106.04(a)(2)). The claim also recites more details or specifics to the abstract idea of “[(d)] generate . . . a prediction output,” where “[(d.1)](i) the best agent action comprises a highest-scoring action,” and “[(d.1)](ii) the highest-scoring action is determined from an action distribution based on an action reward value associated with the highest-scoring action,” and accordingly, are merely more specific to the abstract idea. 
The claim also recites the limitations of “[(d.1)](iii) the predictive software agent machine learning model is trained over one or more training intervals,” which is not a judicial exception; however, the limitation continues reciting limitations that provide further specifics or details of “[(d.1)(iii)](b) generate an action reward value for each of the one or more training agent interactions by applying a reward function to the one or more training agent interactions,” “[(d.1)(iii)](c) for each of one or more target client computing entities, combining selected ones of the one or more training agent interactions associated with the entity into a historical episode,” “[(d.1)(iii)](e) generating a plurality of historical episode combinations, each of the plurality of historical episode combinations comprising N historical episodes and at least one most recent historical episode selected from the plurality of historical episodes,” and “[(d.1)(iii)](f) determining one or more of the plurality of historical episode combinations associated with optimal model parameters.” The activities of “[(d.1)(iii)](b) generating an action reward value,” “[(d.1)(iii)](c) combining selected ones of the one or more training agent interactions,” “[(d.1)(iii)](e) generating a plurality of historical episode combinations,” and “[(d.1)(iii)](f) determining combinations associated with optimal model parameters” are limitations that can practically be performed in the human mind, including, for example, observations, evaluations, judgments, and opinions, and accordingly, are mental processes, (MPEP § 2106.04(a)(2) sub III), which is one of the groupings of abstract ideas. (MPEP § 2106.04(a)(2)). Accordingly, claim 16 recites an abstract idea.
Under Step 2A Prong Two, the claim as a whole is not integrated into a practical application, because the additional elements recited in the claim beyond the identified judicial exception include “one or more non-transitory computer-readable storage media including instructions that, when executed by one or more processors” and a “historical database,” which are generic computer components used to implement the abstract idea, (MPEP § 2106.05(f)), that do not serve to integrate the abstract idea into a practical application. Also, the claim recites “a state encoder machine learning model,” and a “predictive software agent machine learning model,” which are recited at a high level of generality, and accordingly are generic computer components used to implement the abstract idea, (MPEP § 2106.05(f)), that do not serve to integrate the abstract idea into a practical application. The claim also recites “[(a)] receive an agent action query comprising an entity identifier and one or more Boolean flags,” “[(b)] receive historical event sequence data associated with the entity identifier,” and “[(e)] initiate the performance of the one or more optimal agent actions.” The activities of “[(a), (b)] receiving” and “[(e)] initiating” are pre-processing and post-processing insignificant extra-solution activities of mere data gathering and data output, respectively, (MPEP § 2106.05(g)), that do not serve to integrate the abstract idea into a practical application. Further, the plain meaning of “[(e)] initiate the performance of the one or more optimal agent actions” includes outputting data, which is not inconsistent with the Applicant’s disclosure, (MPEP § 2111; see, e.g., Specification ¶ 0081), and accordingly, under a broadest reasonable interpretation, is directed to the post-processing insignificant extra-solution activity of data output, (MPEP § 2106.05(g)), that does not serve to integrate the abstract idea into a practical application.
Also, the claim recites more details or specifics to the additional element of “[(b)] receive”: “[(b.1)] wherein the historical event sequence data comprises state representation data, agent action data, and agent interaction data,” and accordingly, is merely more specific to the additional element. Also, the claim recites “[(d.1)](iii) the predictive software agent machine learning model is trained over one or more training intervals,” which is the use of a generic computer component (predictive software agent machine learning model) in the ordinary and customary manner to implement the abstract idea, (MPEP § 2106.05(f)), that does not serve to integrate the abstract idea into a practical application. The limitation further recites the additional elements of “[(d.1)(iii)](a) retrieving training agent interaction data comprising one or more training agent interactions associated with respective one or more training agent actions” and “[(d.1)(iii)](d) storing the historical episode to a historical database comprising a plurality of historical episodes,” which are pre-processing insignificant extra-solution activities of mere data gathering, (MPEP § 2106.05(g)), that do not serve to integrate the abstract idea into a practical application. The claim also recites more details or specifics of the additional element of “[(d.1)(iii)](a) retrieving training agent interaction data,” where “the training agent interaction data recorded over a selected period of time for a respective training interval,” which merely provides more specifics to the additional element. Therefore, claim 16 is directed to the abstract idea.

Finally, under Step 2B, the additional elements, taken alone or in combination, do not represent significantly more than the abstract idea itself.
The additional elements recited in the claim beyond the identified judicial exception include “one or more non-transitory computer-readable storage media including instructions that, when executed by one or more processors” and a “historical database,” which are generic computer components used to implement the abstract idea, (MPEP § 2106.05(f)), that do not amount to significantly more than the abstract idea. Also, the claim recites “a state encoder machine learning model,” and a “predictive software agent machine learning model,” which are recited at a high level of generality, and accordingly are generic computer components used to implement the abstract idea, (MPEP § 2106.05(f)), that do not amount to significantly more than the abstract idea. The claim also recites “[(a)] receive an agent action query comprising an entity identifier and one or more Boolean flags,” and “[(b)] receive historical event sequence data associated with the entity identifier,” which are well-understood, routine, and conventional activities of receiving or transmitting data over a network, (MPEP § 2106.05(d) sub II.i), that do not amount to significantly more than the abstract idea. Also, the claim recites “[(e)] initiate the performance of the one or more optimal agent actions.” The activity of “[(e)] initiating” is a well-understood, routine, and conventional activity of transmitting data over a network, (MPEP § 2106.05(d) sub II.i), because the broadest reasonable interpretation of “[(e)] initiating the performance” includes data output in the form of a report, a script, a display, etc., which is not inconsistent with the Applicant’s disclosure, (MPEP § 2111; see, e.g., Specification ¶ 0081), and does not amount to significantly more than the abstract idea.
Also, the claim recites more details or specifics to the additional element of “[(b)] receive”: “[(b.1)] wherein the historical event sequence data comprises state representation data, agent action data, and agent interaction data,” and accordingly, is merely more specific to the additional element. Also, the claim recites “[(d.1)](iii) the predictive software agent machine learning model is trained over one or more training intervals,” which is the use of a generic computer component (predictive software agent machine learning model) in the ordinary and customary manner to implement the abstract idea, (MPEP § 2106.05(f)), that does not amount to significantly more than the abstract idea. The limitation further recites the additional elements of “[(d.1)(iii)](a) retrieving training agent interaction data comprising one or more training agent interactions associated with respective one or more training agent actions” and “[(d.1)(iii)](d) storing the historical episode to a historical database comprising a plurality of historical episodes,” which are well-understood, routine, and conventional activities of retrieving and storing information in memory, (MPEP § 2106.05(d) sub II.iv), that do not amount to significantly more than the abstract idea. The claim also recites more details or specifics of the additional element of “[(d.1)(iii)](a) retrieving training agent interaction data,” where “the training agent interaction data recorded over a selected period of time for a respective training interval,” which merely provides more specifics to the additional element. Therefore, claim 16 is subject-matter ineligible.

Claims 2, 7, and 8 depend directly or indirectly from claim 1. Claims 10, 14, and 15 depend directly or indirectly from claim 9. Claims 19 and 20 depend directly or indirectly from claim 16.
The claims recite more details or specifics of the “predictive software agent machine learning model,” (claims 2 and 10: “wherein the predictive software agent machine learning model comprises a reinforcement learning machine learning model”; claims 7, 14, and 19: “wherein the predictive software agent machine learning model comprises a deep Q network”; claims 8, 15, and 20: “wherein the deep Q network comprises an exploration phase and an exploitation phase”), and accordingly are merely more specific to the additional element. The abstract idea of these claims is not integrated into a practical application, (see MPEP § 2106.04(d)), nor do the claims amount to significantly more than the abstract idea, (MPEP § 2106.05 sub I; see also MPEP § 2106.05(a) – (h)), because the claims recite no more than the abstract idea. Therefore, claims 2, 7, 8, 10, 14, 15, 19, and 20 are subject-matter ineligible. Claim 3 depends from claim 1. Claim 11 depends from claim 9. Claim 17 depends from claim 16. The claims recite further limitations of “[(f)] monitoring one or more agent interactions associated with the performance of the one or more prediction-based actions” and “[(g)] generating the training agent interaction data based on the monitored one or more agent interactions.” These activities of “[(f)] monitoring” and “[(g)] generating the training agent interaction data” are limitations that can practically be performed in the human mind, including, for example, observations, evaluations, judgments, and opinions, and accordingly are mental processes, (MPEP § 2106.04(a)(2) sub III), which is one of the groupings of abstract ideas. (MPEP § 2106.04(a)(2)). Also, because the broadest reasonable interpretation of “[(g)] generating” covers the selecting and/or collecting of information from the “[(f)] monitoring,” which is not inconsistent with the Applicant’s disclosure, (MPEP § 2111), the activity of “[(g)] generating” is a limitation that can practically be performed in the human mind. 
The abstract idea of these claims is not integrated into a practical application, (see MPEP § 2106.04(d)), nor do the claims amount to significantly more than the abstract idea, (MPEP § 2106.05 sub I; see also MPEP § 2106.05(a) – (h)), because the claims recite no more than the abstract idea. Therefore, claims 3, 11, and 17 are subject-matter ineligible. Claim 4 depends from claim 1. The claim recites the further limitation of “[(f)] filtering agent actions from the action distribution based on the Boolean flags.” The activity of “[(f)] filtering agent actions” is a limitation that can practically be performed in the human mind, including, for example, observations, evaluations, judgments, and opinions, and accordingly is a mental process, (MPEP § 2106.04(a)(2) sub III), which is one of the groupings of abstract ideas. (MPEP § 2106.04(a)(2)). The abstract idea of this claim is not integrated into a practical application, (see MPEP § 2106.04(d)), nor does the claim amount to significantly more than the abstract idea, (MPEP § 2106.05 sub I; see also MPEP § 2106.05(a) – (h)), because the claim recites no more than the abstract idea. Therefore, claim 4 is subject-matter ineligible. Claim 5 depends from claim 1. Claim 12 depends from claim 9. Claim 18 depends from claim 16. The claims recite more details or specifics of the abstract idea of “[(c)] transforming the state representation into one or more sequence embeddings,” by “[(c.1)] tokenizing the historical event sequence data” and “[(c.2)] normalizing the tokenized historical event sequence data into the fixed-length vectors,” which are merely more specific to the abstract idea. 
Also, the activities of “[(c.1)] tokenizing” and “[(c.2)] normalizing the tokenized historical event sequence data into the fixed-length vectors” are converting information from one format to another, and thus are limitations that can practically be performed in the human mind, including, for example, observations, evaluations, judgments, and opinions, and accordingly are mental processes, (MPEP § 2106.04(a)(2) sub III), which is one of the groupings of abstract ideas. (MPEP § 2106.04(a)(2)). Therefore, claims 5, 12, and 18 are subject-matter ineligible. Claim 6 depends from claim 1. Claim 13 depends from claim 9. The claims recite more details or specifics of the abstract idea of “[(d)] generating . . . a prediction output,” “[(d.1)](ii) wherein the action distribution comprises a set of one or more agent actions associated with respective action reward values,” and accordingly are merely more specific to the abstract idea. The abstract idea of these claims is not integrated into a practical application, (see MPEP § 2106.04(d)), nor do the claims amount to significantly more than the abstract idea, (MPEP § 2106.05 sub I; see also MPEP § 2106.05(a) – (h)), because the claims recite no more than the abstract idea. Therefore, claims 6 and 13 are subject-matter ineligible. Claim Rejections – 35 U.S.C. § 103 6. The following is a quotation of 35 U.S.C. § 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made. 7. 
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. § 103 are summarized as follows: 1. Determining the scope and contents of the prior art. 2. Ascertaining the differences between the prior art and the claims at issue. 3. Resolving the level of ordinary skill in the pertinent art. 4. Considering objective evidence present in the application indicating obviousness or nonobviousness. 8. This application currently names joint inventors. In considering patentability of the claims the Examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the Examiner to consider the applicability of 35 U.S.C. § 102(b)(2)(C) for any potential 35 U.S.C. § 102(a)(2) prior art against the later invention. 9. Claims 1-20 are rejected under 35 U.S.C. § 103 as being unpatentable over US Published Application 20210150417 to Fadel et al. [hereinafter Fadel] in view of US Published Application 20220058345 to Guo et al. [hereinafter Guo]. 
Regarding claims 1, 9, and 16, Fadel teaches [a] computer-implemented method (Fadel, Abstract, teaches “method for reinforcement machine learning uses a reinforcement learning system that has an environment and an agent”) of claim 1, [a] computing apparatus comprising memory and one or more processors (Fadel ¶ 0079 teaches “a system is provided that includes one or more processors which alone or in combination, are configured to provide for execution of a method for reinforcement machine learning using a reinforcement learning system”) of claim 9, and [o]ne or more non-transitory computer-readable storage media (Fadel ¶ 0182 teaches “[e]xamples of memory 604 include a non-transitory computer-readable media”) of claim 16, comprising: [(a)] receiving, by one or more processors (Fadel ¶ 0079 teaches “a system is provided that includes one or more processors which alone or in combination, are configured to provide for execution of a method for reinforcement machine learning using a reinforcement learning system [(that is, by one or more processors)]”), an agent action query comprising an entity identifier and one or more Boolean flags (Fadel, Fig. 2, teaches a weakly supervised reinforcement learning system 200 [Examiner annotations in dashed-line text boxes; image of Fig. 2 omitted]: Fadel ¶ 0108 teaches “the RL system 200 can be initialized by building the problem to be solved into the framework of the RL system 200 . . . by defining elements of the RL system 200, including defining the agent 240 (including its policy) [(that is, an entity identifier)]”; Fadel ¶ 0110 teaches “constraining functions 220 are programmable knowledge functions that constrain the behavior of the agent 240. Each constraining function ƒi(⋅) takes as its input the state of the environment 250 at time step t, and outputs an action mask . . . 
in the form of a vector with an entry for each action in the action space, where the value of each entry is either a 1 or 0 [(that is, an “action mask” is one or more Boolean flags)]”); further, Fadel ¶ 0118 teaches “the M vectors from the constraining functions 220, the N vectors from the guide functions 230, and the vector from the agent 240 are sent to [(that is, receiving)] an ensemble 260 [(that is, receiving . . . an agent action query comprising an entity identifier and one or more Boolean flags)]”; [Examiner notes that the plain meaning of the term “Boolean flag” is a variable that represents a state with two possible values: TRUE or FALSE, and may be presented in forms such as a flag register (or vector), a mask, a filter, etc.; accordingly, the broadest reasonable interpretation of the term “one or more Boolean flags” covers the teachings of Fadel, which is not inconsistent with the Applicant’s disclosure, (MPEP § 2111)]); [(b)] receiving, by the one or more processors, historical event sequence data associated with the entity identifier (Fadel ¶ 0032 teaches “to leverage preexisting domain knowledge to constrain and/or guide the RL agent's [(that is, entity identifier)] actions, a human expert may program knowledge functions that encapsulate learned constraints and guidance [(that is, “learned constraints and guidance” is receiving . . . historical event sequence data associated with the entity identifier)]”), [(b.1)] wherein the historical event sequence data comprises state representation data, agent action data, and agent interaction data (Fadel ¶¶ 0101-02 teaches “[t]he goal of the agent 110 is to find a policy that maximizes a cumulative reward in the long run so that the value function can be used to determine the selected action for each state. 
To do this, the agent 110 learns from its experience [(that is, the historical event sequence data)] for each performed action at [(that is, agent action data)], and then uses the collected observations (st+1, rt+1) [(that is, “st+1” is state representation data, and reward “rt+1” is agent interaction data)] [for each performed action at] to optimize its policy π based on different models of the value function, such as a Tabular model (see Sutton) or a deep neural network model (see Mnih)”); [(c)] transforming, by the one or more processors and using a state encoder machine learning model (Fadel ¶ 0171 & Fig. 4 teaches an “embodiment of Tutor4RL was implemented by modifying the Deep Q-Networks (DQN) agent (see Mnih), using the library Keras-RL (see Plappert et al, keras-rl (2016) available at github.com, the entire contents of which is hereby incorporated by reference herein) along with Tensorflow [(that is, a state encoder machine learning model)]”), the historical event sequence data into one or more sequence embeddings comprising fixed-length vectors (Fadel, Fig. 4, teaches a system that possesses external knowledge and interacts with the agent during training [Examiner annotations in dashed-line text boxes; image of Fig. 4 omitted]: Fadel ¶ 0162 teaches the “tutor 410 takes as an input the state s of the environment 430 and outputs the action a to take, in a similar way to the agent's policy. In the embodiment shown in FIG. 4, the tutor 410 is passed the state s (and potentially the reward r) by the agent 420. The tutor is also shown as sending its output (i.e., the chosen action a) to the agent 420”; Fadel ¶ 0037 teaches “[t]he agent has a policy providing a mapping between states of the environment and actions”; Fadel ¶ 0165 and Fig. 4 teaches “[t]he tutor 410 is implemented using programmable functions, in which external knowledge is used to decide the mapping between states s and actions a. 
These programmable functions have been referred to as knowledge functions herein”; Fadel ¶ 0167 teaches “[a]t each time step t, a constrain function takes the state s of the environment 430 as an input, and then returns a vector to indicate whether an action ai in the action space A could be taken or not using the value 1 or 0 (1 representing the action is enabled, and 0 representing the action is disabled and cannot be performed for this state s). Thus, the tutor 410 can implement constrain functions to provide a mask for avoiding unnecessary actions for certain states”; Fadel ¶ 0111 teaches “each constraining function is defined by creating an action mask (e.g., in the form of a vector of the actions, each with an enabled/disabled indication) for a particular state where it is known that certain actions are impossible or otherwise ineffective [(that is, one or more sequence embeddings comprising fixed length vectors)]”; [Examiner notes that a vector inherently has a fixed length, and accordingly, the broadest reasonable interpretation of the term “fixed length vectors” covers the teachings of Fadel, which is not inconsistent with the Applicant’s disclosure. (MPEP § 2111)]); [(d)] generating, by the one or more processors and using a predictive software agent machine learning model (Fadel ¶ 0153 teaches “[o]nce the model is instantiated and the knowledge functions are initialized, the RL system [(that is, “RL system” is using a predictive software agent machine learning model)] can begin operation”), a prediction output comprising one or more optimal agent actions comprising at least a best agent action (Fadel ¶ 0154 teaches “[a]fter observing the environment, the model applies those observations (Operation 304). Applying the observations can include the agent using its policy to determine its output based on the observed state (with or without the observed reward). 
As described herein, the output of the agent's policy can include a selection of an action predicted [(that is, a prediction output)] to obtain the highest reward [(that is, with an action obtaining a highest reward is a prediction output comprising one or more optimal agent actions comprising at least a best agent action)]”) based on the one or more sequence embeddings (Fadel, claim 1, teaches “the agent having a policy providing a mapping between states of the environment and actions [(that is, the “policy providing a mapping” is a prediction output . . . based on the one or more sequence embeddings)]”) and the one or more Boolean flags (Fadel ¶ 0154 teaches that “when the knowledge function is a constraining function, an action mask may be generated as its output [(that is, the “action mask” is a prediction output . . . based on the one or more Boolean flags)], and when the knowledge function is a guide function, an action rating may be generated as its output”), [(d.1)] wherein: [(d.1)](i) the best agent action comprises a highest-scoring action (Fadel ¶ 0095 teaches “[t]he values that a guide function outputs can be interpreted as the expected reward [(that is, a scoring action)] after performing each action, (e.g., similar to the values that the agent's policy outputs). In the end, the best action is the action with the highest value [(that is, the best agent action comprises a highest-scoring action)]”), [(d.1)](ii) the highest-scoring action is determined from an action distribution based on an action reward value associated with the highest-scoring action (Fadel ¶ 0095 teaches “[g]uide functions are also programmable knowledge functions and express guidelines for the agent's behavior. These functions take, as input, the current RL state and reward, and output a vector of size L [(that is, the “vector” is an action distribution based on an action reward value)], assigning a value that represents how ‘good’ is each action. . . . 
In the end, the best action is the action with the highest value [(that is, the highest-scoring action is . . . an action reward value associated with the highest-scoring action)]”), [(d.1)](iii) the predictive software agent machine learning model is trained over one or more training intervals (Fadel ¶ 0169 teaches “[f]irst, during training, the tutor 410 enables a reasonable performance by the agent 420 (as compared to an unreliable performance from an inexperienced agent), while generating experience for training [(that is, the predictive software agent machine learning model is trained over one or more training intervals)]”) by: * * * [(e)] initiating, by the one or more processors, the performance of the one or more optimal agent actions (Fadel ¶ 0121 teaches “the action with the highest value is chosen and performed by the agent 240. After the action is chosen and performed, the environment 250 reacts to the action, including possibly changing its state and giving a reward to the agent 240 [(that is, initiating, by the one or more processors, the performance of the one or more optimal agent actions)]”). 
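For illustration only (this sketch is not part of the Office Action and is not code from Fadel or the application; the function names, toy action space, and values are hypothetical), the mapping above reads Fadel's constraining functions as producing a fixed-length vector of 1/0 "Boolean flags" (an action mask), with the best agent action being the highest-valued action that the mask leaves enabled:

```python
def constraining_function(state, num_actions):
    """Hypothetical constraining function: returns an action mask, a
    fixed-length vector whose entries are 1 (enabled) or 0 (disabled) --
    the sense in which the Office Action reads "one or more Boolean flags"."""
    mask = [1] * num_actions
    if state == 0:        # e.g., action 2 is known to be invalid in state 0
        mask[2] = 0
    return mask

def best_agent_action(action_values, mask):
    """Select the highest-scoring enabled action from the action
    distribution (cf. limitations (d.1)(i)-(ii)): ignore masked-out
    actions and return the index of the largest remaining value."""
    enabled = [(value, action) for action, (value, flag)
               in enumerate(zip(action_values, mask)) if flag == 1]
    return max(enabled)[1]

values = [0.4, 0.9, 1.7, 0.1]              # action reward values
mask = constraining_function(state=0, num_actions=4)
print(best_agent_action(values, mask))      # 1 (action 2 is masked out)
```

With no mask applied, the same routine would select action 2 (value 1.7); the mask changes the outcome, which is the sense in which the prediction output is "based on the one or more Boolean flags."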
Though Fadel teaches that a tutor guides an agent to make informed decisions during training of a reinforcement learning model, Fadel, however, does not explicitly teach- * * * [(d.1)(iii) the predictive software agent machine learning model is trained over one or more training intervals by:] [(d.1)(iii)](a) retrieving training agent interaction data comprising one or more training agent interactions associated with respective one or more training agent actions, the training agent interaction data recorded over a selected period of time for a respective training interval, [(d.1)(iii)](b) generating an action reward value for each of the one or more training agent interactions by applying a reward function to the one or more training agent interactions, [(d.1)(iii)](c) for each of one or more target client computing entities, combining selected ones of the one or more training agent interactions associated with the entity into a historical episode, [(d.1)(iii)](d) storing the historical episode to a historical database comprising a plurality of historical episodes, [(d.1)(iii)](e) generating a plurality of historical episode combinations, each of the plurality of historical episode combinations comprising N historical episodes and at least one most recent historical episode selected from the plurality of historical episodes, and [(d.1)(iii)](f) determining one or more of the plurality of historical episode combinations associated with optimal model parameters; and * * * But Guo teaches – * * * [(d.1)(iii) the predictive software agent machine learning model is trained over one or more training intervals by:] [(d.1)(iii)](a) retrieving training agent interaction data (Guo ¶ 0060 teaches “[c]ollected interaction data may be stored and resampled in updating the models [(that is, to “update” is retrieving training agent interaction data)]”) comprising one or more training agent interactions associated with respective one or more training agent actions (Guo ¶ 0030 
teaches “[a] machine learning model, for example, a neural network 114 is trained, based on observations and templates, to compute a template-aware observation representation and find a best fit response [(that is, one or more training agent interactions associated with respective one or more training agent actions)]”), the training agent interaction data recorded over a selected period of time for a respective training interval (Guo ¶ 0029 teaches “the number or window of past historical observations to search from which to retrieve the relevant historical observations can be predefined [(that is, the training agent interaction data recorded over a selected period of time for a respective training interval)]”), [(d.1)(iii)](b) generating an action reward value for each of the one or more training agent interactions (Guo ¶ 0031 teaches a “[r]eward for the selected action 120 can be fed back to the neural network 114 as a feedback, based on which the neural network 114 retrains itself [(that is, generating an action reward value for each of the one or more training agent interactions)]”) by applying a reward function to the one or more training agent interactions (Guo ¶ 0038 teaches “the task of the game playing can be formulated to generate a textual action command per step as to maximize the expected cumulative discounted rewards [equation image omitted] [(that is, by applying a reward function to the one or more training agent interactions)]”), [(d.1)(iii)](c) for each of one or more target client computing entities, combining selected ones of the one or more training agent interactions associated with the entity into a historical episode (Guo ¶ 0024 & Fig. 
1 teaches “action value prediction (e.g., predicting the long-term rewards of taking an action) includes generating and scoring a compositional action structure by finding supporting evidence from an observation [Examiner annotations in dashed-line text boxes; image of Fig. 1 omitted]”: Guo, Abstract, teaches “[e]ntities in the current observation are extracted [(that is, for each of one or more target client computing entities)]. A relevant historical observation is retrieved, which has at least one of the entities in common with the current observation. The current observation and the relevant historical observation are combined as observations [(that is, for each of one or more target client computing entities, combining selected ones of the one or more training agent interactions associated with the entity into a historical episode)]”; further, Guo ¶ 0031 teaches “[t]he neural network 114, for example, takes as input the observations 110 and a template [action set] list 112 [(that is, the “observations” and “template” form the one or more training agent interactions)]”), [(d.1)(iii)](d) storing the historical episode to a historical database comprising a plurality of historical episodes (Guo ¶ 0029 teaches “Object-based past observation retrieval 108 retrieves potential relevant historical observations and associated responses. For example, the processor retrieves from a database or a data store that stores historical observations 106 [(that is, storing the historical episode to a historical database comprising a plurality of historical episodes)]”), [(d.1)(iii)](e) generating a plurality of historical episode combinations, each of the plurality of historical episode combinations comprising N historical episodes and at least one most recent historical episode (Guo Fig. 1 teaches observation0 and action0 [(that is, at least one most recent historical episode)]) selected from the plurality of historical episodes (Guo ¶ 0029 & Fig. 
1 teaches “[o]bject-based past observation retrieval 108 retrieves potential relevant historical observations and associated responses. For example, the processor retrieves from a database or a data store that stores historical observations 106 [(that is, N historical episodes)], those historical observations that have the entities extracted at 104 from the given observation 102. In an embodiment, the number or window of past historical observations to search from which to retrieve the relevant historical observations can be predefined. The current observation and the retrieved historical observations (e.g., concatenated observations) are shown at 110. Both the current observation and the retrieved historical observations can be input to a neural network model or a machine learning model as observations or observation text [(that is, generating a plurality of historical episode combinations, each of the plurality of historical episode combinations comprising N historical episodes and at least one most recent historical episode selected from the plurality of historical episodes)]”), and [(d.1)(iii)](f) determining one or more of the plurality of historical episode combinations associated with optimal model parameters (Guo ¶ 0030 teaches “a neural network 114 is trained, based on observations and templates, to compute a template-aware observation representation and find a best fit response [(that is, “best fit response” pertains to optimal model parameters)]”; further, Guo ¶ 0056 teaches applying “a deep learning, e.g., the Deep Q-Network (DQN) to update the parameters θ of the RC-based action prediction model [114] [(that is, determining one or more of the plurality of historical episode combinations associated with optimal model parameters)]”); and * * * Fadel and Guo are from the same or similar field of endeavor. Fadel teaches a method for reinforcement machine learning that uses a reinforcement learning system that has an environment and an agent. 
Guo teaches a neural network that projects a response representation that can be learned via reinforcement learning. Thus, it would have been obvious to a person having ordinary skill in the art as of the effective filing date of the Applicant’s claimed invention to modify Fadel pertaining to reinforcement machine learning with the learnable response representation of Guo. The motivation to do so is to “provide for understanding of natural language and generating accurate responses or actions in an efficient manner. The system and/or method in an embodiment may implement Multi-Passage Reading Comprehension (MPRC) and harness MPRC techniques to solve the huge action space and partial observability challenges.” (Guo ¶ 0023). Regarding claims 2 and 10, the combination of Fadel and Guo teaches all of the limitations of claims 1 and 9, respectively, as described above in detail. Fadel teaches - wherein the predictive software agent machine learning model comprises a reinforcement learning machine learning model (Fadel, Abstract, teaches a “method for reinforcement machine learning uses a reinforcement learning system that has an environment and an agent [(that is, the predictive software agent machine learning model comprises a reinforcement learning machine learning model)]”). Regarding claims 3, 11, and 17, the combination of Fadel and Guo teaches all of the limitations of claims 1, 9, and 16, respectively, as described above in detail. Guo teaches - further comprising: [(f)] monitoring one or more agent interactions associated with the performance of the one or more prediction-based actions (Guo ¶ 0024 teaches “[a]n action value prediction (e.g., predicting the long-term rewards of taking an action) includes generating and scoring a compositional action structure by finding supporting evidence from an observation [(that is, “supporting evidence from an observation” is monitoring one or more agent interactions)]. 
In an aspect, each action is an instantiation of a template, e.g., verb phrase with a few placeholders of object arguments it takes, e.g., shown at 112 [(that is, monitoring one or more agent interactions associated with the performance of the one or more prediction-based actions)]”); and [(g)] generating the training agent interaction data based on the monitored one or more agent interactions (Guo ¶ 0024 teaches “the action generation process can be viewed as extracting objects for a template's placeholders from the textual observation, based on the interaction between the template verb phrase and the relevant context of the objects in the observation [(that is, “relevant context” is based on the monitored one or more agent interactions)]”; Guo ¶ 0030 teaches a “machine learning model, for example, a neural network 114 is trained, based on observations and templates, to compute a template-aware observation representation and find a best fit response [(that is, generating the training agent interaction data)]”). Regarding claim 4, the combination of Fadel and Guo teaches all of the limitations of claim 1 as described above in detail. Fadel teaches - [(f)] filtering agent actions from the action distribution (Fadel ¶ 0039 teaches a “constraining function being configured to take as its input the current state and to return an action mask indicating which of the actions [(that is, action distribution)] are enabled or disabled [(that is, the “action mask” is filtering agent actions from the action distribution)]”) based on the Boolean flags (Fadel ¶ 0110 teaches “[e]ach constraining function ƒi(⋅) takes as its input the state of the environment 250 at time step t, and outputs an action mask . . . in the form of a vector with an entry for each action in the action space, where the value of each entry is either a 1 or 0 [(that is, an “action mask” is based on the Boolean flags)]”). 
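For illustration only (this sketch is not from the cited references or the application; the channel names, reward values, and flag dictionary below are hypothetical), the claim 4 limitation of filtering agent actions from the action distribution based on the Boolean flags can be sketched as a simple per-action filter:

```python
def filter_actions(action_distribution, flags):
    """Hypothetical illustration of limitation (f): drop any agent action
    whose Boolean flag is False, keeping the remaining (action, action
    reward value) pairs of the action distribution."""
    return [(action, reward) for (action, reward) in action_distribution
            if flags.get(action, True)]

# Toy action distribution for a cross-channel communication setting
dist = [("email", 0.7), ("sms", 0.9), ("call", 0.4)]
flags = {"sms": False}               # e.g., the entity has opted out of SMS
print(filter_actions(dist, flags))   # [('email', 0.7), ('call', 0.4)]
```

Note that "sms" is excluded even though it carries the highest reward value, illustrating how a Boolean flag constrains the action space before the best action is selected.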
Regarding claims 5, 12, and 18, the combination of Fadel and Guo teaches all of the limitations of claims 1, 9, and 16, respectively, as described above in detail. Guo teaches - [(c)] wherein transforming the state representation into one or more sequence embeddings further comprises: [(c.1)] tokenizing the historical event sequence data (Guo, Fig. 2, teaches a model architecture of a reading comprehension-based action prediction model [Examiner annotations in dashed-line text boxes; image of Fig. 2 omitted]: Guo ¶ 0045 teaches “tokenize the observation 234 and the verb phrase 236 into words shown at 202, 204, then embed these words into word vector representation, for example, using embeddings 206, 208 such as pre-trained GloVe embeddings. GloVe (Global Vectors for Word Representation) is an algorithm that generates word embeddings by aggregating global word-word co-occurrence matrix from a corpus”; Guo ¶¶ 0052-53 teaches “Past Observation Retrieval [(that is, historical)] . . . . [O]bservations from different time steps . . . are separated by a special token [(that is, tokenizing the historical event sequence data)]”); and [(c.2)] normalizing the tokenized historical event sequence data into the fixed-length vectors (Guo ¶¶ 0044-45 teaches “Observation and Verb Representations . . . . A shared encoder block shown at 210, 212 that includes layer normalization (e.g., Layer-Norm) 214 and a neural network (e.g., Bidirectional gated recurrent unit (GRU)) 216, processes the observation and verb word embeddings to obtain the separate observation and verb representation. . . . 
Briefly, layer normalization normalizes internal layer features of a neural network, and it stabilizes the neural network training and substantially reduces the training time [(that is, normalizing the tokenized historical event sequence data into the fixed length vectors)]”; [Examiner notes that a vector inherently has a fixed length, and accordingly, the broadest reasonable interpretation of the term “fixed length vectors” covers the teachings of Fadel (see above), which is not inconsistent with the Applicant’s disclosure. (MPEP § 2111). Further, an inherent “internal layer feature of a neural network” includes a data size, such as a “fixed vector length”]). Regarding claims 6 and 13, the combination of Fadel and Guo teaches all of the limitations of claims 1 and 9, respectively, as described above in detail. Fadel teaches - wherein the action distribution comprises a set of one or more agent actions associated with respective action reward values (Fadel ¶ 0004 teaches that “[t]o control the environment, the agent can perform a set of actions [(that is, the action distribution)] that may alter the state of this environment. For each action performed, the agent observes the change in the environment's state and a numerical signal, usually called a reward, that indicates if the action performed moved the agent closer or further to the completion of its goal [(that is, a set of one or more agent actions associated with respective action reward values)]”). Regarding claims 7, 14, and 19, the combination of Fadel and Guo teaches all of the limitations of claims 1, 9, and 16, respectively, as described above in detail. Fadel teaches - wherein the predictive software agent machine learning model comprises a deep Q network (in relation to Fig. 
4, Fadel ¶ 0171 teaches “[a]n embodiment ofTutor4RL was implemented by modifying the Deep Q-Networks (DQN) agent (see Mnih), using the library Keras-RL (see Plappert et al, keras-rl (2016) available at github.com, the entire contents of which is hereby incorporated by reference herein) along with Tensorflow [(that is, the predictive software agent machine learning model comprises a deep Q network)]”). Regarding claims 8, 15, and 20, the combination of Fadel and Guo teaches all of the limitations of claims 6, 14, and 19, respectively, as described above in detail. Fadel teaches - wherein the deep Q network comprises an exploration phase (Fadel ¶ 0090 teaches “Embodiments also relate to reinforcement learning exploration [(that is, the deep Q network comprises an exploration phase)]”) and an exploitation phase (Fadel ¶ 0091 teaches “a [weakly supervised reinforcement learning (WSLR)] model is deployed by modifying and existing RL model such that it can incorporate and exploit knowledge functions [(that is, the deep-Q network comprises . . . an exploitation phase)]”). Conclusion 10. The prior art made of record and not relied upon is considered pertinent to Applicant's disclosure: (Zahavy et al., "Learn What Not to Learn: Action Elimination with Deep Reinforcement Learning," arXiv (2019)) teaches an Action-Elimination Deep Q-Network (AE-DQN) architecture that combines a Deep RL algorithm with an Action Elimination Network (AEN) that eliminates sub-optimal actions. (Sultana et al., "Reinforcement Learning for Multi-Product Multi-Node Inventory Management in Supply Chains," arXiv (2020)) teaches a novel formulation in a multi-agent (hierarchical) reinforcement learning framework that can be used for parallelised decision-making, and use the advantage actor critic (A2C) algorithm with quantised action spaces to solve the problem. (US Published Application 20210174240 to Chakraborti et al.) teaches reinforcement learning software agents enhanced by external data. 
A reinforcement learning model supporting the software agent may be trained based on information obtained from one or more knowledge stores, such as online forums. (US Published Application 20220164636 to Fadaie et al.) teaches a pre-determined constraint on user actions. A constraint vector is generated based on the pre-determined constraint converted into a legal action mask. 11. Any inquiry concerning this communication or earlier communications from the Examiner should be directed to KEVIN L. SMITH whose telephone number is (571) 272-5964. Normally, the Examiner is available on Monday-Thursday 0730-1730. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, Applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the Examiner by telephone are unsuccessful, the Examiner’s supervisor, KAKALI CHAKI can be reached on 571-272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /K.L.S./ Examiner, Art Unit 2122 /KAKALI CHAKI/Supervisory Patent Examiner, Art Unit 2122
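The "exploration phase" and "exploitation phase" at issue in the rejection of claims 8, 15, and 20 correspond to the standard epsilon-greedy selection rule used by DQN agents of the kind Fadel's Tutor4RL modifies. The sketch below is illustrative only: the function names and the linear decay schedule are assumptions for exposition and appear in neither Fadel nor Guo.

```python
import random

def select_action(q_values, epsilon):
    """Epsilon-greedy action selection, the rule behind DQN's two phases.

    With probability `epsilon` the agent explores (uniform random action);
    otherwise it exploits (argmax over the learned Q-values).
    """
    if random.random() < epsilon:
        return random.randrange(len(q_values))          # exploration phase
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploitation phase

def epsilon_schedule(step, start=1.0, end=0.05, decay_steps=10_000):
    """Linearly anneal epsilon so training shifts from exploration to exploitation."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)
```

Early in training `epsilon_schedule` returns values near 1.0, so almost every action is exploratory; as it decays toward 0.05, the agent increasingly exploits the Q-values it has learned.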

Prosecution Timeline

Feb 14, 2023
Application Filed
Feb 04, 2026
Non-Final Rejection — §101, §103
Mar 11, 2026
Interview Requested
Mar 25, 2026
Applicant Interview (Telephonic)
Mar 26, 2026
Examiner Interview Summary

Precedent Cases

Applications with similar technology granted by this examiner

Patent 12591815
METHOD AND SYSTEM FOR UPDATING MACHINE LEARNING BASED CLASSIFIERS FOR RECONFIGURABLE SENSORS
Granted Mar 31, 2026 (2y 5m to grant)
Patent 12585917
REINFORCEMENT LEARNING USING ADVANTAGE ESTIMATES
Granted Mar 24, 2026 (2y 5m to grant)
Patent 12547759
PRIVACY PRESERVING MACHINE LEARNING MODEL TRAINING
Granted Feb 10, 2026 (2y 5m to grant)
Patent 12530613
SYSTEMS AND METHODS FOR PERFORMING QUANTUM EVOLUTION IN QUANTUM COMPUTATION
Granted Jan 20, 2026 (2y 5m to grant)
Patent 12518214
DISTRIBUTED MACHINE LEARNING SYSTEMS INCLUDING GENERATION OF SYNTHETIC DATA
Granted Jan 06, 2026 (2y 5m to grant)
Study what changed to get past this examiner. Based on the 5 most recent grants.


Prosecution Projections

1-2
Expected OA Rounds
37%
Grant Probability
55%
With Interview (+18.0%)
4y 8m
Median Time to Grant
Low
PTA Risk
Based on 134 resolved cases by this examiner. Grant probability derived from career allow rate.
