Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
1. This Office action is in response to the RCE filed on 12/15/2025. Claims 1-20 are pending and have been considered below.
Claim Rejections - 35 USC § 101
2. 35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1: Claims 1-10 recite a method. Claims 11-19 recite a non-transitory computer readable medium. Claim 20 recites a system. Therefore, claims 1-10 are directed to a process, claims 11-19 are directed to an article of manufacture, and claim 20 is directed to a machine.
With respect to claims 1, 11, and 20
2A Prong 1: the claim(s) recites a judicial exception.
- generating action scores for actions that can be performed within the environment using the teacher model (Mental Process – creating a set of values, e.g., importance or effectiveness, for each action, which can be performed in the human mind or with pen and paper.)
2A Prong 2: The additional elements recited in the claim do not integrate the abstract idea into a practical application, individually or in combination.
Additional Elements:
- training a teacher model on an environment (mere instructions to apply an exception – see MPEP 2106.05(f))
- training a student model using pruned states of the environment (mere instructions to apply an exception – see MPEP 2106.05(f))
- retraining the student model using labels from the teacher model and the teacher action scores with temperature annealing (mere instructions to apply an exception – see MPEP 2106.05(f))
- (Claim 11) A computer program product for machine learning, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a hardware processor to cause the hardware processor to (mere instructions to apply an exception – see MPEP 2106.05(f))
- (Claim 20) A machine learning system, comprising: a hardware processor; and a memory that stores a computer program, which, when executed by the hardware processor, causes the hardware processor to: (mere instructions to apply an exception – see MPEP 2106.05(f))
2B: The claim(s) does not include additional elements that are sufficient to amount to significantly more than the judicial exception.
Additional Elements:
- training a teacher model on an environment is analogous to having a computer iteratively complete a task/calculation (mere instructions to apply an exception – see MPEP 2106.05(f))
- training a student model using pruned states of the environment is analogous to having a computer iteratively complete a task/calculation (mere instructions to apply an exception – see MPEP 2106.05(f))
- retraining the student model using labels from the teacher model and the teacher action scores with temperature annealing is analogous to having a computer iteratively complete a task/calculation (mere instructions to apply an exception – see MPEP 2106.05(f))
- (Claim 11) A computer program product for machine learning, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a hardware processor to cause the hardware processor to (mere instructions to apply an exception – see MPEP 2106.05(f))
- (Claim 20) A machine learning system, comprising: a hardware processor; and a memory that stores a computer program, which, when executed by the hardware processor, causes the hardware processor to: (mere instructions to apply an exception – see MPEP 2106.05(f))
Therefore, the claim(s) are ineligible.
With respect to claims 2 and 12
2A Prong 1: the claim(s) recites a judicial exception.
- extracting unpruned states from the environment (Mental Process – using the environment as a baseline source of data, which can be performed in the human mind or with pen and paper.)
Therefore, the claim(s) is/are ineligible.
With respect to claims 3 and 13
2A Prong 1: the claim(s) recites a judicial exception.
- truncating the unpruned states to generate pruned states (Mental Process – eliminating unwanted information, which can be performed in the human mind or with pen and paper.)
Therefore, the claim(s) is/are ineligible.
With respect to claims 4 and 14
2A Prong 1: the claim(s) recites a judicial exception.
- the pruned states include verb-noun pairs (Mental Process – recognizing language structure (verb-noun commands), which can be performed in the human mind or with pen and paper.)
Therefore, the claim(s) is/are ineligible.
With respect to claims 5 and 15
2A Prong 1: the claim(s) recites a judicial exception.
- retraining includes a distillation loss that uses temperature annealing and a student loss (Mathematical Concept – probabilities and data set reduction, which can be performed in the human mind or with pen and paper.)
Therefore, the claim(s) is/are ineligible.
With respect to claims 6 and 16
2A Prong 1: the claim(s) recites a judicial exception.
- training the teacher model includes determining weight values of the student model that minimize a loss function (Mathematical Concept – probabilities, which can be performed in the human mind or with pen and paper.)
- the loss function includes a policy function that predicts a reward for performing an action given a present state and an expectation value for a state-action pair (Mathematical Concept – the loss function is a calculation using a formula given in the specification, which can be performed in the human mind or with pen and paper.)
Therefore, the claim(s) is/are ineligible.
With respect to claims 7 and 17
2A Prong 2: The additional elements recited in the claim do not integrate the abstract idea into a practical application, individually or in combination.
Additional Elements:
- the recitation that the teacher model includes a series of long short-term memory neural network layers is considered a field of use limitation. (See MPEP 2106.05(h).)
2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.
Additional Elements:
- the recitation that the teacher model includes a series of long short-term memory neural network layers is considered a field of use limitation. (See MPEP 2106.05(h).)
Therefore, the claim(s) is/are ineligible.
With respect to claims 8 and 18
2A Prong 2: The additional elements recited in the claim do not integrate the abstract idea into a practical application, individually or in combination.
Additional Elements:
- the recitation that the student model has a same neural network structure as the teacher model (a series of long short-term memory neural network layers) is considered a field of use limitation. (See MPEP 2106.05(h).)
2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.
Additional Elements:
- the recitation that the student model has a same neural network structure as the teacher model (a series of long short-term memory neural network layers) is considered a field of use limitation. (See MPEP 2106.05(h).)
Therefore, the claim(s) is/are ineligible.
With respect to claims 9 and 19
2A Prong 2: The additional elements recited in the claim do not integrate the abstract idea into a practical application, individually or in combination.
Additional Elements:
- the recitation that the environment is a text game and the actions include commands that an agent can perform within the text game is considered a field of use limitation. (See MPEP 2106.05(h).)
2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.
Additional Elements:
- the recitation that the environment is a text game and the actions include commands that an agent can perform within the text game is considered a field of use limitation. (See MPEP 2106.05(h).)
Therefore, the claim(s) is/are ineligible.
With respect to claim 10
2A Prong 2: The additional elements recited in the claim do not integrate the abstract idea into a practical application, individually or in combination.
Additional Elements:
- navigating through a new environment using actions determined by the retrained student model (mere instructions to apply an exception – see MPEP 2106.05(f))
2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.
Additional Elements:
- navigating through a new environment using actions determined by the retrained student model (mere instructions to apply an exception – see MPEP 2106.05(f))
Therefore, the claim is ineligible.
Claim Rejections - 35 USC § 102
3. The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale or otherwise available to the public before the effective filing date of the claimed invention.
Claims 1-4, 6-14, and 16-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Chaudhury et al., "Bootstrapped Q-learning with Context Relevant Observation Pruning to Generalize in Text-based Games" (hereinafter "Chaudhury"). Claims 5 and 15 are addressed under 35 U.S.C. 103 below.
Regarding claim 1, Chaudhury teaches:
training a teacher model on an environment ([Page 1, Column 2, Paragraph 1] teaches CREST, which first trains an overfitted base model on the original observation text in training games using Q-learning.)
generating action scores for actions that can be performed within the environment using the teacher model. ([Page 3, Column 2, Paragraph 2] teaches Token Relevance Distribution (TRD): We run inference on the overfitted base model for each training game (indexed by k) and aggregate all the action tokens issued for that particular game as the Episodic Action Token Aggregation (EATA).)
training a student model using pruned states of the environment. ([Page 1, Column 2, Paragraph 1] teaches first trains an overfitted base model on the original observation text in training games, apply observation pruning, then re-train a bootstrapped policy on the pruned observation text.)
retraining the student model ([Page 1, Column 1, Paragraph 1] teaches the base model's action token distribution is used to perform observation pruning that removes irrelevant tokens; a second bootstrapped model is then retrained on the pruned observation text.) using labels ([Page 1, Column 2, Paragraph 1] teaches subsequently, we apply observation pruning such that, for each episode of the training games, we remove the observation tokens that are not semantically related to the base policy's action tokens.) from the teacher model and the teacher action scores ([Page 1, Figure 1] teaches Our method retains context-relevant tokens from the observation text (shown in green) while pruning irrelevant tokens (shown in red). A second policy network re-trained on the pruned observations generalizes better by avoiding overfitting to unwanted tokens.) with temperature annealing ([Page 7, Column 1, Paragraph 2] teaches we trained each environment for 6000 epochs with annealing of 3600 epochs from a starting value of 1.0 to 0.2.). Temperature annealing is defined in the specification of the instant application ([0032]: Temperature annealing, in a machine learning context, changes the sharpness of a probability distribution, thereby making it sharper for temperature values that are less than 1.0 and flatter for temperature values that are greater than 1.0.)
[media_image1.png: Figure 1 of Chaudhury]
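For context only, the temperature-scaling behavior defined in paragraph [0032] and the annealing schedule Chaudhury reports (a starting value of 1.0 annealed to 0.2 over 3600 epochs) can be sketched minimally in Python as follows. All function and variable names are illustrative; this is not asserted to be either party's actual implementation.

    import numpy as np

    def softmax_with_temperature(logits, temperature):
        # T < 1.0 sharpens the distribution; T > 1.0 flattens it
        # (cf. instant specification, paragraph [0032]).
        scaled = np.asarray(logits, dtype=float) / temperature
        scaled -= scaled.max()  # numerical stability
        exp = np.exp(scaled)
        return exp / exp.sum()

    def annealed_temperature(epoch, start=1.0, end=0.2, anneal_epochs=3600):
        # Linear schedule from `start` down to `end` over `anneal_epochs`,
        # matching the settings Chaudhury reports (Page 7).
        frac = min(epoch / anneal_epochs, 1.0)
        return start + frac * (end - start)

    action_scores = [2.0, 1.0, 0.5]
    print(softmax_with_temperature(action_scores, annealed_temperature(0)))     # flatter targets early
    print(softmax_with_temperature(action_scores, annealed_temperature(3600)))  # sharper targets late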
Regarding claim 2, the claim recites limitations similar to those of claim 1 and is rejected for similar reasons, using similar teachings and rationale. Chaudhury also teaches:
the teacher model includes extracting unpruned states from the environment. ([Page 1, Column 2, Paragraph 1] teaches To alleviate this problem, we propose CREST, which first trains an overfitted (unpruned) base (teacher) model on the original observation text in training games (environment) using Q-learning.)
Regarding claim 3, the claim recites limitations similar to those of claim 1 and is rejected for similar reasons, using similar teachings and rationale. Chaudhury also teaches:
truncating the unpruned states to generate pruned states. ([Page 1, Column 2, Paragraph 1] teaches we propose CREST, which first trains an overfitted (unpruned) base model on the original observation text in training games using Q-learning. Subsequently, we apply observation pruning (truncating) such that, for each episode of the training games, we remove the observation tokens that are not semantically related to the base policy's action tokens.)
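As an illustration only, the observation pruning (truncating) described above amounts to a hard threshold over per-token relevance scores. The minimal Python sketch below assumes the relevance scores have already been computed from semantic relatedness to the base policy's action tokens; all names and values are illustrative.

    def prune_observation(tokens, relevance_scores, threshold=0.5):
        # Hard mask: keep only tokens whose relevance to the base policy's
        # action tokens meets the threshold; the rest are removed.
        return [tok for tok, score in zip(tokens, relevance_scores) if score >= threshold]

    # Illustrative values only; in CREST the scores would come from semantic
    # relatedness between observation tokens and aggregated action tokens.
    tokens = ["you", "see", "a", "rusty", "key", "and", "an", "old", "painting"]
    scores = [0.05, 0.10, 0.05, 0.80, 0.95, 0.05, 0.05, 0.20, 0.15]
    print(prune_observation(tokens, scores))  # -> ['rusty', 'key']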
Regarding claim 4, the claim recites limitations similar to those of claim 1 and is rejected for similar reasons, using similar teachings and rationale. Chaudhury also teaches:
the pruned states include verb-noun pairs. ([Page 1, Column 2, Paragraph 1] teaches we re-train a bootstrapped policy on the pruned observation text using Q-learning that improves generalization by removing irrelevant tokens. Figure 1 shows an illustrative example of our method. [Page 2, Column 2, Paragraph 1] teaches The action consists of a combination of verb and object output, such as "go north", "take coin", etc.)
[media_image2.png]
Regarding claim 6, the claim recites limitations similar to those of claim 1 and is rejected for similar reasons, using similar teachings and rationale. Chaudhury also teaches:
training the teacher model includes determining weight values of the student model that minimize a loss function. ([Page 3, Column 2, Paragraph 3] teaches Token Relevance Distribution (TRD): We run inference on the overfitted base model for each training game (indexed by k) and aggregate all the action tokens issued for that particular game as the Episodic Action Token Aggregation (EATA), A_k. For each token w_i in a given observation text o_t^k at step t for the k-th game, we compute the Token Relevance Distribution (TRD) C. ... This relevance score is used to prune irrelevant tokens in the observation text by creating a hard attention mask using a threshold value.)
the loss function includes a policy function that predicts a reward for performing an action given a present state and an expectation value for a state-action pair. ([Page 3, Column 1, Paragraph 1] teaches The parameters of the model are updated by optimizing the following loss function obtained from the Bellman equation (Sutton et al., 1998), where Q(s, a) (where Q is the expectation value, s is the present state, and a is the action) is obtained as the average of verb and object Q-values, and the discount factor is in (0, 1). The agent is given a reward of 1 from the environment on completing the objective.)
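For reference, a temporal-difference loss obtained from the Bellman equation is conventionally written (reconstructed here in standard Q-learning notation; Chaudhury's exact symbols may differ) as:

    \mathcal{L}(\theta) = \mathbb{E}\left[ \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)^{2} \right]

where s is the present state, a is the action taken, r is the reward, s' is the next state, and \gamma \in (0, 1) is the discount factor.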
Regarding claim 7, the claim recites limitations similar to those of claim 1 and is rejected for similar reasons, using similar teachings and rationale. Chaudhury also teaches:
teacher model includes a series of long short-term memory neural network layers. ([Page 7, Column 1, Paragraph 2] teaches We use a single LSTM network (teacher) with 100-dimensional hidden units (series of LSTM network layers) in the representation generator. For the action scorer, a single LSTM network with a 64-dimensional hidden unit (for DRQN) and two MLPs for verb and object Q-values were used. The number of trainable parameters in our policy network is 128,628 for the model with attention and 125,364 without attention.)
Regarding claim 8, the claim recites limitations similar to those of claim 1 and is rejected for similar reasons, using similar teachings and rationale. Chaudhury also teaches:
the student model has a same neural network structure as the teacher model. ([Page 3, Column 2, Paragraph 3] teaches the bootstrapped model is trained on the pruned observation text by removing irrelevant tokens using TRDs. The same model architecture and training methods as the base model are used.)
Regarding claim 9, the claim recites limitations similar to those of claim 1 and is rejected for similar reasons, using similar teachings and rationale. Chaudhury also teaches:
The environment is a text game and the actions include commands that an agent can perform within the text game. ([Page 1, Column 1, Paragraph 2] teaches to interact with the environment, the agent issues text-based action commands ("go west") upon which it receives a reward signal used for training the RL agent.)
Regarding claim 10, the claim recites limitations similar to those of claim 1 and is rejected for similar reasons, using similar teachings and rationale. Chaudhury also teaches:
Navigating through a new environment using actions determined by the retrained student model. ([Page 1, Figure 1] teaches Our method retains context-relevant tokens from the observation text (shown in green) while pruning irrelevant tokens (shown in red). A second policy network re-trained on the pruned observations generalizes better by avoiding overfitting to unwanted tokens.)
[media_image1.png: Figure 1 of Chaudhury]
Regarding claim 11, the claim recites limitations similar to those of claim 1 and is rejected for similar reasons, using similar teachings and rationale. Chaudhury also teaches:
A computer program product for machine learning, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a hardware processor. ([Page 7, Column 1, Paragraph 2] teaches Our experiments were conducted on an Ubuntu 16.04 system with a Titan X (Pascal) GPU. (Where Ubuntu 16.04 requires a processor, RAM, and hard drive space to operate.) [Page 6, Column 2, Paragraph 1] teaches We used the TextWorld (Cote et al., 2018) framework for evaluating our model's generalization ability on text-based games. (Where Microsoft TextWorld is an open-source, extensible engine that both generates and simulates text games; it can be used to train reinforcement learning (RL) agents to learn skills such as language understanding and grounding, combined with sequential decision making.))
Regarding claim 12, the claim recites limitations similar to those of claim 2 and is rejected for similar reasons, using similar teachings and rationale. Claim 12 also inherits the deficiencies of claim 11, from which it depends, and is rejected using similar teachings and rationale.
Regarding claim 13, the claim recites limitations similar to those of claim 3 and is rejected for similar reasons, using similar teachings and rationale. Claim 13 also inherits the deficiencies of claim 11, from which it depends, and is rejected using similar teachings and rationale.
Regarding claim 14, the claim recites limitations similar to those of claim 4 and is rejected for similar reasons, using similar teachings and rationale. Claim 14 also inherits the deficiencies of claim 11, from which it depends, and is rejected using similar teachings and rationale.
Regarding claim 16, the claim recites limitations similar to those of claim 6 and is rejected for similar reasons, using similar teachings and rationale. Claim 16 also inherits the deficiencies of claim 11, from which it depends, and is rejected using similar teachings and rationale.
Regarding claim 17, the claim recites limitations similar to those of claim 7 and is rejected for similar reasons, using similar teachings and rationale. Claim 17 also inherits the deficiencies of claim 11, from which it depends, and is rejected using similar teachings and rationale.
Regarding claim 18, the claim recites limitations similar to those of claim 8 and is rejected for similar reasons, using similar teachings and rationale. Claim 18 also inherits the deficiencies of claim 11, from which it depends, and is rejected using similar teachings and rationale.
Regarding claim 19, the claim recites limitations similar to those of claim 9 and is rejected for similar reasons, using similar teachings and rationale. Claim 19 also inherits the deficiencies of claim 11, from which it depends, and is rejected using similar teachings and rationale.
Regarding claim 20, the claim recites limitations similar to those of claim 1 and is rejected for similar reasons, using similar teachings and rationale. Chaudhury also teaches:
A machine learning system, comprising: a hardware processor; and a memory that stores a computer program. ([Page 7, Column 1, Paragraph 2] teaches Our experiments were conducted on an Ubuntu 16.04 system with a Titan X (Pascal) GPU. (Where Ubuntu 16.04 requires a processor, RAM, and hard drive space to operate.))
Claim Rejections - 35 USC § 103
4. The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 5 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Chaudhury et al., "Bootstrapped Q-learning with Context Relevant Observation Pruning to Generalize in Text-based Games" (hereinafter "Chaudhury"), in view of ALISTARH (US 2020/0104717).
Regarding claim 5, the claim recites limitations similar to those of claim 1 and is rejected for similar reasons, using similar teachings and rationale. Chaudhury does not teach:
wherein retraining includes a distillation loss that uses temperature annealing. However, ALISTARH discloses wherein retraining includes a distillation loss that uses temperature annealing ([0038], [0041], [0047]-[0048]). Therefore, it would have been obvious to one of ordinary skill in the art, at or before the effective filing date of the instant application, to use this feature of ALISTARH in Chaudhury. One would have been motivated to do so to reduce the amount of computation and thereby reduce running times.
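As an illustration only, a distillation loss that uses temperature annealing is commonly formulated in the well-known knowledge-distillation style sketched below in Python. All names are illustrative, and this sketch is not asserted to be ALISTARH's specific disclosure.

    import numpy as np

    def softmax(logits, T=1.0):
        z = np.asarray(logits, dtype=float) / T
        z -= z.max()  # numerical stability
        e = np.exp(z)
        return e / e.sum()

    def distillation_loss(student_logits, teacher_logits, true_label, T, alpha=0.5):
        # Soft term: cross-entropy between the temperature-softened teacher
        # and student distributions; annealing T toward a small value
        # sharpens the soft targets as retraining progresses.
        soft = -np.sum(softmax(teacher_logits, T) * np.log(softmax(student_logits, T) + 1e-12))
        # Hard term: standard cross-entropy of the student against the label
        # (the "student loss").
        hard = -np.log(softmax(student_logits)[true_label] + 1e-12)
        return alpha * soft + (1 - alpha) * hard

In the classic formulation the soft term is often additionally scaled by T^2 so that gradient magnitudes remain comparable as the temperature is annealed.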
Regarding claim 15, the claim recites limitations similar to those of claim 5 and is rejected for similar reasons, using similar teachings and rationale. Claim 15 also inherits the deficiencies of claim 11, from which it depends, and is rejected using similar teachings and rationale.
Response to Arguments
5. Applicant's arguments filed 12/15/2025 have been fully considered but they are not persuasive.
Applicants argue: "The result of the policy distillation, in the retraining of the student model, is a superior machine learning model with improved generalizability. The claims reflect this improvement in the step of retraining the student model using labels from the teacher model and the teacher action scores. As a result, both requirements for showing an improvement to a technology have been met."
In response, the Examiner respectfully disagrees and submits:
First, the argument conflates performance improvement with technological improvement. Even if the student model performs better on certain benchmarks, that does not necessarily establish an "improvement to a technology." For example, has robustness improved across distribution shifts? Is the model more interpretable, efficient, or reliable in real-world settings? Without evidence on these dimensions, the conclusion is overstated.
Second, the argument assumes that using both teacher labels and action scores meaningfully enriches the training signal. But this depends heavily on how informative and well-calibrated those scores are. If the teacher's confidence estimates are poorly calibrated, the student may learn from misleading gradients, undermining the claimed benefit.
Finally, the statement that "both requirements…have been met" is conclusory rather than analytical. It does not clearly define those requirements or show how they were rigorously evaluated. Without explicit criteria and empirical validation, the conclusion reads more like an assertion than a demonstrated result.
Applicants further argue "The features of claim 10, navigating through a new environment using actions determined by the retrained student model, further reinforce the assertion that the claims reflect the improvement that is described in the specification. The claim recites using the retrained student model in a new environment, where the retrained student model provides superior performance."
In response, the Examiner respectfully disagrees and submits:
The argument relies heavily on assertion rather than substantiation. Simply reciting that the retrained student model operates in a “new environment” with “superior performance” does not establish that such improvement actually occurs. Claim language is not evidence; without concrete metrics, experimental results, or defined evaluation criteria, the statement of superiority is conclusory.
Moreover, the notion of a "new environment" is left undefined and is therefore weak. Not all new environments meaningfully test generalization; if the environment is only trivially different from the training distribution, then performance gains do not demonstrate true robustness or technological advancement. The argument fails to show that the model handles distributional shift, which is the core challenge in generalization.
A retrained student model may be more efficient, but efficiency gains do not automatically translate into improved decision quality or navigation capability. In some cases, distilled models perform worse than their teachers in complex or unfamiliar settings.
Additionally, the argument does not establish a causal link between the claimed features (use of teacher labels and action scores during retraining) and the alleged superior performance in the new environment. Without showing that these specific features drive the improvement, as opposed to other factors such as model architecture, training data, or evaluation setup, the argument lacks technical grounding.
Finally, the reasoning is circular: it argues that the claims reflect an improvement because they state that the model achieves superior performance. This is not a demonstration of improvement but a restatement of the claim itself.
Conclusion
6. The prior art made of record and not relied upon is considered pertinent to applicant's disclosure (See PTO-892).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Phenuel S. Salomon whose telephone number is (571) 270-1699. The examiner can normally be reached on Mon-Fri 7:00 A.M. to 4:00 P.M. (Alternate Friday Off) EST.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Usmaan Saeed can be reached on (571) 272-4046. The fax phone number for the organization where this application or proceeding is assigned is 571-273-3800.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/PHENUEL S SALOMON/Primary Examiner, Art Unit 2146