DETAILED ACTION
This Office Action is responsive to arguments filed on February 19, 2026. Claims 1-20 are pending and have been examined. This action is made FINAL.
Any previous objections/rejections not mentioned in this Office Action have been withdrawn by the Examiner.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on September 16, 2025 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Response to Arguments
With regard to rejections made under 35 U.S.C. 102, Applicant argues, "Wang discloses the re-training of two models with the same architecture (A) as the original model: a first model trained on the data to be forgotten and a second model on a small set of extra data (not present in the initial training data), to fine tune the initial model using knowledge gap alignment… Wang does not disclose 'training a second auxiliary LM using a second training corpus D/F, wherein the second training corpus D/F represents the first training corpus D without the first data F,' as recited in claim 4," (emphasis original, page 10 of Remarks).
Applicant further argues, "The knowledge gap alignment uses two distinct comparisons to adjust the parameters of the model: first, a comparison of the outputs of model AD and An on input data Dn, and second a comparison of the outputs of model A and Af on input data Df, see Wang equation 2. This is not the same as 'updating the first LM based at least in part on a first prediction difference between a first prediction of the first LM and a second prediction of the second auxiliary LM for the first text input,' as recited in claim 4 (emphasis added) wherein the second auxiliary LM is a model trained on a corpus D/F (e.g., training corpus D without the first data F). As such, none of the models disclosed in Wang for the forgetting of data (e.g., An, Af, and AD) are the same as the second auxiliary model trained on D/F. Furthermore, the knowledge gap alignment disclosed in Wang does not teach 'updating the first LM based at least in part on ... a second prediction difference between the first prediction of the first LM and a third prediction of the first auxiliary LM for the first text input,' (emphasis added), as recited in claim 4, wherein the first auxiliary LM is a model trained on the full training corpus D. This is because claim 4 recites the same 'first text input' for the first LM, the first n-gram language model, and the second n-gram language model to determine the various recited predictions and prediction differences. By contrast, Wang describes input data Dn to compare the outputs of models AD and An and also the different input data Df to generate and compare the outputs of models A and Af, see Wang equation 2," (emphasis original, page 11 of Remarks).
Applicant argues that Wang fails to teach the elements necessary to achieve the claimed invention, those elements being an arrangement of learning models wherein an original model is trained by comparison to auxiliary models, each trained on respective sets of training data.
Applicant’s argument has been considered, but is not persuasive. Regarding the arrangement of auxiliary models, Wang teaches an original model A trained on a corpus D, a second model An trained on a corpus Dn, and a third model Af trained on a corpus Df. Wang further teaches that the auxiliary models An and Af may be trained “with the combination of Dn (Df) and a small fraction of Dr = D/Df…” (page 3, KGA Framework), contemplating an arrangement wherein one model may be used as an “origin teacher” model while another may be used as a “goal teacher,” the outcome being the original model A behaving similarly to its original state, with the absence of F (or Df) in its learned parameters.
Under the broadest reasonable interpretation of the claims, a model A(D) is updated by comparison to the auxiliary models a(origin) and a(goal), wherein the origin behavior serves as a reference point and the goal behavior is similar to that of the reference point with the absence of dataset F; therefore, Wang reads on the claims.
Further, Applicant’s arguments with regard to the “first text input” are not persuasive. While Wang teaches in equation (3) that the distributions of respective models are compared with inputs y and z, the broadest reasonable interpretation of the claims permits the “first text input” to comprise a set – as in a “first iteration” or “first round” of text input – that includes elements y and z. Accordingly, Wang discloses or otherwise teaches the limitations of the claimed invention, and the rejections under 35 U.S.C. 102 are maintained. Further details are provided below.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claims 4, 7, 10, 13, 16 and 19 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by "KGA: A General Machine Unlearning Framework Based on Knowledge Gap Alignment" by Wang et al. (hereinafter, "Wang").
Regarding claims 4 and 13, Wang teaches a method and system comprising: determining a first language model (LM) trained on a first training corpus D (section 4.1 KGA Framework, "Apart from data, we have model A(D) as input, which is the original model trained with data D that needs unlearning (we abbreviate it as AD in the following parts of this paper)."); determining first data F, wherein the first data F is a subset of D (section 4.1 KGA Framework, "The input data consists of previous training data D, data to be forgotten Df, and a small set of extra data Dn to assist the unlearning, where Dn ∩ D = ∅."); training a first auxiliary LM using the first training corpus D (section 4.1 KGA Framework, "To perform unlearning, we first train two models, An and Af, based on data Dn and Df, respectively."); training a second auxiliary LM using a second training corpus D/F, wherein the second training corpus D/F represents the first training corpus D without the first data F (section 4.1 KGA Framework, "To perform unlearning, we first train two models, An and Af, based on data Dn and Df, respectively. The architectures of AD, An, and Af should be the same. An (Af) can be trained with the combination of Dn (Df) and a small fraction of Dr = D \ Df or fine-tuned based on some pre-trained language models to ensure performance, as the data to be forgotten Df might be small in some scenarios."); determining a first text input (page 4, Objectives, "Pr(A)(z) is the output distribution given input z to model A, KL(a|b) measures the KL divergence between distribution a and b."); and updating the first LM based at least in part on a first prediction difference between a first prediction of the first LM and a second prediction of the second auxiliary LM for the first text input and a second prediction difference between the first prediction of the first LM and a third prediction of the first auxiliary LM for the first text input (page 4, Objectives, "In our implementation, we use KL divergence to measure the distributional distances between the output of two models. Therefore, the knowledge gap alignment objective is defined as: [L_a, Equation (3)]… The objective for maintaining performance on Dr is another KL divergence measuring output distribution of A* and AD on Dr: [L_r, Equation (4)]… The two objectives are jointly optimized during unlearning to achieve Goal 1 and 2 simultaneously. Therefore, the final objective is defined as: L = L_a + α · L_r [Equation (5)]" and "Specifically, we will first evaluate the average knowledge gap between dis_Dn(AD, An) and dis_Df(AD, Af)…").
The combination of Equations (3), (4) and (5) discloses a method wherein the output distribution differences are calculated for at least an original model and two auxiliary models, leading to an updated model whose original performance is preserved while the selected data subset F (or Df) is unlearned.
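For clarity of the record, the relationship among Equations (3), (4), and (5) may be sketched as follows. This is a minimal illustrative sketch only: the discrete output distributions, the function names, and the use of an absolute difference for the alignment term are assumptions of the illustration, not text drawn from Wang.

```python
import math

def kl(p, q):
    """KL divergence KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kga_objective(pred_star_df, pred_f_df,   # A* and Af outputs on Df
                  pred_d_dn, pred_n_dn,      # AD and An outputs on Dn
                  pred_star_dr, pred_d_dr,   # A* and AD outputs on Dr
                  alpha=0.1):
    """Combined objective L = L_a + alpha * L_r (cf. Equation (5)).

    L_a (cf. Equation (3)) aligns the knowledge gap dis_Df(A*, Af)
    with the reference gap dis_Dn(AD, An); L_r (cf. Equation (4))
    preserves performance on the retained data Dr by keeping A*
    close to the original model AD.
    """
    gap_target = kl(pred_d_dn, pred_n_dn)      # dis_Dn(AD, An), fixed
    gap_current = kl(pred_star_df, pred_f_df)  # dis_Df(A*, Af)
    l_a = abs(gap_current - gap_target)        # knowledge-gap alignment
    l_r = kl(pred_star_dr, pred_d_dr)          # retention term on Dr
    return l_a + alpha * l_r
```

Under this reading, minimizing L drives the A*/Af gap on Df toward the AD/An reference gap on Dn while keeping A* close to AD on the retained data Dr.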
Regarding claims 7 and 16, Wang teaches a method and system further comprising determining the first prediction difference using a Kullback-Leibler divergence between a first probability distribution of the first LM for the first text input and a second probability distribution of the second auxiliary LM for the first text input (page 3, Knowledge Gap Alignment, "To achieve Goal 1, the output distribution of our target model A∗ on data Df (noted as A∗ (Df)) is expected to be similar to AD(Dn), where Dn should be an external set to D but with the similar distribution," and, "dis(D)(A1, A2) indicates the difference of the output distributions between model A1 and A2 on data D, which can be evaluated by KL divergence, Bregman divergence, or any other distributional distance measurements.").
Regarding claims 10 and 19, Wang teaches a method and system further comprising: partitioning the first training corpus D into n training partitions (page 2, Exact Unlearning, "As for more recent efforts in neural model unlearning, Bourtoule et al. (2021) propose a general method called SISA to train the model by partitioning the original dataset into several non-overlapping shards first and then designing effective mechanisms to aggregate models trained with shards."); training a first plurality of auxiliary models using the n training partitions, wherein each auxiliary model of the first plurality of auxiliary models is trained using a respective one of the n training partitions (page 2, Exact Unlearning, "When handling data deletion, this method only has to retrain the models trained with the affected shards."); and generating the second auxiliary LM by aggregating the first plurality of auxiliary models (page 2, Exact Unlearning, "As for more recent efforts in neural model unlearning, Bourtoule et al. (2021) propose a general method called SISA to train the model by partitioning the original dataset into several non-overlapping shards first and then designing effective mechanisms to aggregate models trained with shards.").
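The partition/retrain/aggregate structure of the SISA scheme cited above may be sketched as follows. The unigram-count "models" and all function names are illustrative assumptions intended only to show the structure, not Bourtoule et al.'s implementation.

```python
from collections import Counter

def train_shard(shard):
    """Stand-in "model": unigram token counts over one shard."""
    return Counter(tok for doc in shard for tok in doc.split())

def aggregate(models):
    """Aggregate the per-shard models into a single model."""
    total = Counter()
    for m in models:
        total += m
    return total

def sisa_train(corpus, n):
    """Partition the corpus into n non-overlapping shards and train
    one auxiliary model per shard."""
    shards = [corpus[i::n] for i in range(n)]
    return shards, [train_shard(s) for s in shards]

def sisa_unlearn(shards, models, doc):
    """Delete `doc` by retraining only the affected shard, then
    re-aggregate; unaffected shards are left untouched."""
    for i, shard in enumerate(shards):
        if doc in shard:
            shards[i] = [d for d in shard if d != doc]
            models[i] = train_shard(shards[i])
    return aggregate(models)
```

The key property shown is that deletion touches only the shard containing the forgotten document, while the remaining shard models are reused as-is.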
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-3 are rejected under 35 U.S.C. 103 as being unpatentable over Wang in view of China Invention Application CN 116776160 to Shi (hereinafter, "Shi") and U.S. Patent Application Publication 2025/0265447 to Li et al. (hereinafter, "Li").
Regarding claim 1, Wang teaches A computer-implemented method comprising: determining a first language model (LM) trained on a first training corpus D (section 4.1 KGA Framework, "Apart from data, we have model A(D) as input, which is the original model trained with data D that needs unlearning (we abbreviate it as AD in the following parts of this paper)."); determining first data F, wherein the first data F is a subset of D, and wherein the first data F is identified as a set of data to be unlearned by the first LM (section 4.1 KGA Framework, "The input data consists of previous training data D, data to be forgotten Df, and a small set of extra data Dn to assist the unlearning, where Dn ∩ D = ∅."); training a first n-gram language model using the first training corpus D (section 4.1 KGA Framework, "To perform unlearning, we first train two models, An and Af, based on data Dn and Df, respectively. The architectures of AD, An, and Af should be the same. An (Af) can be trained with the combination of Dn (Df) and a small fraction of Dr = D \ Df or fine-tuned based on some pre-trained language models to ensure performance, as the data to be forgotten Df might be small in some scenarios."); training a second n-gram language model using a second training corpus D/F, wherein the second training corpus D/F represents the first training corpus D without the first data F (section 4.1 KGA Framework, "To perform unlearning, we first train two models, An and Af, based on data Dn and Df, respectively. The architectures of AD, An, and Af should be the same. An (Af) can be trained with the combination of Dn (Df) and a small fraction of Dr = D \ Df or fine-tuned based on some pre-trained language models to ensure performance, as the data to be forgotten Df might be small in some scenarios."); determining a first text input of the first data F (page 4, Objectives, "Pr(A)(z) is the output distribution given input z to model A, KL(a|b) measures the KL divergence between distribution a and b. y and z are from Dn and Df, respectively."); and updating the first LM to generate an updated LM based at least in part by: minimizing a first prediction difference between a first prediction of the first LM and a second prediction of the second n-gram language model for the first text input (page 4, Knowledge Gap Alignment, "For Goal 2, we maintain the ability of model A∗ when processing the remaining data, i.e., Dr. We treat the original model AD as a teacher and directly minimize the distance of output distributions when feeding samples in Dr to A∗ and AD."). Examiner interprets "prediction difference" as analogous to "distance of output distributions". The instant claim compares the output from two different models and adjusts one model such that their outputs are similar. As taught by Wang, the distance between the teacher and student models' output distributions is minimized to align their outputs on the same input data.
Wang does not explicitly teach “maximizing a second prediction difference between the first prediction of the first LM and a third prediction of the first n-gram language model for the first text input,” and thus, Shi is introduced.
Shi teaches maximizing a second prediction difference between the first prediction of the first LM and a third prediction of the first n-gram language model for the first text input (page 2, "based on the first target training direction, training to obtain a click behaviour prediction model, the click behaviour prediction model is used for predicting the probability that the object clicks on the content to be recommended, the first target training direction is to minimize the similarity difference between the anchor point data and the positive sample data, and maximizing a similarity difference between the anchor point data and the first difficult negative sample data.").
Wang and Shi are considered analogous because they are each concerned with language model training. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Wang with the teachings of Shi for the purpose of achieving desired language model performance. Given that all the claimed elements were known in the prior art, one skilled in the art could have combined the elements by known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results.
The combination of Wang and Shi does not explicitly teach language models that are n-gram models, and thus, Li is introduced. Li teaches at paragraph [0098], "For instance, applying the language models to generate the logits includes, for each language model (e.g., for model 241 and model 242), tokenizing the input (user input query) into discrete tokens (e.g., words or sub words) and then processing the tokens according to the specific architectures and parameters of each language model," and paragraph [0102], "At 380, the method includes generating an output token (or tokens in some cases) based on the determined (one or more) probabilities." Paragraph [0052] of the Specification discloses, "An N-gram LM indexes the statistics of n-grams (words and/or portions of words (e.g., lemmatized and/or stemmed tokens, etc.)) in training data and may use maximum likelihood estimation (MLE) to estimate probability of generated outputs."
Wang, Shi and Li are considered analogous because they are each concerned with language model training. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Wang and Shi with the teachings of Li for the purpose of model training efficiency, in view of design incentives disclosed in paragraph [0052] of the Specification, “Due to the properties of N-gram LMs, unlearning is trivial as it can be done by simply subtracting out the corresponding statistics of F from the model.”
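The design incentive quoted from paragraph [0052] of the Specification (unlearning an N-gram LM by subtracting out the corresponding statistics of F) may be sketched as follows. The bigram setting and the function names are illustrative assumptions, not text from the Specification.

```python
from collections import Counter

def ngram_counts(corpus, n=2):
    """Index n-gram statistics (here bigrams) over a tokenized corpus."""
    counts = Counter()
    for sentence in corpus:
        toks = sentence.split()
        counts.update(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return counts

def unlearn_ngram(model, forget_corpus, n=2):
    """Unlearn F by subtracting its n-gram statistics from the model;
    Counter subtraction drops counts that fall to zero or below."""
    return model - ngram_counts(forget_corpus, n)
```

As the sketch shows, no retraining is needed: the statistics of F are simply removed from the indexed counts, which is the triviality the Specification relies on.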
Regarding claim 2, Wang teaches The computer-implemented method of claim 1, further comprising: determining first loss comprising the first prediction difference (page 4, Knowledge Gap Alignment, "For Goal 2, we maintain the ability of model A∗ when processing the remaining data, i.e., Dr. We treat the original model AD as a teacher and directly minimize the distance of output distributions when feeding samples in Dr to A∗ and AD."). Examiner interprets "prediction difference" as analogous to "distance of output distributions". The instant claim compares the output from two different models and adjusts one model such that their outputs are similar. As taught by Wang, the distance between the teacher and student models' output distributions is minimized to align their outputs on the same input data.
Wang does not explicitly teach “determining second loss comprising the second prediction difference,” or “generating the updated LM based at least in part by updating parameters of the first LM to decrease the first loss and to increase the second loss,” however, Shi teaches determining second loss comprising the second prediction difference (page 10, “wherein the first expected loss function is used for identifying the second similarity between the anchor point data and the positive sample data, and the first similarity between the anchor point data and the first difficult negative sample data, the second expected loss function is used for the second similarity between the anchor point data and the positive sample data,”); and generating the updated LM based at least in part by updating parameters of the first LM to decrease the first loss and to increase the second loss (page 2, "based on the first target training direction, training to obtain a click behaviour prediction model, the click behaviour prediction model is used for predicting the probability that the object clicks on the content to be recommended, the first target training direction is to minimize the similarity difference between the anchor point data and the positive sample data, and maximizing a similarity difference between the anchor point data and the first difficult negative sample data.").
Wang and Shi are considered analogous because they are each concerned with language model training. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Wang with the teachings of Shi for the purpose of achieving desired language model performance. Given that all the claimed elements were known in the prior art, one skilled in the art could have combined the elements by known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results.
Regarding claim 3, Wang teaches The computer-implemented method of claim 1, wherein: the first prediction difference comprises a Kullback-Leibler divergence between a first probability distribution of the first LM for the first text input and a second probability distribution of the second n-gram language model for the first text input (page 4, Knowledge Gap Alignment, "For Goal 2, we maintain the ability of model A∗ when processing the remaining data, i.e., Dr. We treat the original model AD as a teacher and directly minimize the distance of output distributions when feeding samples in Dr to A∗ and AD," and, "dis(D)(A1, A2) indicates the difference of the output distributions between model A1 and A2 on data D, which can be evaluated by KL divergence, Bregman divergence, or any other distributional distance measurements."); and the second prediction difference comprises the Kullback-Leibler divergence between the first probability distribution of the first LM for the first text input and a third probability distribution of the first n-gram language model for the first text input (page 3, Knowledge Gap Alignment, "To achieve Goal 1, the output distribution of our target model A∗ on data Df (noted as A∗ (Df)) is expected to be similar to AD(Dn), where Dn should be an external set to D but with the similar distribution," and, "dis(D)(A1, A2) indicates the difference of the output distributions between model A1 and A2 on data D, which can be evaluated by KL divergence, Bregman divergence, or any other distributional distance measurements.").
Claims 5 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Wang in view of Shi.
Regarding claims 5 and 14, Wang does not explicitly teach a method or system “further comprising updating the first LM based at least in part by updating parameters of the first LM to decrease the first prediction difference and increase the second prediction difference,” however, Shi teaches updating the first LM based at least in part by updating parameters of the first LM to decrease the first prediction difference and increase the second prediction difference (page 2, "based on the first target training direction, training to obtain a click behaviour prediction model, the click behaviour prediction model is used for predicting the probability that the object clicks on the content to be recommended, the first target training direction is to minimize the similarity difference between the anchor point data and the positive sample data, and maximizing a similarity difference between the anchor point data and the first difficult negative sample data.").
Wang and Shi are considered analogous because they are each concerned with language model training. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Wang with the teachings of Shi for the purpose of achieving desired language model performance. Given that all the claimed elements were known in the prior art, one skilled in the art could have combined the elements by known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results.
Claims 6, 8, 9, 15, 17 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Wang in view of Li.
Regarding claims 6 and 15, Wang does not explicitly teach a method or system “wherein a first number of parameters of the first LM is at least a magnitude greater than a second number of parameters of the first auxiliary LM,” however, Li teaches a first number of parameters of the first LM is at least a magnitude greater than a second number of parameters of the first auxiliary LM (paragraph [0146], "An analysis of mitigation performance is presented, with a focus on the results obtained using the Pythia 2.8B LLM and the Pythia 160M small language model (SLM). As demonstrated in Table 800 of FIG. 8, the ensemble strategy employing an SLM with only 160M parameters effectively reduces PII leakage while maintaining optimal model performance.").
Wang and Li are considered analogous because they are each concerned with language model training. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Wang with the teachings of Li for the purpose of improving model training efficiency. Given that all the claimed elements were known in the prior art, one skilled in the art could have combined the elements by known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results.
Regarding claims 8 and 17, Wang does not explicitly teach a method or system “wherein the first auxiliary LM is a first n-gram LM and the second auxiliary LM is a second n-gram LM,” however, Li teaches the first auxiliary LM is a first n-gram LM and the second auxiliary LM is a second n-gram LM (paragraph [0098], "For instance, applying the language models to generate the logits includes, for each language model (e.g., for model 241 and model 242), tokenizing the input (user input query) into discrete tokens (e.g., words or sub words) and then processing the tokens according to the specific architectures and parameters of each language model," and paragraph [0102], "At 380, the method includes generating an output token (or tokens in some cases) based on the determined (one or more) probabilities."). Paragraph [0052] of the Specification discloses, "An N-gram LM indexes the statistics of n-grams (words and/or portions of words (e.g., lemmatized and/or stemmed tokens, etc.)) in training data and may use maximum likelihood estimation (MLE) to estimate probability of generated outputs."
Wang and Li are considered analogous because they are each concerned with language model training. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Wang with the teachings of Li for the purpose of model training efficiency, in view of design incentives disclosed in paragraph [0052] of the Specification, “Due to the properties of N-gram LMs, unlearning is trivial as it can be done by simply subtracting out the corresponding statistics of F from the model.”
Regarding claims 9 and 18, Wang does not explicitly teach a method or system “wherein the second auxiliary LM is a language model trained on a dataset comprising public domain text data,” however, Li teaches the second auxiliary LM is a language model trained on a dataset comprising public domain text data (paragraph [0061], "For example, LLMs are trained on massive datasets that may contain copyrighted material, making it difficult to ensure compliance with intellectual property (IP) rights. This can result in legal, ethical, and financial consequences, especially for code models trained on open-source repositories without adhering to licensing terms. Lawsuits have already been filed against various organizations reproducing licensed code without following license terms. Various proposals have been put forth to mitigate this risk, including filtering licensed data and implementing techniques to 'unlearn' specific training data.").
Wang and Li are considered analogous because they are each concerned with language model training. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Wang with the teachings of Li for the purpose of legitimizing model training, in view of design incentives disclosed in paragraph [0012] of the Specification, “For example, a copyright owner of a work (e.g., a written work, an artwork, etc.) may want to have their work removed from the training corpus of an LM.”
Claims 11 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Wang in view of "Machine Unlearning" by Bourtoule et al. (hereinafter, "Bourtoule").
Regarding claims 11 and 20, Wang does not explicitly teach a method or system “wherein data of the n training partitions excludes the first data F,” and thus, Bourtoule is introduced. Bourtoule teaches data of the n training partitions excludes the first data F (page 2, Introduction, "In addition, rather than training each model on the entire shard directly, we can divide each shard’s data into slices and present slices incrementally during training. We save the state of model parameters before introducing each new slice, allowing us to start retraining the model from the last known parameter state that does not include the point to be unlearned—rather than a random initialization").
Wang and Bourtoule are considered analogous because they are each concerned with language model training. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Wang with the teachings of Bourtoule for the purpose of achieving desired language model performance. Given that all the claimed elements were known in the prior art, one skilled in the art could have combined the elements by known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results.
Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Wang in view of "Reinforcement Unlearning" by Ye et al. (hereinafter, "Ye").
Regarding claim 12, Wang does not explicitly teach “The method of claim 4, further comprising: determining a reinforcement learning policy with a reward term that rewards the first LM for generating outputs that are statistically similar to outputs of the second auxiliary LM and a penalty term that penalizes the first LM for generating outputs that are statistically similar to outputs of the first auxiliary LM for a given input, wherein statistical similarity is determined using a first statistical similarity metric,” and thus, Ye is introduced.
Ye teaches determining a reinforcement learning policy with a reward term that rewards the first LM for generating outputs that are statistically similar to outputs of the second auxiliary LM (page 1, Introduction, “With each action taken, the agent receives a reward and updates its state, creating an experience sample used to update its policy.”) and a penalty term that penalizes the first LM for generating outputs that are statistically similar to outputs of the first auxiliary LM for a given input (page 7, Direct Reward Inversion, “As both unlearning methods focus on reducing the agent’s received reward in the unlearning environment, a seemingly straightforward approach involves directly inverting the received reward: changing a real reward r to −r.”), wherein statistical similarity is determined using a first statistical similarity metric (page 7, Environmental Inference, "During the crossover process, pairs of transition functions are selected based on their fitness, determined by their similarity to the learned optimal policy.").
Wang and Ye are considered analogous because they are each concerned with language model training. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Wang with the teachings of Ye for the purpose of achieving desired language model performance. Given that all the claimed elements were known in the prior art, one skilled in the art could have combined the elements by known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results.
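The direct reward inversion quoted from Ye may be sketched as follows. The tabular Q-learning setting and all function names are illustrative assumptions of this sketch, not drawn from Ye.

```python
def unlearning_reward(real_reward, in_forget_env):
    """Direct reward inversion: within the environment to be forgotten,
    flip the sign of the real reward so the policy is steered away from
    behaviour learned there; elsewhere the reward is unchanged."""
    return -real_reward if in_forget_env else real_reward

def q_update(q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """One tabular Q-learning step using the (possibly inverted) reward."""
    best_next = max(q.get((s_next, b), 0.0) for b in actions)
    old = q.get((s, a), 0.0)
    q[(s, a)] = old + alpha * (r + gamma * best_next - old)
    return q
```

With the inverted reward, state-action pairs that were rewarding in the unlearning environment acquire negative value estimates, penalizing the previously learned behaviour.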
Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
China Invention Application CN 112529209 to Meng.
China Invention Application CN 113807455 to Kong et al.
U.S. Patent Application Publication 2022/0147864 to Chang et al.
“QUARK: Controllable Text Generation with Reinforced Unlearning” by Lu et al.
“Knowledge Unlearning for LLMs: Tasks, Methods, and Challenges” by Si et al.
“Learn to Unlearn: A Survey on Machine Unlearning” by Qu et al.
“Towards Unbounded Machine Unlearning” by Kurmanji et al.
“Towards Making Systems Forget with Machine Unlearning” by Cao et al.
“Coded Machine Unlearning” by Aldaghri et al.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SEAN T SMITH whose telephone number is (571)272-6643. The examiner can normally be reached Monday - Friday 8:00am - 5:00pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, PIERRE-LOUIS DESIR can be reached at (571) 272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/SEAN THOMAS SMITH/Examiner, Art Unit 2659
/PIERRE LOUIS DESIR/Supervisory Patent Examiner, Art Unit 2659