Prosecution Insights
Last updated: April 19, 2026
Application No. 18/344,419

UN-LEARNING OF TRAINING DATA FOR MACHINE LEARNING MODELS

Non-Final OA §103
Filed: Jun 29, 2023
Examiner: NYE, LOUIS CHRISTOPHER
Art Unit: 2141
Tech Center: 2100 — Computer Architecture & Software
Assignee: Amazon Technologies, Inc.
OA Round: 1 (Non-Final)
Grant Probability: 22% (At Risk)
OA Rounds: 1-2
To Grant: 3y 2m
With Interview: 58%

Examiner Intelligence

Grants only 22% of cases.
Career Allow Rate: 22% (2 granted / 9 resolved; -32.8% vs TC avg)
Interview Lift: +35.7% (resolved cases with vs. without interview)
Avg Prosecution: 3y 2m (typical timeline; 27 currently pending)
Total Applications: 36 (career history, across all art units)

Statute-Specific Performance

§101: 38.3% (-1.7% vs TC avg)
§103: 50.0% (+10.0% vs TC avg)
§102: 7.8% (-32.2% vs TC avg)
§112: 3.9% (-36.1% vs TC avg)
Tech Center average estimate shown for comparison • Based on career data from 9 resolved cases

Office Action

§103
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Bourtoule et al. (NPL from IDS: Machine Unlearning, published 2021, hereinafter “Bourtoule”) in view of Houlsby et al. (NPL from IDS: Parameter-Efficient Transfer Learning for NLP, published June 2019, hereinafter “Houlsby”).

Regarding claim 1, Bourtoule teaches a computer-implemented method, comprising:

receiving a request to remove a data sample from a dataset (Bourtoule, Pg. 2, Col. 1, Paragraph 1 – “Finally, when a request to unlearn a training point arrives, we need to retrain only the affected model.” – teaches receiving a request to remove a data sample from a dataset), the dataset comprising a plurality of shards each corresponding to a portion of the dataset, training data in the plurality of shards used to train a respective plurality of instances of a model (Bourtoule, Pg. 2, Col. 1, Paragraph 1 – “First, we divide the training data into multiple disjoint shards such that a training point is included in one shard only; shards partition the data. Then, we train models in isolation on each of these shards, which limits the influence of a point to the model that was trained on the shard containing the point.” – teaches the dataset comprising a plurality of shards each corresponding to a portion of the dataset (divides training data into multiple disjoint shards), training data in the plurality of shards used to train a respective plurality of instances of a model (train models in isolation on each of these shards));

identifying an instance of the model that is trained using a shard that contains the data sample (Bourtoule, Fig. 2 and Pg. 2, Col. 1, Paragraph 1 – “Then, we train models in isolation on each of these shards, which limits the influence of a point to the model that was trained on the shard containing the point. Finally, when a request to unlearn a training point arrives, we need to retrain only the affected model.” – teaches identifying an instance of the model that is trained using a shard that contains the data sample (trains models in isolation on each shard, limits influence of a point to the model that was trained on the shard containing the point, thus identifying an instance of the model that is trained using a shard that contains the sample));

identifying a slice of the shard that contains the data sample, the shard comprising a plurality of slices of the training data each corresponding to a checkpoint set during training (Bourtoule, Fig. 2 and Pg. 2, Col. 1, Paragraph 1 – “In addition, rather than training each model on the entire shard directly, we can divide each shard’s data into slices and present slices incrementally during training. We save the state of model parameters before introducing each new slice, allowing us to start retraining the model from the last known parameter state that does not include the point to be unlearned—rather than a random initialization.” – teaches identifying a slice of the shard that contains the data sample, the shard comprising a plurality of slices of the training data each corresponding to a checkpoint set during training (can divide each shard’s data into slices and present slices incrementally during training, can save state of model parameters before introducing each slice allowing to retrain from a checkpoint set during training that does not include the identified point to be unlearned within the slice of the shard));

removing the data sample from the identified slice of the dataset (Bourtoule, Fig. 1 and Pg. 2, Col. 1, Paragraph 1 – “We save the state of model parameters before introducing each new slice, allowing us to start retraining the model from the last known parameter state that does not include the point to be unlearned—rather than a random initialization.” – teaches removing the data sample from the identified slice of the dataset (retrains with last known parameter state that does not include the point to be unlearned, thus removing the data sample from the identified slice of the dataset, and further in Fig. 1, shows removing data sample from identified slice of the dataset));

retraining the identified instance of the model starting from the checkpoint that was most recently set before the identified instance was trained using the data in the slice (Bourtoule, Pg. 2, Col. 1, Paragraph 1 – “We save the state of model parameters before introducing each new slice, allowing us to start retraining the model from the last known parameter state that does not include the point to be unlearned—rather than a random initialization.” – teaches starting from the checkpoint (saved state of model parameters) that was most recently set before the identified instance was trained using the data in the slice (retraining from last known parameter state that does not include the point to be unlearned)); and

providing the retrained instance with the other instances, of the plurality of instances of the model, to generate a plurality of inferences to be used to generate a consensus inference output (Bourtoule, Fig. 2 and Pg. 2, Col. 1, Paragraph 1 – “At inference, we use different strategies to aggregate the predictions of models trained on each shard: the simplest one is a majority vote over predicted labels.” – teaches instances of the plurality of instances (models trained on each shard) to generate a plurality of inferences (predictions) to be used to generate a consensus inference output (majority vote over predicted labels)).

Bourtoule fails to explicitly teach a [[the]] language model; retraining the identified instance of the language model using a set of adapter weights. However, analogous to the field of the claimed invention, Houlsby teaches:

a [[the]] language model (Houlsby, Fig. 2 and Pg. 2, Col. 2, Last Paragraph – “We instantiate adapter-based tuning for text Transformers. These models attain state-of-the-art performance in many NLP tasks, including translation, extractive QA, and text classification problems” – teaches a language model);

retraining the identified instance of the language model using a set of adapter weights (Houlsby, Pg. 2, Col. 2, Paragraph 3 – “Adapter modules perform more general architectural modifications to re-purpose a pre-trained network for a downstream task. In particular, the adapter tuning strategy involves injecting new layers into the original network. The weights of the original network are untouched, whilst the new adapter layers are initialized at random. In standard fine-tuning, the new top-layer and the original weights are co-trained. In contrast, in adapter tuning, the parameters of the original network are frozen and therefore may be shared by many tasks.” and in Fig. 2 Description – “During adapter tuning, the green layers are trained on the downstream data, this includes the adapter, the layer normalization parameters, and the final classification layer” – teaches training the identified instance of the language model using a set of adapter weights (adapter modules initialize adapter layers with adapter weights, trains identified instance of language model using adapter weights)).

Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to incorporate the language model and adapter weights of Houlsby into the shards, slices, retraining, and unlearning of Bourtoule in order to unlearn a data sample from a language model. Doing so would meet the right to be forgotten requirements, thus achieving erasure of data from ML models (Bourtoule, Introduction), would allow the model to be extended to new tasks without affecting previous ones, and would provide parameter-efficient tuning for NLP (Houlsby, Introduction).

Claims 6-7 and 16-17 incorporate substantively all the limitations of claim 1 in a computer-implemented method and a system, and are rejected on the same grounds as above. Bourtoule teaches a processor and a memory at Pg. 10, Col. 1, Paragraph 2 – “We run our experiments using P100 and T4 Nvidia GPUs, with 12 and 16 GB of dedicated memory, respectively. We use Intel Xeon Silver 4110 CPUs with 8 cores each and 192GB of RAM”.
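For orientation, the sharded/sliced unlearning flow that Bourtoule is cited for in claim 1 can be sketched in code. This is a minimal illustrative sketch, not the applicant's claimed method or Bourtoule's actual implementation; all identifiers (`ShardedEnsemble`, `train_step`, etc.) are hypothetical.

```python
# Illustrative sketch of a SISA-style flow: disjoint shards, incremental
# slices with a checkpoint saved before each slice, retraining only the
# affected model from the last clean checkpoint, and majority-vote inference.
import copy
from collections import Counter

class ShardedEnsemble:
    def __init__(self, samples, num_shards, num_slices):
        # Disjoint shards: each sample lives in exactly one shard.
        self.shards = [samples[i::num_shards] for i in range(num_shards)]
        self.num_slices = num_slices
        self.models = []       # one model instance per shard
        self.checkpoints = []  # per shard: parameters saved before each slice

    def _slices(self, shard):
        n = max(1, len(shard) // self.num_slices)
        return [shard[i:i + n] for i in range(0, len(shard), n)]

    def train(self, init_model, train_step):
        for shard in self.shards:
            model, ckpts = copy.deepcopy(init_model), []
            for sl in self._slices(shard):
                ckpts.append(copy.deepcopy(model))  # checkpoint before the slice
                model = train_step(model, sl)       # present slices incrementally
            self.models.append(model)
            self.checkpoints.append(ckpts)

    def unlearn(self, sample, train_step):
        # Only the one shard (and slice) containing the sample is affected.
        for si, shard in enumerate(self.shards):
            if sample in shard:
                slices = self._slices(shard)
                for li, sl in enumerate(slices):
                    if sample in sl:
                        sl.remove(sample)  # drop the sample from its slice
                        # Restart from the checkpoint set just before that slice.
                        model = copy.deepcopy(self.checkpoints[si][li])
                        for later in slices[li:]:
                            model = train_step(model, later)
                        self.models[si] = model
                        self.shards[si] = [s for sl2 in slices for s in sl2]
                        return

    def predict(self, x):
        # Consensus output: majority vote over the per-shard predictions.
        votes = Counter(m(x) for m in self.models)
        return votes.most_common(1)[0][0]
```

Note that an unlearning request here touches only one model instance and retrains from the most recent checkpoint that predates the removed sample, which is the efficiency argument the examiner maps onto the claim.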
Regarding claim 2, the combination of Bourtoule and Houlsby teaches the computer-implemented method of claim 1, wherein generating the updated inference output further comprises: determining the consensus inference output based on a majority vote based on the plurality of inferences (Bourtoule, Pg. 2, Col. 1, Paragraph 1 – “At inference, we use different strategies to aggregate the predictions of models trained on each shard: the simplest one is a majority vote over predicted labels.” – teaches determining the consensus inference output based on a majority vote based on the plurality of inferences (performs majority vote over predicted labels produced by the plurality of models trained on each shard)). Claims 8 and 18 are similar to claim 2, hence similarly rejected.

Regarding claim 3, the combination of Bourtoule and Houlsby teaches the computer-implemented method of claim 1, wherein data samples are positioned in the slices of a shard of the dataset based at least in part on a likelihood that a request will be received to remove the data samples from the dataset (Bourtoule, Pg. 13, Col. 2, Paragraph 3 – “We discuss one such approach in Algorithm 1, under the following assumptions: (a) the distribution of unlearning requests is known precisely, and (b) this distribution is relatively constant over a time interval. Recall that each data point du ∈ D has an associated probability p(u) with which it may be erased. We first sort the data points in the order of their erasure probability, and assign points to a shard Di till the desired value of E(Di) is reached.” – teaches wherein data samples are positioned in the slices of a shard of the dataset based at least in part on a likelihood that a request will be received to remove the data samples from the dataset).
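The distribution-aware placement that Bourtoule's Algorithm 1 is cited for (claim 3, and again for claims 11-13 below) can be sketched as follows. This is a hedged sketch under stated assumptions: `assign_shards` and the target value are illustrative names, not Bourtoule's actual code.

```python
# Illustrative sketch of Algorithm 1's idea: sort points by erasure
# probability p(u), then fill each shard until its expected number of
# unlearning requests E(Di) reaches a target, so the points most likely
# to be unlearned cluster together in a few small shards.
def assign_shards(points, erase_prob, target_expected_requests):
    """points: sample ids; erase_prob: id -> probability of erasure."""
    ordered = sorted(points, key=erase_prob, reverse=True)
    shards, current, expected = [], [], 0.0
    for p in ordered:
        current.append(p)
        expected += erase_prob(p)
        if expected >= target_expected_requests:  # E(Di) reached: close shard
            shards.append(current)
            current, expected = [], 0.0
    if current:
        shards.append(current)
    return shards
```

With this layout, an unlearning request most often hits one of the small high-churn shards, keeping the retraining cost low, which is the motivation the rejection attributes to Bourtoule.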
Regarding claim 4, the combination of Bourtoule and Houlsby teaches the computer-implemented method of claim 1, wherein the plurality of instances of the language model are trained using a set of the adapter weights and a set of base weights (Houlsby, Pg. 2, Col. 2, Paragraph 1 – “We present a strategy for tuning a large text model on several downstream tasks.”, Fig. 2, and in Pg. 2, Col. 2, Paragraph 3 – “Adapter modules perform more general architectural modifications to re-purpose a pre-trained network for a downstream task. In particular, the adapter tuning strategy involves injecting new layers into the original network. The weights of the original network are untouched, whilst the new adapter layers are initialized at random. In standard fine-tuning, the new top-layer and the original weights are co-trained. In contrast, in adapter tuning, the parameters of the original network are frozen and therefore may be shared by many tasks.” – teaches wherein the language model (large text model) is trained (trains by only tuning adapter weights and keeping base weights frozen) using a set of adapter weights (adapter weights of adapter layers injected into network) and a set of base weights (original weights)).

Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to incorporate the language model, adapter weights, and base weights of Houlsby into the plurality of instances of Bourtoule in order to train a plurality of instances of the language model on a set of adapter weights and base weights. Doing so would provide a method for parameter-efficient tuning for NLP and permit sequential training of a plurality of models (Houlsby, Introduction). Claim 14 is similar to claim 4, hence similarly rejected.

Regarding claim 5, the combination of Bourtoule and Houlsby teaches the computer-implemented method of claim 4, wherein a subset of the weights is stored for each slice (Bourtoule, Fig. 2 Description – “One constituent model is trained on each shard by presenting it with incrementally many slices and saving its parameters before the training set is augmented with a new slice. When data needs to be unlearned, only one of the constituent models whose shards contains the point to be unlearned needs to be retrained — retraining can start from the last parameter values saved before including the slice containing the data point to be unlearned.” – teaches wherein a subset of weights is stored for each slice (parameters are saved before training is augmented with a new slice of the shard)).

Bourtoule fails to explicitly teach the adapter weights and the instance of the language model is retrained using a respective set of the adapter weights without modifying the base weights. However, analogous to the field of the claimed invention, Houlsby teaches:

the adapter weights (Houlsby, Pg. 2, Col. 2, Paragraph 3 – “Adapter modules perform more general architectural modifications to re-purpose a pre-trained network for a downstream task. In particular, the adapter tuning strategy involves injecting new layers into the original network. The weights of the original network are untouched, whilst the new adapter layers are initialized at random. In standard fine-tuning, the new top-layer and the original weights are co-trained. In contrast, in adapter tuning, the parameters of the original network are frozen and therefore may be shared by many tasks.” and in Fig. 2 Description – “During adapter tuning, the green layers are trained on the downstream data, this includes the adapter, the layer normalization parameters, and the final classification layer” – teaches the adapter weights) and the instance of the language model is retrained using a respective set of the adapter weights without modifying the base weights (Houlsby, Pg. 2, Col. 2, Paragraph 1 – “We present a strategy for tuning a large text model on several downstream tasks.”, Fig. 2, and in Pg. 2, Col. 2, Paragraph 3 – “Adapter modules perform more general architectural modifications to re-purpose a pre-trained network for a downstream task. In particular, the adapter tuning strategy involves injecting new layers into the original network. The weights of the original network are untouched, whilst the new adapter layers are initialized at random. In standard fine-tuning, the new top-layer and the original weights are co-trained. In contrast, in adapter tuning, the parameters of the original network are frozen and therefore may be shared by many tasks.” – teaches wherein the language model (large text model) is retrained (trains by only tuning adapter weights and keeping base weights frozen) using a set of adapter weights (adapter weights of adapter layers injected into network) without modifying the base weights (original weights)).

Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to incorporate the language model and adapter weights of Houlsby into the shards, slices, checkpoints, and retraining of Bourtoule in order to retrain the language model using a set of adapter weights and base weights. Doing so would allow the model to be extended to new tasks without affecting previous ones, and would provide parameter-efficient tuning for NLP (Houlsby, Introduction). Claim 15 is similar to claim 5, hence similarly rejected. Claim 20 is similar to claims 4 and 5, hence similarly rejected.

Regarding claim 9, the combination of Bourtoule and Houlsby teaches the computer-implemented method of claim 7, wherein each slice corresponds to a checkpoint set after training of a respective instance of the model (Bourtoule, Pg. 2, Col. 1, Paragraph 1 – “In addition, rather than training each model on the entire shard directly, we can divide each shard’s data into slices and present slices incrementally during training. We save the state of model parameters before introducing each new slice, allowing us to start retraining the model from the last known parameter state that does not include the point to be unlearned—rather than a random initialization.” – teaches the shard comprising a plurality of slices of the training data each corresponding to a checkpoint set during training a respective instance of the model (can divide each shard’s data into slices and present slices incrementally during training, can save state of model parameters before introducing each slice allowing to retrain from a checkpoint set during training)).

Bourtoule fails to explicitly teach the language model. However, analogous to the field of the claimed invention, Houlsby teaches: the language model (Houlsby, Fig. 2 and Pg. 2, Col. 2, Last Paragraph – “We instantiate adapter-based tuning for text Transformers. These models attain state-of-the-art performance in many NLP tasks, including translation, extractive QA, and text classification problems” – teaches a language model).

Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to incorporate the language model of Houlsby into the shards, slices, and training checkpoints of Bourtoule in order to set checkpoints in training a language model. Doing so would provide parameter-efficient tuning for NLP and enable the model to have perfect memory of previous tasks using a small number of task-specific parameters (Houlsby, Introduction).

Regarding claim 10, the combination of Bourtoule and Houlsby teaches the computer-implemented method of claim 9, further comprising:

determining a slice that contains the data sample to be removed, the slice corresponding to a checkpoint (Bourtoule, Fig. 1 and Pg. 2, Col. 1, Paragraph 1 – “We save the state of model parameters before introducing each new slice, allowing us to start retraining the model from the last known parameter state that does not include the point to be unlearned—rather than a random initialization.” – teaches determining a slice that contains the data sample to be removed, the slice corresponding to a checkpoint (saves state of model parameters before introducing each slice, thus each slice corresponds to a checkpoint, and retrains from last known state that does not include point to be unlearned, thus determining the slice that contains the data sample to be removed));

removing the data sample from the slice (Bourtoule, Fig. 1 and Pg. 2, Col. 1, Paragraph 1 – “We save the state of model parameters before introducing each new slice, allowing us to start retraining the model from the last known parameter state that does not include the point to be unlearned—rather than a random initialization.” – teaches removing the data sample from the slice (retrains with last known parameter state that does not include the point to be unlearned, thus removing the data sample from the slice, and further in Fig. 1, shows removing data sample from slice)); and

retraining the instance of the model from a checkpoint that was most recently set before the slice was used to train the identified instance (Bourtoule, Fig. 1 and Pg. 2, Col. 1, Paragraph 1 – “We save the state of model parameters before introducing each new slice, allowing us to start retraining the model from the last known parameter state that does not include the point to be unlearned—rather than a random initialization.” – teaches retraining the instance of the model from a checkpoint that was most recently set before the slice was used to train the identified instance).

Bourtoule fails to explicitly teach the language model. However, analogous to the field of the claimed invention, Houlsby teaches: the language model (Houlsby, Fig. 2 and Pg. 2, Col. 2, Last Paragraph – “We instantiate adapter-based tuning for text Transformers. These models attain state-of-the-art performance in many NLP tasks, including translation, extractive QA, and text classification problems” – teaches a language model).

Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to incorporate the language model of Houlsby into the shards, slices, and training checkpoints of Bourtoule in order to set checkpoints in training a language model. Doing so would provide parameter-efficient tuning for NLP and enable the model to have perfect memory of previous tasks using a small number of task-specific parameters (Houlsby, Introduction). Claim 19 is similar to claims 9 and 10, hence similarly rejected.

Regarding claim 11, the combination of Bourtoule and Houlsby teaches the computer-implemented method of claim 7, wherein the data samples are positioned in the slices of a shard of the dataset based on a determined ranking of the data samples (Bourtoule, Pg. 13, Col. 2, Paragraph 3 – “We discuss one such approach in Algorithm 1, under the following assumptions: (a) the distribution of unlearning requests is known precisely, and (b) this distribution is relatively constant over a time interval. Recall that each data point du ∈ D has an associated probability p(u) with which it may be erased. We first sort the data points in the order of their erasure probability, and assign points to a shard Di till the desired value of E(Di) is reached.” – teaches wherein data samples are positioned in the slices of a shard of the dataset based on a determined ranking of the data samples (sorts the data points in order of their erasure probability)).
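The adapter-style retraining that Houlsby is cited for in the rejections of claims 4-5 above (frozen base weights, trainable adapter weights) can be sketched as follows. This is an illustrative sketch only: the layer shapes, class names, and the placeholder "training" update are assumptions, not Houlsby's architecture or the claimed method.

```python
import numpy as np

# Illustrative sketch of adapter tuning: the base weights are frozen and
# shared, while each instance trains (and would checkpoint) only its small
# set of adapter weights. The update rule is a placeholder, not real SGD.
class AdapterLayer:
    def __init__(self, dim, bottleneck, rng):
        self.down = rng.standard_normal((dim, bottleneck)) * 0.01
        self.up = np.zeros((bottleneck, dim))  # zero-init: near-identity adapter

    def forward(self, h):
        # Residual bottleneck: output equals input until the adapter is trained.
        return h + (h @ self.down) @ self.up

class FrozenBaseWithAdapter:
    def __init__(self, base_weights, dim, bottleneck, seed=0):
        self.base = base_weights  # original network weights, never modified
        self.adapter = AdapterLayer(dim, bottleneck, np.random.default_rng(seed))

    def forward(self, x):
        h = x @ self.base                      # frozen base transform
        return self.adapter.forward(h)

    def train_on_slice(self, sl, lr=0.1):
        # Placeholder update touching only adapter weights, not self.base.
        for _ in sl:
            self.adapter.up += lr * 0.01
```

In the combination the rejection proposes, only the adapter weights would need to be saved per slice checkpoint and re-fit on unlearning, since the base weights are shared and untouched.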
Regarding claim 12, the combination of Bourtoule and Houlsby teaches the computer-implemented method of claim 11, wherein data samples with a higher likelihood of being removed from the dataset are placed in slices used for training after data samples with a lower likelihood of being removed (Bourtoule, Algorithm 1 and Pg. 13, Col. 2, Paragraph 3 – “We first sort the data points in the order of their erasure probability, and assign points to a shard Di till the desired value of E(Di) is reached. Once this value is exceeded, we create a new shard Di+1 and restart the procedure with the residual data D\Di. By enforcing a uniform cumulative probability of unlearning across shards, Algorithm 1 naturally aggregates the training points that are likely to require unlearning into fewer shards that are also smaller in size.” – teaches wherein data samples with a higher likelihood of being removed from the dataset are placed in slices used for training after data samples with a lower likelihood of being removed (creates new, empty shard and populates shard with samples of highest erasure probability by removing the lowest probability samples, thus the samples of higher erasure probability are placed in slices after samples with lower likelihood)).

Regarding claim 13, the combination of Bourtoule and Houlsby teaches the computer-implemented method of claim 11, wherein data samples associated with a higher determined importance are placed in slices used for training before data samples associated with a lower determined importance (Bourtoule, Algorithm 1 and Pg. 13, Col. 2, Paragraph 3 – “We first sort the data points in the order of their erasure probability, and assign points to a shard Di till the desired value of E(Di) is reached. Once this value is exceeded, we create a new shard Di+1 and restart the procedure with the residual data D\Di. By enforcing a uniform cumulative probability of unlearning across shards, Algorithm 1 naturally aggregates the training points that are likely to require unlearning into fewer shards that are also smaller in size.” – teaches wherein data samples with a higher importance are placed in slices used for training before data samples with a lower importance (creates new, empty shard and populates shard with samples of highest probability by removing the lowest probability samples, thus the samples of higher importance are placed in slices before samples with lower importance)).

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.

Brophy et al. (NPL: Exit Through the Training Data: A Look into Instance-Attribution Explanations and Efficient Data Deletion in Machine Learning, published Sept. 2020) teaches methods for efficient data deletion in machine learning models. Methods include Leave-One-Out retraining, wherein one or more instances are removed from the training data and the model is retrained on the remaining instances.

Enayat et al. (US Pub. No. 2023/0118785, filed Oct. 2021) teaches systems and methods for unlearning training data of a neural network, wherein data may be inserted or deleted upon request and a gradient is computed based upon the addition or removal of data. If the gradient meets a stochastic condition, then the neural network is retrained to obtain a modified neural network. Teaches applying machine unlearning to language models.

Tahiliani et al. (NPL: Machine Unlearning: Its Need and Implementation Strategies, published Nov. 2021) teaches methods for machine unlearning wherein data may be user activities on applications and platforms and is sharded and sliced. Teaches wherein two sets of weights are trained, one set of weights corresponding to the core dataset and a second set of weights corresponding to a changeable user data set.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to LOUIS C NYE whose telephone number is 571-272-0636. The examiner can normally be reached Monday - Friday 9:00AM - 5:00PM.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, MATTHEW ELL, can be reached at 571-270-3264. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/LOUIS CHRISTOPHER NYE/
Examiner, Art Unit 2141

/MATTHEW ELL/
Supervisory Patent Examiner, Art Unit 2141

Prosecution Timeline

Jun 29, 2023
Application Filed
Mar 02, 2026
Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12524683
METHOD FOR PREDICTING REMAINING USEFUL LIFE (RUL) OF AERO-ENGINE BASED ON AUTOMATIC DIFFERENTIAL LEARNING DEEP NEURAL NETWORK (ADLDNN)
2y 5m to grant • Granted Jan 13, 2026
Study what changed to get past this examiner. Based on the 1 most recent grant.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 22%
With Interview: 58% (+35.7%)
Median Time to Grant: 3y 2m
PTA Risk: Low
Based on 9 resolved cases by this examiner. Grant probability derived from career allow rate.
