DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 2/02/2026 has been entered.
Claims 1, 11, and 18 have been amended. Claims 1-7 and 9-20 are pending and have been examined.
Claim Rejections - 35 USC § 101
The rejections of Claims 1-7 and 9-20 under 35 U.S.C. § 101 are WITHDRAWN in view of Applicant’s arguments and the amendments to Claims 1, 11, and 18.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-7 and 9-20 are rejected under 35 U.S.C. 103 as being unpatentable over Brock et al. (“FreezeOut: Accelerate Training by Progressively Freezing Layers”), from applicant IDS, hereinafter “Brock”, in view of Yuan et al. (“MEST: Accurate and Fast Memory-Economic Sparse Training Framework on the Edge”), from applicant IDS, hereinafter “Yuan”, further in view of Cheng et al. (“A Simple Non-i.i.d. Sampling Approach for Efficient Training and Better Generalization”), hereinafter “Cheng”.
Regarding Claim 1, Brock teaches:
A framework for training an electronic… network, the framework comprising a processor (Brock, pg. 2, 3rd paragraph, “FreezeOut is easy to implement in a dynamic graph framework, requiring approximately 15 unique lines of code in PyTorch”, pg. 3, 3rd paragraph, “For the speedups presented here, we use the values attained by running each test sweep in a controlled environment, on a GTX1080Ti with no other programs running.” Brock performs their method on a computer, in which processor, memory, and storage devices are inherent.):
actively train all layers of the… network using a… training dataset comprised of training samples (Brock, pg. 1, 2nd paragraph, “training a network by only training each layer for a set portion of the training schedule”, Brock, pg. 3, Fig 2, Model is trained on training samples from CIFAR-100 dataset);
progressively freeze the layers of the sparse network in a sequential manner to obtain a trained sparse network (Brock, pg. 1, 2nd paragraph, “progressively "freezing out" layers and excluding them from the backward pass.”).
Brock does not expressly teach
a sparse network
using a partial training dataset
initialize the sparse network, the sparse network having a sparse structure
data sieve the training samples to update the partial training dataset
randomly selecting a percentage of total training samples of a training dataset to create the partial training dataset and a removed dataset;
training the sparse network for every epoch using only the partial training dataset and not the removed dataset;
updating the partial training dataset for every epoch by removing a number of the training samples from the partial training dataset and adding the removed training samples to the removed dataset, wherein for every epoch, a current partial training dataset is updated by removing an easiest percentage of the training sample from the partial training dataset and adding them to the removed dataset to reduce training floating point operations per second (FLOPs) to accelerate training of the electronic sparse network; and
retrieving the same number of removed training samples from the removed dataset and adding the retrieved training samples back to the partial training dataset to keep the total number of training samples in the partial training dataset unchanged.
However, Yuan teaches
a sparse network and initialize the sparse network, the sparse network having a sparse structure (Yuan, pg. 2, 1st paragraph, “they start with a sparse model structure picked intuitively… explore various sparse topologies”);
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Yuan, starting with a sparse network, with the layer freezing taught by Brock. The modification would have been motivated because starting with a sparse network makes training of the network more efficient (Yuan, pg. 2, 1st paragraph, “The underlying principle of sparse training is that the total epoch number is the same as dense training, but the speed of each training iteration (batch) is significantly improved”).
Brock in view of Yuan does not expressly teach
using a partial training dataset
data sieve the training samples to update the partial training dataset
randomly selecting a percentage of total training samples of a training dataset to create the partial training dataset and a removed dataset;
training the sparse network for every epoch using only the partial training dataset and not the removed dataset;
updating the partial training dataset for every epoch by removing a number of the training samples from the partial training dataset and adding the removed training samples to the removed dataset, wherein for every epoch, a current partial training dataset is updated by removing an easiest percentage of the training sample from the partial training dataset and adding them to the removed dataset to reduce training floating point operations per second (FLOPs) to accelerate training of the electronic sparse network; and
retrieving the same number of removed training samples from the removed dataset and adding the retrieved training samples back to the partial training dataset to keep the total number of training samples in the partial training dataset unchanged.
However, Cheng teaches
using a partial training dataset (Partial dataset is non discarded samples, Cheng, p. 2, Figure 1 description, “all the available training examples (orange dots) are adopted for training… as the training goes on, we gradually discard training examples”)
data sieve the training samples to update the partial training dataset (Cheng, p. 1, Abstract, “selectively drop easy samples and refresh them only periodically”)
randomly (Creating the datasets by removing samples is based on lowest losses, which depend on random augmentations, Cheng, p. 2, Figure 1 description, “Samples are removed by dropping examples with lowest losses and the discarded samples might be different in different training cycles”, p. 5, col. 2, paragraph 3, “We adopt standard data augmentations (random crop and random flip)”) selecting a percentage of total training samples of a training dataset to create the partial training dataset and a removed dataset (removed dataset is examples not kept, Cheng, p. 4, col. 2, paragraph 4, “a keep rate (p)… keep p% of the current examples”);
training the sparse network for every epoch using only the partial training dataset and not the removed dataset (Cheng, p. 2, Figure 1 description, “First, all the available training examples (orange dots) are adopted for training at the beginning of the training cycle j. Second, as the training goes on, we gradually discard training examples (gray) to save training computation”);
updating the partial training dataset for every epoch by removing a number of the training samples from the partial training dataset and adding the removed training samples to the removed dataset (Cheng, p. 4, col. 2, paragraph 4, “subsample examples every I epochs”, Figure 1 shows data dropped per epoch, “as the training goes on, we gradually discard training examples (gray)”), wherein for every epoch, a current partial training dataset is updated by removing an easiest percentage of the training sample from the partial training dataset and adding them to the removed dataset (the easiest percentage of examples, those not kept, is removed, Cheng, p. 4, col. 2, paragraph 4, “keep p% of the current examples”, Cheng, p. 7, col. 2, ¶2, “progressively drop easy examples”) to reduce training floating point operations per second (FLOPs) to accelerate training of the electronic sparse network (Cheng, p. 2, col. 1, paragraph 1, “our proposed method only requires 83%, 92%, and 86% training computation to achieve comparable or better results”); and
retrieving the same number of removed training samples from the removed dataset and adding the retrieved training samples back to the partial training dataset to keep the total number of training samples in the partial training dataset unchanged (Every cycle removed training samples are added back, Cheng, p. 4, col. 2, paragraph 5, “we reuse all training examples (the “refresh” stage) followed by another “drop” stage”, p. 2, Figure 1 description, “all the discarded samples are refreshed for training when a new cycle k is started”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the data-sieving Drop-and-Refresh method taught by Cheng with the network training taught by Brock. The motivation to do so would be to save training computation while retaining performance (Cheng, p. 2, col. 2, paragraph 2, “We present an efficient sampling method named Drop-and-Refresh (DaR) to sample a subset of non-i.i.d. training data that can save training computation while retaining the performance”).
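For illustration only, the Drop-and-Refresh sieving that Cheng describes (drop the easiest, lowest-loss samples each interval; periodically refresh by restoring all discarded samples) can be sketched as follows. This is a non-authoritative sketch; the function and variable names are assumptions, not Cheng's code.

```python
def sieve(partial, full, losses, keep_rate, refresh):
    """One Drop-and-Refresh step in the style Cheng describes (sketch).

    Drop stage: keep only the hardest keep_rate fraction (highest loss),
    implicitly moving the easiest samples to the removed set.
    Refresh stage: all discarded samples return to training.
    """
    if refresh:
        return sorted(full)  # "refresh" stage: reuse all training examples
    # "drop" stage: rank current samples by loss, hardest first
    ranked = sorted(partial, key=lambda s: losses[s], reverse=True)
    n_keep = max(1, int(len(ranked) * keep_rate))
    return ranked[:n_keep]
```

With a keep rate of 50%, the easiest half of the current partial dataset is dropped each sieving step, and a later refresh restores the full dataset.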
Regarding Claim 2, Brock in view of Yuan and Cheng teaches the framework of Claim 1 as referenced above. In the combination as set forth above in Claim 1, Cheng teaches:
obtain the partial training dataset by randomly removing a percentage of training samples from a whole training dataset (Creating the partial dataset by removing samples based on lowest losses, which depend on random augmentations, p. 2, Figure 1 description, “Samples are removed by dropping examples with lowest losses and the discarded samples might be different in different training cycles”, p. 5, col. 2, paragraph 3, “We adopt standard data augmentations (random crop and random flip)”, p. 4, col. 2, paragraph 4, “a keep rate (p)… keep p% of the current examples”).
Regarding Claim 3, Brock in view of Yuan and Cheng teaches the framework of Claim 1 as referenced above. Brock further teaches:
wherein a layer is frozen only if all the layers in front of the layer are frozen (Brock, pg. 1, 4th paragraph, “where the first layer’s learning rate is reduced to zero partway through training (at t0), and each subsequent layer’s learning rate is annealed to zero some set time thereafter.”, Brock pg. 2, Fig 1, the layers reach 0% of initial learning rate in sequential order).
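For illustration only, Brock's layer-wise schedule (each layer's learning rate annealed to zero at linearly spaced times starting at t0) can be sketched as follows. This is a non-authoritative sketch; the function names and the linear spacing parameterization are assumptions.

```python
def freeze_schedule(num_layers, total_iters, t0_frac=0.5):
    """Per-layer freeze iterations, linearly spaced from t0 to the end
    of training, mirroring the layer-wise schedule Brock describes."""
    t0 = total_iters * t0_frac
    step = (total_iters - t0) / num_layers
    return [int(t0 + i * step) for i in range(num_layers)]

def trainable_layers(freeze_iters, it):
    """Layers still included in the backward pass at iteration `it`.
    Because freeze times are non-decreasing, a layer is frozen only
    after every layer in front of it has been frozen."""
    return [i for i, t in enumerate(freeze_iters) if it < t]
```

At iteration 0 all layers train; after the first layer's freeze time only the later layers remain in the backward pass, matching the sequential freezing order shown in Brock's Figure 1.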
Regarding Claim 4, Brock in view of Yuan and Cheng teaches the framework of Claim 1 as referenced above. Brock further teaches:
wherein the sparse structure and weight values of the frozen layers remain unchanged (Brock, pg. 1, 4th paragraph, “learning rate is annealed to zero”, and “Once a layer’s learning rate reaches zero, we put it in inference mode and exclude it from all future backward passes”, Weight values are unchanged because the frozen layers are excluded from backwards passes).
Regarding Claim 5, Brock in view of Yuan and Cheng teaches the framework of Claim 4 as referenced above. Brock further teaches:
wherein all gradients of weights and gradients of activations in the frozen layers are eliminated (Brock, pg. 1, 4th paragraph, “learning rate is annealed to zero”, and “Once a layer’s learning rate reaches zero, we put it in inference mode and exclude it from all future backward passes”, Frozen layers are excluded from backwards passes therefore gradients are not adjusted, effectively eliminating them).
Regarding Claim 6, Brock in view of Yuan and Cheng teaches the framework of Claim 1 as referenced above. Cheng further teaches:
wherein the data sieving decreases a number of training iterations in each epoch (Dropping samples decreases the number of mini-batches which lowers iterations per epoch, p. 2, col. 1, paragraph 2, “As the training process goes on, we gradually feed the network with only a subset of the training examples”, p. 4, col. 2, paragraph 5, “we keep p% of the hardest examples every I epochs (the “drop” stage)”, p. 4, Algorithm 1, “for every batch b do… backward(Lb)… Update model weights”, p. 2, col. 2, paragraph 1, “our proposed method only requires 83%, 92%, and 86% training computation”).
Regarding Claim 7, Brock in view of Yuan and Cheng teaches the framework of Claim 1 as referenced above. Cheng further teaches:
wherein the data sieving comprises circular data sieving (p. 2, col. 1, paragraph 2, “In each cycle, the network first learns from all training examples… we gradually feed the network with only a subset of the training examples… we periodically repeat this training cycle where all training data is used again and restart the process of reducing training data for each training epoch”).
Regarding Claim 9, Brock in view of Yuan and Cheng teaches the framework of Claim 1 as referenced above. Yuan further teaches:
wherein the processor is configured to actively train the layers by applying Dynamic Sparse Training (DST) from Memory-Economic Sparse Training (MEST) (Yuan, pg. 4, 3rd paragraph, “the MEST method (vanilla) to periodically remove less important non-zero weights from the sparse model and grow zero weights back during the sparse training process, which we call mutation”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Yuan, training the layers with Dynamic Sparse Training from Memory-Economic Sparse Training, with the network training taught by Brock. The modification would have been motivated because dynamic sparse training helps with the resource limitations of edge devices while achieving high sparse training acceleration (Yuan, pg. 4, 3rd paragraph, “MEST framework is designed for the following objectives: 1) towards end-to-end memory-economic training by considering the resource limitation of edge devices; 2) Exploiting sparsity schemes to achieve high sparse training acceleration while maintaining high accuracy”).
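For illustration only, the MEST-style "mutation" step quoted above (periodically remove less important non-zero weights and grow zero weights back) can be sketched as follows. This is a non-authoritative sketch of the quoted description; the function name, mask representation, and random growth criterion are assumptions, not Yuan's code.

```python
import random

def mest_mutation(weights, mask, prune_frac, rng=None):
    """One prune-and-grow ("mutation") step sketched from Yuan's
    description: drop the smallest-magnitude fraction of active weights
    and grow an equal number of zero weights back, so the overall
    sparsity level is unchanged."""
    rng = rng or random.Random(0)
    active = [i for i, m in enumerate(mask) if m]
    inactive = [i for i, m in enumerate(mask) if not m]
    n = int(len(active) * prune_frac)
    # remove less important (smallest |w|) non-zero weights
    for i in sorted(active, key=lambda i: abs(weights[i]))[:n]:
        mask[i] = False
        weights[i] = 0.0
    # grow the same number of zero weights back
    for i in rng.sample(inactive, min(n, len(inactive))):
        mask[i] = True
    return mask
```

Because one weight is grown for every weight pruned, the number of active weights, and hence the sparse structure's density, is preserved across mutation steps.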
Regarding Claim 10, Brock in view of Yuan and Cheng teaches the framework of Claim 9 as referenced above. Brock in view of Yuan further teaches:
wherein the processor is configured to combine a layer freezing interval (Brock, pg. 1, 4th paragraph, “a layer-wise schedule, where the first layer’s learning rate is reduced to zero partway through training (at t0), and each subsequent layer’s learning rate is annealed to zero some set time thereafter”) with a DST interval (Yuan, pg. 4, 3rd paragraph, “the MEST method (vanilla) to periodically remove less important non-zero weights from the sparse model and grow zero weights back during the sparse training process”).
Regarding Claim 11, Brock teaches:
A method of using a framework having a processor (Brock, pg. 2, 3rd paragraph, “FreezeOut is easy to implement in a dynamic graph framework, requiring approximately 15 unique lines of code in PyTorch”, pg. 3, 3rd paragraph, “For the speedups presented here, we use the values attained by running each test sweep in a controlled environment, on a GTX1080Ti with no other programs running.” Brock performs their method on a computer, in which processor, memory, and storage devices are inherent) configured to train an electronic… network having a sparse structure, the method comprising:
actively training all layers of the… network using a… training dataset comprised of training samples (Brock, pg. 1, 2nd paragraph, “training a network by only training each layer for a set portion of the training schedule”, Brock, pg. 3, Fig 2, Model is trained on training samples from CIFAR-100 dataset);
progressively freezing the layers of the sparse network in a sequential manner to obtain a trained sparse network (Brock, pg. 1, 2nd paragraph, “progressively "freezing out" layers and excluding them from the backward pass.”).
Brock does not expressly teach
a sparse network
using a partial training dataset
initializing the sparse network
data sieving the training samples to update the partial training dataset
randomly selecting a percentage of total training samples of a training dataset to create the partial training dataset and a removed dataset;
training the sparse network for every epoch using only the partial training dataset and not the removed dataset;
updating the partial training dataset for every epoch by removing a number of the training samples from the partial training dataset and adding the removed training samples to the removed dataset, wherein for every epoch, a current partial training dataset is updated by removing an easiest percentage of the training sample from the partial training dataset and adding them to the removed dataset to reduce training floating point operations per second (FLOPs) to accelerate training of the electronic sparse network; and
retrieving the same number of removed training samples from the removed dataset and adding the retrieved training samples back to the partial training dataset to keep the total number of training samples in the partial training dataset unchanged.
However, Yuan teaches
a sparse network and initializing the sparse network (Yuan, pg. 2, 1st paragraph, “they start with a sparse model structure picked intuitively… explore various sparse topologies”);
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Yuan, starting with a sparse network, with the layer freezing taught by Brock. The modification would have been motivated because starting with a sparse network makes training of the network more efficient (Yuan, pg. 2, 1st paragraph, “The underlying principle of sparse training is that the total epoch number is the same as dense training, but the speed of each training iteration (batch) is significantly improved”).
Brock in view of Yuan does not expressly teach
using a partial training dataset
data sieving the training samples to update the partial training dataset
randomly selecting a percentage of total training samples of a training dataset to create the partial training dataset and a removed dataset;
training the sparse network for every epoch using only the partial training dataset and not the removed dataset;
updating the partial training dataset for every epoch by removing a number of the training samples from the partial training dataset and adding the removed training samples to the removed dataset, wherein for every epoch, a current partial training dataset is updated by removing an easiest percentage of the training sample from the partial training dataset and adding them to the removed dataset to reduce training floating point operations per second (FLOPs) to accelerate training of the electronic sparse network; and
retrieving the same number of removed training samples from the removed dataset and adding the retrieved training samples back to the partial training dataset to keep the total number of training samples in the partial training dataset unchanged.
However, Cheng teaches
using a partial training dataset (Partial dataset is non discarded samples, Cheng, p. 2, Figure 1 description, “all the available training examples (orange dots) are adopted for training… as the training goes on, we gradually discard training examples”)
data sieving the training samples to update the partial training dataset (Cheng, p. 1, Abstract, “selectively drop easy samples and refresh them only periodically”)
randomly (Creating the datasets by removing samples is based on lowest losses, which depend on random augmentations, Cheng, p. 2, Figure 1 description, “Samples are removed by dropping examples with lowest losses and the discarded samples might be different in different training cycles”, p. 5, col. 2, paragraph 3, “We adopt standard data augmentations (random crop and random flip)”) selecting a percentage of total training samples of a training dataset to create the partial training dataset and a removed dataset (removed dataset is examples not kept, Cheng, p. 4, col. 2, paragraph 4, “a keep rate (p)… keep p% of the current examples”);
training the sparse network for every epoch using only the partial training dataset and not the removed dataset (Cheng, p. 2, Figure 1 description, “First, all the available training examples (orange dots) are adopted for training at the beginning of the training cycle j. Second, as the training goes on, we gradually discard training examples (gray) to save training computation”);
updating the partial training dataset for every epoch by removing a number of the training samples from the partial training dataset and adding the removed training samples to the removed dataset (Cheng, p. 4, col. 2, paragraph 4, “subsample examples every I epochs”, Figure 1 shows data dropped per epoch, “as the training goes on, we gradually discard training examples (gray)”), wherein for every epoch, a current partial training dataset is updated by removing an easiest percentage of the training sample from the partial training dataset and adding them to the removed dataset (the easiest percentage of examples, those not kept, is removed, Cheng, p. 4, col. 2, paragraph 4, “keep p% of the current examples”, Cheng, p. 7, col. 2, ¶2, “progressively drop easy examples”) to reduce training floating point operations per second (FLOPs) to accelerate training of the electronic sparse network (Cheng, p. 2, col. 1, paragraph 1, “our proposed method only requires 83%, 92%, and 86% training computation to achieve comparable or better results”); and
retrieving the same number of removed training samples from the removed dataset and adding the retrieved training samples back to the partial training dataset to keep the total number of training samples in the partial training dataset unchanged (Every cycle removed training samples are added back, Cheng, p. 4, col. 2, paragraph 5, “we reuse all training examples (the “refresh” stage) followed by another “drop” stage”, p. 2, Figure 1 description, “all the discarded samples are refreshed for training when a new cycle k is started”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the data-sieving Drop-and-Refresh method taught by Cheng with the network training taught by Brock. The motivation to do so would be to save training computation while retaining performance (Cheng, p. 2, col. 2, paragraph 2, “We present an efficient sampling method named Drop-and-Refresh (DaR) to sample a subset of non-i.i.d. training data that can save training computation while retaining the performance”).
Regarding Claim 12, the rejection of Claim 11 is incorporated and further, the claim is rejected for the same reasons as set forth in Claim 2.
Regarding Claim 13, the rejection of Claim 11 is incorporated and further, the claim is rejected for the same reasons as set forth in Claim 3.
Regarding Claim 14, the rejection of Claim 11 is incorporated and further, the claim is rejected for the same reasons as set forth in Claim 4.
Regarding Claim 15, the rejection of Claim 14 is incorporated and further, the claim is rejected for the same reasons as set forth in Claim 5.
Regarding Claim 16, the rejection of Claim 11 is incorporated and further, the claim is rejected for the same reasons as set forth in Claim 6.
Regarding Claim 17, the rejection of Claim 11 is incorporated and further, the claim is rejected for the same reasons as set forth in Claim 7.
Regarding Claim 18, Brock teaches:
A non-transitory computer readable medium storing program code, which when executed, is operative to cause a processor (Brock, pg. 2, 3rd paragraph, “FreezeOut is easy to implement in a dynamic graph framework, requiring approximately 15 unique lines of code in PyTorch”, pg. 3, 3rd paragraph, “For the speedups presented here, we use the values attained by running each test sweep in a controlled environment, on a GTX1080Ti with no other programs running.” Brock performs their method on a computer, in which processor, memory, and storage devices are inherent) of a framework to train an electronic… network having a sparse structure, the method comprising:
actively training all layers of the… network using a… training dataset comprised of training samples (Brock, pg. 1, 2nd paragraph, “training a network by only training each layer for a set portion of the training schedule”, Brock, pg. 3, Fig 2, Model is trained on training samples from CIFAR-100 dataset);
progressively freezing the layers of the sparse network in a sequential manner to obtain a trained sparse network (Brock, pg. 1, 2nd paragraph, “progressively "freezing out" layers and excluding them from the backward pass.”).
Brock does not expressly teach
a sparse network
using a partial training dataset
initializing the sparse network
data sieving the training samples to update the partial training dataset
randomly selecting a percentage of total training samples of a training dataset to create the partial training dataset and a removed dataset;
training the sparse network for every epoch using only the partial training dataset and not the removed dataset;
updating the partial training dataset for every epoch by removing a number of the training samples from the partial training dataset and adding the removed training samples to the removed dataset, wherein for every epoch, a current partial training dataset is updated by removing an easiest percentage of the training sample from the partial training dataset and adding them to the removed dataset to reduce training floating point operations per second (FLOPs) to accelerate training of the electronic sparse network; and
retrieving the same number of removed training samples from the removed dataset and adding the retrieved training samples back to the partial training dataset to keep the total number of training samples in the partial training dataset unchanged.
However, Yuan teaches
a sparse network and initializing the sparse network (Yuan, pg. 2, 1st paragraph, “they start with a sparse model structure picked intuitively… explore various sparse topologies”);
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Yuan, starting with a sparse network, with the layer freezing taught by Brock. The modification would have been motivated because starting with a sparse network makes training of the network more efficient (Yuan, pg. 2, 1st paragraph, “The underlying principle of sparse training is that the total epoch number is the same as dense training, but the speed of each training iteration (batch) is significantly improved”).
Brock in view of Yuan does not expressly teach
using a partial training dataset
data sieving the training samples to update the partial training dataset
randomly selecting a percentage of total training samples of a training dataset to create the partial training dataset and a removed dataset;
training the sparse network for every epoch using only the partial training dataset and not the removed dataset;
updating the partial training dataset for every epoch by removing a number of the training samples from the partial training dataset and adding the removed training samples to the removed dataset, wherein for every epoch, a current partial training dataset is updated by removing an easiest percentage of the training sample from the partial training dataset and adding them to the removed dataset to reduce training floating point operations per second (FLOPs) to accelerate training of the electronic sparse network; and
retrieving the same number of removed training samples from the removed dataset and adding the retrieved training samples back to the partial training dataset to keep the total number of training samples in the partial training dataset unchanged.
However, Cheng teaches
using a partial training dataset (Partial dataset is non discarded samples, Cheng, p. 2, Figure 1 description, “all the available training examples (orange dots) are adopted for training… as the training goes on, we gradually discard training examples”)
data sieving the training samples to update the partial training dataset (Cheng, p. 1, Abstract, “selectively drop easy samples and refresh them only periodically”)
randomly (Creating the datasets by removing samples is based on lowest losses, which depend on random augmentations, Cheng, p. 2, Figure 1 description, “Samples are removed by dropping examples with lowest losses and the discarded samples might be different in different training cycles”, p. 5, col. 2, paragraph 3, “We adopt standard data augmentations (random crop and random flip)”) selecting a percentage of total training samples of a training dataset to create the partial training dataset and a removed dataset (removed dataset is examples not kept, Cheng, p. 4, col. 2, paragraph 4, “a keep rate (p)… keep p% of the current examples”);
training the sparse network for every epoch using only the partial training dataset and not the removed dataset (Cheng, p. 2, Figure 1 description, “First, all the available training examples (orange dots) are adopted for training at the beginning of the training cycle j. Second, as the training goes on, we gradually discard training examples (gray) to save training computation”);
updating the partial training dataset for every epoch by removing a number of the training samples from the partial training dataset and adding the removed training samples to the removed dataset (Cheng, p. 4, col. 2, paragraph 4, “subsample examples every I epochs”, Figure 1 shows data dropped per epoch, “as the training goes on, we gradually discard training examples (gray)”), wherein for every epoch, a current partial training dataset is updated by removing an easiest percentage of the training sample from the partial training dataset and adding them to the removed dataset (the easiest percentage of examples, those not kept, is removed, Cheng, p. 4, col. 2, paragraph 4, “keep p% of the current examples”, Cheng, p. 7, col. 2, ¶2, “progressively drop easy examples”) to reduce training floating point operations per second (FLOPs) to accelerate training of the electronic sparse network (Cheng, p. 2, col. 1, paragraph 1, “our proposed method only requires 83%, 92%, and 86% training computation to achieve comparable or better results”); and
retrieving the same number of removed training samples from the removed dataset and adding the retrieved training samples back to the partial training dataset to keep the total number of training samples in the partial training dataset unchanged (Every cycle removed training samples are added back, Cheng, p. 4, col. 2, paragraph 5, “we reuse all training examples (the “refresh” stage) followed by another “drop” stage”, p. 2, Figure 1 description, “all the discarded samples are refreshed for training when a new cycle k is started”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the data-sieving Drop-and-Refresh method taught by Cheng with the network training taught by Brock. The motivation to do so would be to save training computation while retaining performance (Cheng, p. 2, col. 2, paragraph 2, “We present an efficient sampling method named Drop-and-Refresh (DaR) to sample a subset of non-i.i.d. training data that can save training computation while retaining the performance”).
Regarding Claim 19, the rejection of Claim 18 is incorporated and further, the claim is rejected for the same reasons as set forth in Claim 2.
Regarding Claim 20, the rejection of Claim 18 is incorporated and further, the claim is rejected for the same reasons as set forth in Claim 3.
Response to Arguments
35 U.S.C. § 103
Argument 1: Roberts fails to teach training the network with only the partial dataset and not the removed dataset for every epoch, because Roberts uses all the dataset including the removed dataset to train the model.
Examiner Response: Cheng discloses this limitation. Claim 1 recites, “training the sparse network for every epoch using only the partial training dataset and not the removed dataset”. Cheng teaches training every epoch using only the partial training dataset, where the partial training dataset is defined as the dataset of kept training samples that are not discarded (Cheng, p. 2, Figure 1 description, “First, all the available training examples (orange dots) are adopted for training at the beginning of the training cycle j. Second, as the training goes on, we gradually discard training examples (gray) to save training computation” and Figure 1, see the training samples used for training (orange dots) for every epoch τj). As shown in Figure 1, Cheng does not use the removed dataset of discarded training examples for training in any epoch.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JESSE CHEN COULSON whose telephone number is (571)272-4716. The examiner can normally be reached Monday-Friday 8:30-5:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki can be reached at (571) 272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/JESSE C COULSON/
Examiner, Art Unit 2122
/KAKALI CHAKI/Supervisory Patent Examiner, Art Unit 2122