Last updated: May 29, 2026
Application No. 18/157,476
Method and System for Improving Continual Learning Through Error Sensitivity Modulation

Non-Final OA §101§103
Filed
Jan 20, 2023
Priority
Sep 27, 2022 — NL 2033154 +1 more
Examiner
SPRAUL III, VINCENT ANTON
Art Unit
2129
Tech Center
2100 — Computer Architecture & Software
Assignee
Navinfo Europe B V
OA Round
1 (Non-Final)
This examiner grants 60% of cases after interview

— +26.7% interview lift. A telephonic interview to clarify the technical implementation could significantly improve the outcome.
Based on 37 resolved cases, 2023–2026
Examiner Intelligence

SPRAUL III, VINCENT ANTON View full profile →
Grants 60% of resolved cases
Career Allowance Rate
22 granted / 37 resolved
+4.5% vs TC avg
Strong +27% interview lift
Without
With
+26.7%
Interview Lift
resolved cases with interview
Typical timeline
4y 4m
Avg Prosecution
19 currently pending
Career history
Total Applications
across all art units
Statute-Specific Performance

§101
2.8%
-37.2% vs TC avg
§103
93.8%
+53.8% vs TC avg
§102
0.6%
-39.4% vs TC avg
§112
1.7%
-38.3% vs TC avg
Black line = Tech Center average estimate • Based on career data from 37 resolved cases
Office Action

§101 §103
Notice of Pre-AIA  or AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .

Specification
The disclosure is objected to because of the following informalities. The specification provides mathematical formulas, both individually, as labelled in the specification with the marks (1) through (8), and also in the pseudo-code listing marked Algorithm 1. These formulas appear to be a grayscale reproduction of a color original, which has limited the legibility of the text, in particular, superscripted and subscripted character marks. Examiner respectfully suggests reproducing all formulas in black-and-white to improve their legibility.
Appropriate correction is required.

Drawings
The drawings are objected to because Fig. 1 appears to be a grayscale reproduction of a color original, which has limited the legibility of the figure, in particular the text. Examiner respectfully suggests rendering the figure in black-and-white. Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. The figure or figure number of an amended drawing should not be labeled as “amended.” If a drawing figure is to be canceled, the appropriate figure must be removed from the replacement sheet, and where necessary, the remaining figures must be renumbered and appropriate changes made to the brief description of the several views of the drawings for consistency. Additional replacement sheets may be necessary to show the renumbering of the remaining figures. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.

Claim Objections
Claim 11 objected to because of the following informality. Examiner respectfully suggests that the phrase “samples of a current task samples” should be written either as “samples of a current task ” or “.” Appropriate correction is required.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claim 15 rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter. Neither the claim nor the specification limit the “computer-readable medium” of the claim to non-transitory forms. The broadest reasonable interpretation of the phrase therefore includes signals per se, which do not fall within at least one of the four categories of patent eligible subject matter. The claim is therefore rejected under 35 U.S.C. 101.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1-5, 8-9, 11, and 15-17 rejected under 35 U.S.C. 103 over Volpi et al., US Pre-Grant Publication No. 2023/0082941 (hereafter Volpi) in view of Tarvainen et al., “Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results,” 2018, arXiv:1703.01780v6 (hereafter Tarvainen) and Kar et al., US Pre-Grant Publication No. 2020/0160178 (hereafter Kar). 

Regarding claim 1:
Volpi teaches:
“A computer-implemented method for continual learning in an artificial neural network comprising the steps of”: Volpi, paragraph 0001, “The present disclosure relates generally to machine learning, and more particularly to methods and systems for online continual learning methods for neural networks.”
“providing input samples in form of continuous data stream containing a sequence of tasks”: Volpi, paragraph 0004, “It is useful in many applications for a neural network-based model to be able to learn online after deployment using new data streams, such as those collected or generated from real-world data [input samples in form of continuous data stream]. This learning, referred to as online continual learning, is distinct from offline learning scenarios such as controlled, supervised learning scenarios that make deep learning effective”; Volpi, paragraph 0018, “In embodiments, the self-supervised learning objective is jointly optimized with a supervised objective in a multi-task setting [containing a sequence of tasks]; and the memory is shared by the self-supervised learning objective and the supervised objective.”
“training a working model of the network by using the input samples”: Volpi, paragraph 0005, “The method for samples received in the stream of samples for updating the machine learning model comprises: accessing from the memory a set of previous samples for training the machine learning model for performing the task; defining a set of combined samples that includes a sample received from the stream of samples and the set of previous samples accessed from the memory; training the machine learning model using the set of combined samples, the training the machine learning model defining an embedding space with the set of combined samples [training a working model of the network by using the input samples]; determining whether to store or not store the sample received from the stream of samples in the memory with the set of previous samples based on distances between samples in the set of combined samples in the  embedding space; and storing in the memory, with the set of previous samples, the sample received from the stream of samples when said determining determines to store the sample received from the stream of samples.”
“maintaining a fixed-size episodic memory by storing input samples from previous tasks in the episodic memory for consolidating knowledge through interleaved learning of samples from previous tasks”: Volpi, paragraph 0005, “The method for samples received in the stream of samples for updating the machine learning model comprises: accessing from the memory a set of previous samples for training the machine learning model for performing the task; defining a set of combined samples that includes a sample received from the stream of samples and the set of previous samples accessed from the memory; training the machine learning model using the set of combined samples, the training the machine learning model defining an embedding space with the set of combined samples; determining whether to store or not store the sample received from the stream of samples in the memory with the set of previous samples based on distances between samples in the set of combined samples in the  embedding space; and storing in the memory, with the set of previous samples, the sample received from the stream of samples when said determining determines to store the sample received from the stream of samples [episodic memory by storing input samples from previous tasks in the episodic memory for consolidating knowledge through interleaved learning of samples from previous tasks]”; Volpi, paragraph 0087, “It is further assumed that the model has access to a memory M of up to N samples. For illustrating an example embodiment, and without loss of generality, it can be further assumed that the memory bank (or memory buffer) is partitioned in K buckets of size N/K [fixed-size].”
(bold only) “and maintaining a memory of errors during training by calculating a supervised loss on the input samples and a current task and updating the memory of errors with an exponential moving average of the cross-entropy loss”: Volpi, paragraph 0086, “The task T can be considered as a classification task over K classes. An example model to be trained is a neural network trained in a supervised way using a cross-entropy loss via backpropagation [calculating a supervised loss on the input samples and a current task … with … cross-entropy loss] (as shown in FIG. 1 at 114).”
Volpi does not explicitly teach:
“maintaining a stable model comprising a long-term semantic memory for building consolidated structural representations by progressively aggregating synaptic weights of the working model during training”
“and maintaining a memory of errors during training by calculating a supervised loss on the input samples and a current task and updating the memory of errors with an exponential moving average of the cross-entropy loss”
Tarvainen teaches “maintaining a stable model comprising a long-term semantic memory for building consolidated structural representations by progressively aggregating synaptic weights of the working model during training”: Tarvainen, section 2, paragraph 1, “To overcome the limitations of Temporal Ensembling, we propose averaging model weights instead of predictions. Since the teacher model is an average of consecutive student models, we call this the Mean Teacher method (Figure 2). Averaging model weights over training steps tends to produce a more accurate model than using the final weights directly [19] [maintaining a stable model comprising a long-term semantic memory for building consolidated structural representations by progressively aggregating synaptic weights of the working model during training]. We can take advantage of this during training to construct better targets. Instead of sharing the weights with the student model, the teacher model uses the EMA weights of the student model. Now it can aggregate information after every step instead of every epoch.”
Tarvainen and Volpi are analogous arts as they are both related to machine learning model training methods. It would have been obvious to a person having ordinary skill in the art prior to the effective filing date of the claimed invention to have combined the synaptic weight aggregation of Tarvainen with the teachings of Volpi to arrive at the present invention, in order to improve model training, as stated in Tarvainen, section 2, paragraph 1, “We can take advantage of this during training to construct better targets. Instead of sharing the weights with the student model, the teacher model uses the EMA weights of the student model. Now it can aggregate information after every step instead of every epoch.”
	Kar teaches (bold only) “and maintaining a memory of errors during training by calculating a supervised loss on the input samples and a current task and updating the memory of errors with an exponential moving average of the cross-entropy loss”: Kar, paragraphs 0043-0044, “Now referring to FIG. 2C, FIG. 2C is an example illustration of a process 200C for training a distribution transformer model for computing transformed scene graphs fine-tuned for a task of a downstream task network, in accordance with some embodiments of the present disclosure. For example, a second objective of the distribution transformer 108 may be to generate data from the probabilistic grammar 102, P, such that a model trained on this data achieves best performance when tested on a target validation set of real-world data, V. This may be referred to as a meta-objective, where the input data may be optimized to improve accuracy on the validation set, V. This meta-objective may be tuned to a task network that is trained for a particular downstream task […] The task loss in equation (3) may not be differentiable with respect to the parameters, Θ, since the score is measured using validation data and not S’. A reinforce score function estimator, or another unbiased estimator of the gradient, may be used to compute the gradients of equation (3). Reformulating the objective as a loss [calculating a supervised loss on the input samples] and writing the gradient yields equation ( 4), below:

    PNG
    media_image1.png
    75
    433
    media_image1.png
    Greyscale

To reduce the variance of the gradient from the estimator, an exponential moving average of previous scores may be tracked and subtracted from a current score [maintaining a memory of errors during training and updating the memory of errors with an exponential moving average]. This expectation may be approximated using one sample from GΘ(S).”
Kar and Volpi are analogous arts as they are both related to machine learning model training. It would have been obvious to a person having ordinary skill in the art prior to the effective filing date of the claimed invention to have combined the moving average of loss scores of Kar with the teachings of Volpi to arrive at the present invention, in order to reduce variance, as stated in Kar, paragraph 0044, “To reduce the variance of the gradient from the estimator, an exponential moving average of previous scores may be tracked and subtracted from a current score.”

Regarding claim 2:
Volpi as modified by Tarvainen and Kar teaches “The computer-implemented method according to claim 1.”
Volpi further teaches “wherein the step of maintaining a fixed size episodic memory comprises the step of employing reservoir sampling and a joint distribution of data stream is approximated by assigning to each sample in the data stream an equal probability of being represented in the episodic memory”: Volpi, paragraph 0093, “For instance, in the reservoir sampling method [employing reservoir sampling] shown in FIG. 2B, sample k can be a random integer such that k ∈ T. If k <= N the kth sample is swapped with the new sample (x,, y,), but if k>N, it is not. With this method, the probability of replacing a sample decays linearly in accordance with the time step t<T. This sampling technique ensures that each sample has an equal probability of having been present in the reservoir (N-dimensional memory 204) at any time t [a joint distribution of data stream is approximated by assigning to each sample in the data stream an equal probability of being represented in the episodic memory].”

Regarding claim 3:
Volpi as modified by Tarvainen and Kar teaches “The computer-implemented method according to claim 1.”
Tarvainen further teaches “wherein the step of maintaining a stable model comprising a long-term semantic memory comprises the step of initializing the semantic memory with weights of the working memory”: , Fig. 2 caption, “Both the student and the teacher model evaluate the input applying noise (η, η’) within their computation. The softmax output of the student model is compared with the one-hot label using classification cost and with the teacher output using consistency cost. After the weights of the student model have been updated with gradient descent, the teacher model weights are updated as an exponential moving average of the student weights [initializing the semantic memory with weights of the working memory]. Both model outputs can be used for prediction, but at the end of the training the teacher prediction is more likely to be correct. A training step with an unlabeled example would be similar, except no classification cost would be applied.”
Tarvainen and Volpi are combinable for the rationale given under claim 1.

Regarding claim 4:
Volpi as modified by Tarvainen and Kar teaches “The computer-implemented method according to claim 1.”
Tarvainen further teaches “wherein the step of maintaining a stable model comprising a long-term semantic memory comprises the step of stochastically updating the semantic memory using an exponentially moving average of weights of the working memory”: Tarvainen, section B.2.2., paragraph 3, “We trained the network using stochastic gradient descent [stochastically updating the semantic memory] with maximum learning rate 0.25 and Nesterov momentum 0.9”; Tarvainen, section 2, paragraph 1, “To overcome the limitations of Temporal Ensembling, we propose averaging model weights instead of predictions. Since the teacher model is an average of consecutive student models, we call this the Mean Teacher method (Figure 2). Averaging model weights over training steps tends to produce a more accurate model than using the final weights directly [19]. We can take advantage of this during training to construct better targets. Instead of sharing the weights with the student model, the teacher model uses the EMA weights of the student model [using an exponentially moving average of weights of the working memory]. Now it can aggregate information after every step instead of every epoch.”
Tarvainen and Volpi are combinable for the rationale given under claim 1.

Regarding claim 5:
Volpi as modified by Tarvainen and Kar teaches “The computer-implemented method of claim 1.”
Kar further teaches “further comprising the step of determining a degree of contribution of each input sample towards learning by calculating a cross-entropy loss for each input sample on the stable model for evaluating a weight given to each sample by calculating a distance between the cross-entropy loss of the input samples and mean statistics of the memory of errors, such as the exponentially moving average of the cross-entropy loss in the error memory”: Kar, paragraph 0035, “The input feature of each node may be its attribute set, SA, which may be defined consistently across all nodes. Since SA may be composed of different categorical and continuous components, appropriate losses may be used per feature component when training using per node reconstruction loss 206. For example, and without limitation, cross-entropy loss [cross-entropy loss] may be used for categorical attributes while L1 loss may be used for continuous attributes”; Kar, paragraphs 0043-0044, “Now referring to FIG. 2C, FIG. 2C is an example illustration of a process 200C for training a distribution transformer model for computing transformed scene graphs fine-tuned for a task of a downstream task network, in accordance with some embodiments of the present disclosure. For example, a second objective of the distribution transformer 108 may be to generate data from the probabilistic grammar 102, P, such that a model trained on this data achieves best performance when tested on a target validation set of real-world data, V. This may be referred to as a meta-objective, where the input data may be optimized to improve accuracy on the validation set, V. This meta-objective may be tuned to a task network that is trained for a particular downstream task […] The task loss in equation (3) may not be differentiable with respect to the parameters, Θ, since the score is measured using validation data and not S’. A reinforce score function estimator, or another unbiased estimator of the gradient, may be used to compute the gradients of equation (3). Reformulating the objective as a loss and writing the gradient yields equation ( 4), below:

    PNG
    media_image1.png
    75
    433
    media_image1.png
    Greyscale

To reduce the variance of the gradient from the estimator, an exponential moving average of previous scores may be tracked and subtracted from a current score [calculating a distance between the cross-entropy loss of the input samples and mean statistics of the memory of errors, such as the exponentially moving average of the cross-entropy loss in the error memory]. This expectation may be approximated using one sample from GΘ(S).”
	Kar and Volpi are combinable for the rationale given under claim 1.

Regarding claim 8:
Volpi as modified by Tarvainen and Kar teaches “The computer-implemented method of claim 1.”
Volpi further teaches (bold only) “implementing a dual memory replay mechanism wherein the stable model is configured for extracting semantic information from samples of the episodic memory”: Volpi, paragraph 0070, “The incoming stream of samples can also be stacked at stack 106 with a set of previous samples accessed from the current contents of the memory 104 at each time step t to form a set of combined samples, e.g., a training mini-batch, for a neural network model 108 implemented by the processor, which model 108 can include feature extraction 110 and logit layer blocks. The space where vectors from the feature extraction 110 lie is referred to as an embedding space. The embedding space represents high dimensional data (e.g., text, images, items) in a low-dimensional representation ( e.g., using real vectors) [wherein the … model is configured for extracting semantic information from samples of the episodic memory]. That is, the embedding space is a multi-dimensional space where the vectors produced by the feature extractor exist.”
Tarvainen further teaches :
“implementing a dual memory replay mechanism wherein the stable model is configured for extracting semantic information from samples of the episodic memory”: Tarvainen, section 2, paragraph 1, “To overcome the limitations of Temporal Ensembling, we propose averaging model weights instead of predictions. Since the teacher model is an average of consecutive student models, we call this the Mean Teacher method (Figure 2). Averaging model weights over training steps tends to produce a more accurate model than using the final weights directly [19] [the stable model]. We can take advantage of this during training to construct better targets. Instead of sharing the weights with the student model, the teacher model uses the EMA weights of the student model. Now it can aggregate information after every step instead of every epoch.”
“and enforcing consistency in a functional space by using relational knowledge encoded in output logits”: Tarvainen, section 2, paragraph 2, “More formally, we define the consistency cost J as the expected distance between the prediction of the student model (with weights Θ and noise η) and the prediction of the teacher model (with weights Θ’ and noise η’) [enforcing consistency in a functional space by using relational knowledge encoded in output logits].”
Tarvainen and Volpi are combinable for the rationale given under claim 1.

Regarding claim 9:
Volpi as modified by Tarvainen and Kar teaches “The computer-implemented method of claim 1.”
Tarvainen further teaches “the step of calculating a loss on the samples from the episodic memory by calculating a combination of cross-entropy loss and a semantic consistency loss”: Tarvainen, section B.1, paragraph 3, “We used cross-entropy between the student softmax output and the one-hot label as the classification cost, and the mean square error between the student and teacher softmax outputs as the consistency cost. The total cost was the weighted sum of these costs [calculating a loss on the samples from the episodic memory by calculating a combination of cross-entropy loss and a semantic consistency loss], where the weight of classification cost was the expected number of labeled examples per minibatch, subject to the ramp-ups described below.”
Tarvainen and Volpi are combinable for the rationale given under claim 1.

Regarding claim 11:
Volpi as modified by Tarvainen and Kar teaches “The computer-implemented method of claim 1.”
Volpi further teaches “the step of calculating an overall loss for the working model by calculating a sum of losses on samples of a current task samples and samples of the episodic memory”: Volpi, paragraph 0086, “The task T can be considered as a classification task over K classes. An example model to be trained is a neural network trained in a supervised way using a cross-entropy loss via backpropagation (as shown in FIG. 1 at 114)”; Volpi, paragraph 0005, “The method for samples received in the stream of samples for updating the machine learning model comprises: accessing from the memory a set of previous samples for training the machine learning model for performing the task; defining a set of combined samples that includes a sample received from the stream of samples and the set of previous samples accessed from the memory; training the machine learning model using the set of combined samples [calculating a sum of losses on samples of a current task samples and samples of the episodic memory], the training the machine learning model defining an embedding space with the set of combined; determining whether to store or not store the sample received from the stream of samples in the memory with the set of previous samples based on distances between samples in the set of combined samples in the  embedding space; and storing in the memory, with the set of previous samples, the sample received from the stream of samples when said determining determines to store the sample received from the stream of samples.”

Regarding claim 15:
Volpi as modified by Tarvainen and Kar teaches “The computer-implemented method of claim 1.”
Kar further teaches “A computer-readable medium provided with a computer program that when the computer program is loaded and executed by a computer, the computer program causes the computer to carry out the steps”: Kar, paragraphs 0073-0074, “The memory 804 may include any of a variety of computer-readable media. The computer-readable media may be any available media that may be accessed by the computing device 800. The computer-readable media may include both volatile and nonvolatile media, and removable and non-removable media. By way of example, and not limitation, the computer-readable media may comprise computer-storage media and communication media. The computer-storage media may include both volatile and nonvolatile media and/or removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, and/or other data types. For example, the memory 804 may store computer-readable instructions (e.g., that represent a program(s) and/or a program element(s), such as an operating system [A computer-readable medium provided with a computer program that when the computer program is loaded and executed by a computer, the computer program causes the computer to carry out the steps].”
Kar and Volpi are analogous arts as they are both related to machine learning model training. It would have been obvious to a person having ordinary skill in the art prior to the effective filing date of the claimed invention to have combined the computer-readable medium of Kar with the teachings of Volpi to arrive at the present invention, in order to provide the method in a form suitable for computer execution, as stated in Kar, paragraph 0074, “By way of example, and not limitation, the computer-readable media may comprise computer-storage media and communication media. The computer-storage media may include both volatile and nonvolatile media and/or removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, and/or other data types. For example, the memory 804 may store computer-readable instructions (e.g., that represent a program(s) and/or a program element(s), such as an operating system.”

Regarding claim 16:
Volpi as modified by Tarvainen and Kar teaches “the computer-implemented method according to claim 1.”
Volpi further teaches (bold only) “An autonomous vehicle comprising a data processing system loaded with a computer program arranged for causing the data processing system to carry out the steps of the computer-implemented method according to claim 1 for enabling the autonomous vehicle to continually adapt and acquire knowledge from an environment surrounding the autonomous vehicle”: Volpi, paragraphs 0061-0062, “Example online continual learning methods [continually adapt and acquire knowledge] herein are generally applicable to tasks performed by neural network models including but not limited to supervised or self-supervised tasks, or a combination of supervised and self-supervised tasks ( e.g., jointly optimized based on supervised and self-supervised objectives). Joint optimization may share the memory and encode features in the memory based on the supervised/self-supervised objectives. Example tasks include, but are not limited to, classification-based tasks. Non-limiting example applications that can benefit from example online continual learning methods include: house or other personal robots, whose underlying knowledge needs to be adapted to endlessly varying house environments while not forgetting important previous training; self-driving cars or other vehicles, whose processor-based vision modules may need specific adjustments according to the specific environments ( e.g. urban/rural, private/public, etc.) they need to engage with [enabling the autonomous vehicle to continually adapt and acquire knowledge from an environment surrounding the autonomous vehicle]; processor-based language models such as but not limited to natural language processing (NLP) models, which should account for new information that is continually generated; processor implemented search engines that use online learning methods; processor-implemented algorithms, e.g., by a server or client device, that process social network data flows, which are extremely heterogeneous; and many others.”

Regarding claim 17:
Volpi as modified by Tarvainen and Kar teaches “The computer-implemented method according to claim 1.”
Kar further teaches “wherein the supervised loss is a mean cross entropy loss”: Kar, paragraph 0035, “The input feature of each node may be its attribute set, SA, which may be defined consistently across all nodes. Since SA may be composed of different categorical and continuous components, appropriate losses may be used per feature component when training using per node reconstruction loss 206. For example, and without limitation, cross-entropy loss [the supervised loss is a … cross entropy loss] may be used for categorical attributes while L1 loss may be used for continuous attributes”; Kar, paragraph 0043, “The task loss in equation (3) may not be differentiable with respect to the parameters, Θ, since the score is measured using validation data and not S’. A reinforce score function estimator, or another unbiased estimator of the gradient, may be used to compute the gradients of equation (3). Reformulating the objective as a loss and writing the gradient yields equation ( 4), below:

    PNG
    media_image1.png
    75
    433
    media_image1.png
    Greyscale

To reduce the variance of the gradient from the estimator, an exponential moving average of previous scores may be tracked and subtracted from a current score [the supervised loss is a mean]. This expectation may be approximated using one sample from GΘ(S).”
	Kar and Volpi are combinable for the rationale given under claim 1.

Claims 6, 10, and 14 rejected under 35 U.S.C. 103 over Volpi as modified by Tarvainen and Kar in view of Huang et al.,  “Self-Adaptive Training: Bridging Supervised and Self-Supervised Learning,” 2021, arXiv:2101.08732v2 (hereafter Huang).

Regarding claim 6:
Volpi as modified by Tarvainen and Kar teaches “The computer-implemented method of claim 1.”
Kar further teaches (bold only) “the step of adjusting a level of contribution of an input sample to the training of the working model by assigning a weight to the input sample, wherein the weight is configured to be inversely proportional to the distance between the cross-entropy loss of the input sample and mean statistics of the error memory, such as the exponentially moving average of the cross-entropy loss in the memory of errors”: Kar, paragraph 0035, “The input feature of each node may be its attribute set, SA, which may be defined consistently across all nodes. Since SA may be composed of different categorical and continuous components, appropriate losses may be used per feature component when training using per node reconstruction loss 206. For example, and without limitation, cross-entropy loss [cross-entropy loss] may be used for categorical attributes while L1 loss may be used for continuous attributes”; Kar, paragraphs 0043-0044, “Now referring to FIG. 2C, FIG. 2C is an example illustration of a process 200C for training a distribution transformer model for computing transformed scene graphs fine-tuned for a task of a downstream task network, in accordance with some embodiments of the present disclosure. For example, a second objective of the distribution transformer 108 may be to generate data from the probabilistic grammar 102, P, such that a model trained on this data achieves best performance when tested on a target validation set of real-world data, V. This may be referred to as a meta-objective, where the input data may be optimized to improve accuracy on the validation set, V. This meta-objective may be tuned to a task network that is trained for a particular downstream task […] The task loss in equation (3) may not be differentiable with respect to the parameters, Θ, since the score is measured using validation data and not S’. A reinforce score function estimator, or another unbiased estimator of the gradient, may be used to compute the gradients of equation (3). Reformulating the objective as a loss and writing the gradient yields equation ( 4), below:

    PNG
    media_image1.png
    75
    433
    media_image1.png
    Greyscale

To reduce the variance of the gradient from the estimator, an exponential moving average of previous scores may be tracked [the exponentially moving average of the cross-entropy loss in the memory of errors] and subtracted from a current score. This expectation may be approximated using one sample from GΘ(S).”
	Kar and Volpi are combinable for the rationale given under claim 1.
Volpi as modified by Tarvainen and Kar does not explicitly teach (bold only) “the step of adjusting a level of contribution of an input sample to the training of the working model by assigning a weight to the input sample, wherein the weight is configured to be inversely proportional to the distance between the cross-entropy loss of the input sample and mean statistics of the error memory, such as the exponentially moving average of the cross-entropy loss in the memory of errors.”
Huang teaches (bold only) “the step of adjusting a level of contribution of an input sample to the training of the working model by assigning a weight to the input sample, wherein the weight is configured to be inversely proportional to the distance between the cross-entropy loss of the input sample and mean statistics of the error memory, such as the exponentially moving average of the cross-entropy loss in the memory of errors”: Huang, section 2.2, paragraph 1, “Then, the training targets track all historical model predictions during training and are updated by Exponential-Moving-Average (EMA) scheme as

    PNG
    media_image2.png
    44
    385
    media_image2.png
    Greyscale
”; Huang, section 3.1, paragraphs 1-3, Following the common practice in supervised learning, the loss function is implemented as the cross entropy loss between model predictions pi and training targets ti [the cross-entropy loss of the input sample] […] Based on the scheme presented above, we introduce a simple yet effective sample reweighting scheme on each sample [by assigning a weight to the input sample]. Concretely, given training target ti, we set

    PNG
    media_image3.png
    46
    315
    media_image3.png
    Greyscale

[wherein the weight is configured to be inversely proportional to the distance between the cross-entropy loss of the input sample and mean statistics of the error memory] The sample weight wi ∈ [ 1/c, 1] reveals the labeling confidence of this sample. Intuitively, all samples are treated equally in the first Es epochs. As the target ti being updated, our algorithm pays less attention to potentially erroneous data and learns more from potentially clean data. This scheme also allows the corrupted samples to re-attain attention if they are confidently corrected.”
Huang and Volpi are analogous arts as they are both related to machine learning model training. It would have been obvious to a person having ordinary skill in the art prior to the effective filing date of the claimed invention to have combined the sample weighting of Huang with the teachings of Volpi to arrive at the present invention, in order to focus model training on less noisy data, as stated in Huang, section 3.1, paragraph 3, “As the target ti being updated, our algorithm pays less attention to potentially erroneous data and learns more from potentially clean data.”

Regarding claim 10:
Volpi as modified by Tarvainen and Kar teaches “The computer-implemented method of claim 1.”
Volpi as modified by Tarvainen and Kar does not explicitly teach “the step of calculating an error sensitivity modulated task loss for the working model by calculating a weighted sum of all samples in a current task.”
Huang teaches “the step of calculating an error sensitivity modulated task loss for the working model by calculating a weighted sum of all samples in a current task”: Huang, section 3.1, paragraphs 3-4: “Based on the scheme presented above, we introduce a simple yet effective sample reweighting scheme on each sample. Concretely, given training target ti, we set

    PNG
    media_image3.png
    46
    315
    media_image3.png
    Greyscale

The sample weight wi ∈ [ 1/c, 1] reveals the labeling confidence of this sample. Intuitively, all samples are treated equally in the first Es epochs. As the target ti being updated, our algorithm pays less attention to potentially erroneous data and learns more from potentially clean data. This scheme also allows the corrupted samples to re-attain attention if they are confidently corrected. Putting everything together; We use stochastic gradient descent to minimize:

    PNG
    media_image4.png
    61
    411
    media_image4.png
    Greyscale

[calculating an error sensitivity modulated task loss for the working model by calculating a weighted sum of all samples in a current task] during the training process. Here, the denominator normalizes per sample weights and stabilizes the loss scale.”
Huang and Volpi are analogous arts as they are both related to machine learning model training. It would have been obvious to a person having ordinary skill in the art prior to the effective filing date of the claimed invention to have combined the sample weighting of Huang with the teachings of Volpi to arrive at the present invention, in order to focus model training on less noisy data, as stated in Huang, section 3.1, paragraph 3, “As the target ti being updated, our algorithm pays less attention to potentially erroneous data and learns more from potentially clean data.”

Regarding claim 14:
Volpi as modified by Tarvainen and Kar teaches “The computer-implemented method of claim 1.”
Volpi as modified by Tarvainen and Kar does not explicitly teach “the step of preventing abrupt changes in estimations at the task boundary by employing a task warm-up phase wherein the exponentially moving average is not updated during the warm-up phase.”
Huang teaches “the step of preventing abrupt changes in estimations at the task boundary by employing a task warm-up phase wherein the exponentially moving average is not updated during the warm-up phase”: Huang, section 2.2, paragraph 1, “Then, the training targets track all historical model predictions during training and are updated by Exponential-Moving-Average (EMA) scheme as

    PNG
    media_image2.png
    44
    385
    media_image2.png
    Greyscale
”; Huang, section 3.1, paragraph 2, “During the training process, we fix ti in the first Es training epochs and update the training targets ti according to Equation (2) in each following training epoch [employing a task warm-up phase wherein the exponentially moving average is not updated during the warm-up phase]. The number of initial epochs Es allows the model to capture informative signals in the data set and excludes ambiguous information that is provided by model predictions in the early stage of training.”
Huang and Volpi are analogous arts as they are both related to machine learning model training. It would have been obvious to a person having ordinary skill in the art prior to the effective filing date of the claimed invention to have combined the warm-up period of Huang with the teachings of Volpi to arrive at the present invention, in order to improve model predictions by limiting ambiguous data, as stated in Huang, section 3.1, paragraph 2, “The number of initial epochs Es allows the model to capture informative signals in the data set and excludes ambiguous information that is provided by model predictions in the early stage of training.”

Claim 7 rejected under 35 U.S.C. 103 over Volpi as modified by Tarvainen and Kar in view of Soviany et al., “Curriculum Learning: A Survey,” April 2022, arXiv:2101.10382v3 (hereafter Soviany).
Volpi as modified by Tarvainen and Kar teaches “The computer-implemented method of claim 1.”
	Volpi as modified by Tarvainen and Kar does not explicitly teach “the step of pre-selecting candidates for the episodic memory wherein only task samples with a loss lower than a user-defined threshold are passed to the episodic memory for selection.”
Soviany teaches “the step of pre-selecting candidates for the episodic memory wherein only task samples with a loss lower than a user-defined threshold are passed to the episodic memory for selection”: Soviany, section 4.1, paragraph 10, “Ma et al. (2017) borrow the instructor-student collaborative intuition from SPCL and introduce a self-paced co-training strategy. They extend the traditional SPL approach to the two-view scenario, by adding importance weights for the views on top of the corresponding regularizer. The algorithm uses a ‘draw with replacement’ methodology, i.e., previously selected examples from the pool are kept only if the value of the loss is lower than a fixed threshold [pre-selecting candidates for the episodic memory wherein only task samples with a loss lower than a user-defined threshold are passed to the episodic memory for selection, wherein user-defined threshold interpreted as including fixed values selected by designer of system]. To test their approach, the authors conduct extensive text classification and person re-identification experiments.”
Soviany and Volpi are analogous arts as they are both related to machine learning model training. It would have been obvious to a person having ordinary skill in the art prior to the effective filing date of the claimed invention to have combined the sample selection threshold of Soviany with the teachings of Volpi to arrive at the present invention, in order to improve model results, as stated in Soviany, section 4.1, paragraph 2, “The idea of presenting the examples in a meaningful order, starting from the easiest samples, then gradually introducing more complex ones, was inspired by the way humans learn. To show that automatic models benefit from such a training strategy, achieving faster convergence, while finding a better local minimum, the authors conduct multiple experiments.”

Claims 12-13 rejected under 35 U.S.C. 103 over Volpi as modified by Tarvainen and Kar in view of Alharbi et al., “Error-Based Noise Filtering During Neural Network Training,” 2020, IEEE Access, vol. 8, pp. 156996-157004 (hereafter Alharbi).

Regarding claim 12:
Volpi as modified by Tarvainen and Kar teaches “The computer-implemented method of claim 1.”
Volpi further teaches (bold only) “the step of filtering sample losses by configuring the distance between the cross-entropy loss of input sample and mean statistics of the error memory, such as the exponentially moving average of the cross-entropy loss in the memory of errors to be equal or less than a predefined standard deviation”: Volpi, paragraph 0086, “The task T can be considered as a classification task over K classes. An example model to be trained is a neural network trained in a supervised way using a cross-entropy loss [the cross-entropy loss of input sample] via backpropagation (as shown in FIG. 1 at 114).”
Kar further teaches (bold only) “the step of filtering sample losses by configuring the distance between the cross-entropy loss of input sample and mean statistics of the error memory, such as the exponentially moving average of the cross-entropy loss in the memory of errors to be equal or less than a predefined standard deviation”: Kar, paragraphs 0043-0044, “Now referring to FIG. 2C, FIG. 2C is an example illustration of a process 200C for training a distribution transformer model for computing transformed scene graphs fine-tuned for a task of a downstream task network, in accordance with some embodiments of the present disclosure. For example, a second objective of the distribution transformer 108 may be to generate data from the probabilistic grammar 102, P, such that a model trained on this data achieves best performance when tested on a target validation set of real-world data, V. This may be referred to as a meta-objective, where the input data may be optimized to improve accuracy on the validation set, V. This meta-objective may be tuned to a task network that is trained for a particular downstream task […] The task loss in equation (3) may not be differentiable with respect to the parameters, Θ, since the score is measured using validation data and not S’. A reinforce score function estimator, or another unbiased estimator of the gradient, may be used to compute the gradients of equation (3). Reformulating the objective as a loss and writing the gradient yields equation ( 4), below:

    PNG
    media_image1.png
    75
    433
    media_image1.png
    Greyscale

To reduce the variance of the gradient from the estimator, an exponential moving average of previous scores may be tracked and subtracted from a current score [configuring the distance between the … loss of input sample and mean statistics of the error memory, such as the exponentially moving average of the … loss in the memory of errors]. This expectation may be approximated using one sample from GΘ(S).”
Kai and Volpi are combinable for the rationale given under claim 1.
Volpi as modified by Tarvainen and Kar does not explicitly teach (bold only) “the step of filtering sample losses by configuring the distance between the cross-entropy loss of input sample and mean statistics of the error memory, such as the exponentially moving average of the cross-entropy loss in the memory of errors to be equal or less than a predefined standard deviation.”
Alharbi teaches (bold only) “the step of filtering sample losses by configuring the distance between the cross-entropy loss of input sample and mean statistics of the error memory, such as the exponentially moving average of the cross-entropy loss in the memory of errors to be equal or less than a predefined standard deviation”: Alharbi, section III, paragraph 8, “Recall that the d instances are determined based on their EMA values. To determine the instances that will be removed as outliers, we sort the elements that are greater than μ + σ (k instances) [equal or less than a predefined standard deviation] in ascending order and remove the d instances with greater EMA values [filtering sample losses].”
Alharbi and Volpi are analogous arts as they are both related to machine learning model training. It would have been obvious to a person having ordinary skill in the art prior to the effective filing date of the claimed invention to have combined the filtering of training data of Alharbi with the teachings of Volpi to arrive at the present invention, in order to improve model accuracy, as stated in Alharbi, Abstract, “Our evaluation of the efficacy of our method on three well-known benchmark datasets demonstrates an improvement on classification accuracy in the presence of noise.”

Regarding claim 13:
Volpi as modified by Tarvainen and Kar teaches “The computer-implemented method of claim 1.”
	Kar further teaches (bold only) “the step of updating the memory of errors with the exponential moving average of the means of filtered sample losses”:  Kar, paragraphs 0043-0044, “Now referring to FIG. 2C, FIG. 2C is an example illustration of a process 200C for training a distribution transformer model for computing transformed scene graphs fine-tuned for a task of a downstream task network, in accordance with some embodiments of the present disclosure. For example, a second objective of the distribution transformer 108 may be to generate data from the probabilistic grammar 102, P, such that a model trained on this data achieves best performance when tested on a target validation set of real-world data, V. This may be referred to as a meta-objective, where the input data may be optimized to improve accuracy on the validation set, V. This meta-objective may be tuned to a task network that is trained for a particular downstream task […] The task loss in equation (3) may not be differentiable with respect to the parameters, Θ, since the score is measured using validation data and not S’. A reinforce score function estimator, or another unbiased estimator of the gradient, may be used to compute the gradients of equation (3). Reformulating the objective as a loss and writing the gradient yields equation ( 4), below:

    PNG
    media_image1.png
    75
    433
    media_image1.png
    Greyscale

To reduce the variance of the gradient from the estimator, an exponential moving average of previous scores may be tracked and subtracted from a current score [updating the memory of errors with the exponential moving average of the means of … sample losses]. This expectation may be approximated using one sample from GΘ(S).”
	Kar and Volpi are combinable for the rationale given under claim 1.
	Volpi as modified by Tarvainen and Kar does not explicitly teach (bold only) “the step of updating the memory of errors with the exponential moving average of the means of filtered sample losses.”
Alharbi teaches (bold only) “the step of updating the memory of errors with the exponential moving average of the means of filtered sample losses”: Alharbi, section III, paragraph 8, “Recall that the d instances are determined based on their EMA values. To determine the instances that will be removed as outliers, we sort the elements that are greater than μ + σ (k instances) in ascending order and remove the d instances with greater EMA values [the exponential moving average of the means of filtered sample losses].”
Alharbi and Volpi are analogous arts as they are both related to machine learning model training. It would have been obvious to a person having ordinary skill in the art prior to the effective filing date of the claimed invention to have combined the filtering of training data of Alharbi with the teachings of Volpi to arrive at the present invention, in order to improve model accuracy, as stated in Alharbi, Abstract, “Our evaluation of the efficacy of our method on three well-known benchmark datasets demonstrates an improvement on classification accuracy in the presence of noise.”

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Chaudhry et al., “Continual Learning with Tiny Episodic Memories,” 2019, arXiv:1902.10486v1, discloses a method for leveraging previous training data in ongoing training to mitigate catastrophic forgetting.
Goyal et al., US Pre-Grant Publication No. 2021/0374566, discloses a method of using the distribution of weights for classifier models trained for previous tasks in assigning weights to a classifier model for a current task.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to VINCENT SPRAUL whose telephone number is (703) 756-1511. The examiner can normally be reached M-F 9:00 am - 5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, MICHAEL HUNTLEY can be reached at (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/VAS/Examiner, Art Unit 2129                                                                                                                                                                                                        
/MICHAEL J HUNTLEY/Supervisory Patent Examiner, Art Unit 2129
Read full office action
Prosecution Timeline

Jan 20, 2023
Application Filed
Nov 06, 2025
Non-Final Rejection mailed — §101, §103 (current)
Precedent Cases

Applications granted by this same examiner with similar technology

17/163,396
Patent 12619905
FLEXIBLE EMBEDDING SYSTEMS AND METHODS FOR REAL-TIME COMPARISONS
5y 3m to grant Granted May 05, 2026
17/557,599
Patent 12608446
DETERMINING PERFORMANCE CHANGE WITHIN A DATASET WITH AN APPLIED CONDITION USING MACHINE LEARNING MODELS
4y 4m to grant Granted Apr 21, 2026
17/163,383
Patent 12591634
COMPOSITE EMBEDDING SYSTEMS AND METHODS FOR MULTI-LEVEL GRANULARITY SIMILARITY RELEVANCE SCORING
5y 2m to grant Granted Mar 31, 2026
17/249,028
Patent 12591796
INTELLIGENT DISTANCE PROMPTING
5y 1m to grant Granted Mar 31, 2026
17/353,931
Patent 12572620
RELIABLE INFERENCE OF A MACHINE LEARNING MODEL
4y 8m to grant Granted Mar 10, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.
Strategy Recommendation AI-generated — please review before filing

Get a prosecution strategy drawn from examiner precedents, rejection analysis, and claim mapping.
Typically takes 5-10 seconds — AI-generated, attorney review required before filing
Prosecution Projections

1-2
Expected OA Rounds
60%
Grant Probability
86%
With Interview (+26.7%)
4y 4m (~11m remaining)
Median Time to Grant
Low
PTA Risk
Based on 37 resolved cases by this examiner. Grant probability derived from career allowance rate.