DETAILED ACTION

This communication is in response to Application No. 18/199,819, filed on July 21, 2023, in which Claims 1-11 are presented for examination.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b):

(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 1-6 and 11 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.

Claim 1 recites the limitation "the update operation" in line 10. There is insufficient antecedent basis for this limitation in the claim. The phrase "the update operation" was not previously introduced.

Since independent claim 1 is rejected under 35 U.S.C. 112(b), claims 2-6 are also rejected under 35 U.S.C. 112(b) because they depend on claim 1 and inherit the same indefiniteness problems.

Claim 11 recites the limitation "the update operation" in line 2. There is insufficient antecedent basis for this limitation in the claim. The phrase "the update operation" was not previously introduced.

Claim Rejections - 35 USC § 101

35 U.S.C.
101 reads as follows:

Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-11 are rejected under 35 U.S.C. 101 because the claimed inventions are directed to an abstract idea without significantly more.

Regarding Claim 1:

Step 1: Claim 1 is a method-type claim. Therefore, Claims 1-6 fall within one of the four statutory categories (i.e., process, machine, manufacture, or composition of matter).

Step 2A Prong 1: If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation by mathematical calculation but for the recitation of generic computer components, then it falls within the “Mathematical Concepts” grouping of abstract ideas.

a forward propagation operation of performing a forward propagation (mathematical concept - mathematical calculations used to compute outputs of a neural network from input data – as disclosed in the specification (Paragraph [0022], “In the forward propagation, data is propagated from an input layer to an output layer.
Each neuron computes a weighted sum of inputs from connected neurons in its prior layer and then adds a value calculated as shown in Equation 1 with a bias”))

a backward propagation operation of performing a backward propagation (mathematical concept - backward propagation in a neural network involves mathematical calculations such as computing error derivatives, gradients, and propagating those values through layers to adjust model parameters – as disclosed in the specification (Paragraph [0024], “Backward propagation (backpropagation) is used to adjust weights (w_ij^l) by calculating derivatives. The backpropagation starts from the output layer based on, for example, Softmax. The derivative in the output layer is expressed in Equation 4.”))

and a weight update operation of performing a weight update, wherein the forward propagation operation, the backward propagation operation, and the weight update operation are repeatedly performed, and the update operation is performed based on an activation tendency […] (mathematical concept - recites mathematical calculations used to iteratively adjust neural network parameters based on activation history values and gradient-based optimization – as disclosed in the specification (Paragraph [0025], “In DNN training, weights are adjusted based on errors computed in the backpropagation. Initially, as in Equation 7, Δw_ij^l is calculated by multiplying backpropagation outcome δ_i^l by forward propagation outcome y_j^(l-1). Then, weights (w_ij^l) are updated according to Equation 8. Here, a learning rate η determines a degree of learning. This process is repeated for all the weights.”))

Step 2A Prong 2: This judicial exception is not integrated into a practical application.
[…] of each of neurons included in the deep learning model, in a process in which training for the first task proceeds (Adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely using a computer as a tool to perform an abstract idea - see MPEP 2106.05(f) – Examiner’s note: high-level recitation of training a deep learning model without significantly more)

For the reasons above, Claim 1 is rejected as being directed to an abstract idea without significantly more. This rejection applies equally to dependent claims 2-6. The additional limitations of the dependent claims are addressed below.

Regarding Claim 2:

Step 2A Prong 1: See the rejection of Claim 1 above, on which Claim 2 depends.

wherein the forward propagation operation, the backward propagation operation, and the weight update operation are repeatedly performed until […] converge to a predetermined range (mathematical concept - describes an iterative mathematical optimization process that repeatedly performs calculations until model parameters satisfy a convergence condition)

Step 2A Prong 2 & Step 2B:

[…] weights of the deep learning model […] (Adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely using a computer as a tool to perform an abstract idea - see MPEP 2106.05(f) – Examiner’s note: high-level recitation of using weights of the deep learning model without significantly more)

Accordingly, under Step 2A Prong 2 and Step 2B, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea, as discussed above in the rejection of claim 1. The claim does not include additional elements, considered individually and in combination, that are sufficient to amount to significantly more than the judicial exception.
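Although the cited specification passages define the recited operations only at the level of Equations 1, 4, 7, and 8, the iterated calculations they describe can be shown concretely. The following is a minimal, hypothetical sketch, not the applicant's or Park's disclosed algorithm: the sigmoid activation, the squared-error derivative, and all names and constants are illustrative assumptions. It repeats forward propagation (weighted sum plus bias), backward propagation (output-layer error derivative), and a weight update in which neurons whose activation tendency exceeds a predetermined value have their gradient scaled by a constant r.

```python
# For illustration only: a hypothetical sketch of the recited training loop.
# The sigmoid activation, squared-error derivative, running-average tendency,
# and all names are illustrative assumptions, not the disclosed algorithm.
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train(samples, n_in, n_out, lr=0.1, r=0.1, ah_threshold=0.5, epochs=300):
    weights = [[random.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    ah = [0.0] * n_out  # per-neuron activation tendency (running average)
    for _ in range(epochs):
        for x, target in samples:
            # Forward propagation: weighted sum of inputs plus a bias
            # (cf. Equation 1 of the specification).
            y = [sigmoid(sum(w * xi for w, xi in zip(weights[i], x)) + biases[i])
                 for i in range(n_out)]
            # Record each neuron's activation tendency during training.
            for i in range(n_out):
                ah[i] = 0.9 * ah[i] + 0.1 * y[i]
            # Backward propagation: output-layer error derivative
            # (cf. Equation 4; squared error assumed here for simplicity).
            delta = [(y[i] - target[i]) * y[i] * (1.0 - y[i])
                     for i in range(n_out)]
            # Weight update: delta_i * y_j scaled by the learning rate
            # (cf. Equations 7-8); neurons whose activation tendency exceeds
            # the threshold have their gradient multiplied by the constant r,
            # limiting their update.
            for i in range(n_out):
                scale = r if ah[i] > ah_threshold else 1.0
                for j in range(n_in):
                    weights[i][j] -= lr * scale * delta[i] * x[j]
                biases[i] -= lr * scale * delta[i]
    return weights, biases
```

In this sketch the three operations repeat for a fixed number of epochs; an implementation matching claim 2 would instead iterate until the weights converge to a predetermined range.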
Regarding Claim 3:

Step 2A Prong 1: See the rejection of Claim 1 above, on which Claim 3 depends.

wherein the weight update operation limits weight update for a neuron of which activation tendency has a value greater than a predetermined value in the process in which training for the first task proceeds (mathematical concept - involves comparing a numerical activation tendency value with a predetermined threshold and adjusting the parameter update calculation accordingly; “activation tendency” is interpreted as a numerical value representing the historical activation behavior of a neuron within a predetermined range, as described in the specification - paragraph [0037])

Step 2A Prong 2 & Step 2B:

Accordingly, under Step 2A Prong 2 and Step 2B, there are no additional elements that integrate the abstract idea into a practical application. The claim does not include additional elements, considered individually and in combination, that are sufficient to amount to significantly more than the judicial exception.

Regarding Claim 4:

Step 2A Prong 1: See the rejection of Claim 1 above, on which Claim 4 depends.
obtaining a gradient (mathematical concept – requires performing calculations to determine the gradient of an objective function with respect to model weights, e.g., Algorithm 3, step 2: “g ← ∇_w f(w) (get gradients with objective function)”, which represents a mathematical optimization calculation) for a neuron of which activation tendency has a value greater than a predetermined value in the process in which training for the first task proceeds, reducing a weight by multiplying the obtained gradient by a predetermined constant (r) (mathematical concept - requires performing mathematical calculations including comparing a numerical activation tendency with a threshold and reducing the gradient value by multiplying it by a constant (r) to adjust the parameter update) and updating the weight using the reduced weight (mental process - updating the weight using the reduced weight may be performed manually by a user by observing the reduced value and applying the corresponding arithmetic update to the weight)

Step 2A Prong 2 & Step 2B:

Accordingly, under Step 2A Prong 2 and Step 2B, there are no additional elements that integrate the abstract idea into a practical application. The claim does not include additional elements, considered individually and in combination, that are sufficient to amount to significantly more than the judicial exception.

Regarding Claim 5:

Step 2A Prong 1: See the rejection of Claim 1 above, on which Claim 5 depends.
Step 2A Prong 2 & Step 2B:

wherein, in a training process for the second task, the forward propagation operation and the backward propagation operation that are repeatedly performed proceed in parallel at least once (Field of Use – limitations that amount to merely indicating a field of use or technological environment in which to apply a judicial exception do not amount to significantly more than the exception itself and cannot integrate a judicial exception into a practical application; in this case, specifying that the forward propagation operation and the backward propagation operation that are repeatedly performed proceed in parallel does not integrate the exception into a practical application nor amount to significantly more – see MPEP 2106.05(h))

Accordingly, under Step 2A Prong 2 and Step 2B, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea, as discussed above in the rejection of claim 1. The claim does not include additional elements, considered individually and in combination, that are sufficient to amount to significantly more than the judicial exception.

Regarding Claim 6:

Step 2A Prong 1: See the rejection of Claim 1 above, on which Claim 6 depends.

wherein the backward propagation operation that proceeds in parallel speculates a forward propagation outcome at a current time based on a result of the forward propagation operation of at least a previous time and performs backward propagation based on the speculated result (mathematical concept – requires performing mathematical calculations to estimate a forward propagation outcome based on prior results and to perform backward propagation computations using the estimated values)

Step 2A Prong 2 & Step 2B:

Accordingly, under Step 2A Prong 2 and Step 2B, there are no additional elements that integrate the abstract idea into a practical application.
The claim does not include additional elements, considered individually and in combination, that are sufficient to amount to significantly more than the judicial exception.

Regarding Claim 7:

Step 1: Claim 7 is a method-type claim. Therefore, Claims 7-11 fall within one of the four statutory categories (i.e., process, machine, manufacture, or composition of matter).

Step 2A Prong 1: If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation by mathematical calculation but for the recitation of generic computer components, then it falls within the “Mathematical Concepts” grouping of abstract ideas.

a forward propagation operation of performing a forward propagation (mathematical concept - mathematical calculations used to compute outputs of a neural network from input data)

a backward propagation operation of performing a backward propagation (mathematical concept - backward propagation in a neural network involves mathematical calculations such as computing error derivatives, gradients, and propagating those values through layers to adjust model parameters)

and a weight update operation of performing a weight update, wherein the forward propagation operation, the backward propagation operation, and the weight update operation are repeatedly performed, and the forward propagation operation and the backward propagation operation that are repeatedly performed after an initial execution proceed in parallel at least once (mathematical concept - involves performing iterative mathematical calculations including forward propagation, backward propagation, and parameter updates)

Step 2A Prong 2: Accordingly, under Step 2A Prong 2 and Step 2B, there are no additional elements that integrate the
abstract idea into a practical application. The claim does not include additional elements, considered individually and in combination, that are sufficient to amount to significantly more than the judicial exception.

For the reasons above, Claim 7 is rejected as being directed to an abstract idea without significantly more. This rejection applies equally to dependent claims 8-11. The additional limitations of the dependent claims are addressed below.

Regarding Claim 8:

Step 2A Prong 1: See the rejection of Claim 7 above, on which Claim 8 depends.

wherein the backward propagation operation that proceeds in parallel speculates a forward propagation outcome at a current time based on a result of the forward propagation operation of at least a previous time and performs backward propagation based on the speculated result (mathematical concept – requires performing calculations to speculate the forward propagation outcome at time t=i using the forward propagation outcome at time t=(i−1) and the forward propagation outcome at time t=(i−2), and using the speculated forward propagation outcome to perform the backward propagation operation)

Step 2A Prong 2 & Step 2B:

Accordingly, under Step 2A Prong 2 and Step 2B, there are no additional elements that integrate the abstract idea into a practical application. The claim does not include additional elements, considered individually and in combination, that are sufficient to amount to significantly more than the judicial exception.

Regarding Claim 9:

Step 2A Prong 1: See the rejection of Claim 8 above, on which Claim 9 depends.
wherein the forward propagation outcome speculated at the time t=i is speculated by assigning a greater weight to […] at the time t=(i-2) than […] at the time t=(i-1) (mental process - assigning a greater weight may be performed by a user observing and analyzing the results at time t=(i−2) and time t=(i−1) and using judgment or evaluation to assign a greater weight to one result than the other)

Step 2A Prong 2 & Step 2B:

[…] a result of the deep learning model […] a result of the deep learning model […] (Adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely using a computer as a tool to perform an abstract idea - see MPEP 2106.05(f) – Examiner’s note: high-level recitation of using a result of the deep learning model without significantly more)

Accordingly, under Step 2A Prong 2 and Step 2B, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea, as discussed above in the rejection of claim 8. The claim does not include additional elements, considered individually and in combination, that are sufficient to amount to significantly more than the judicial exception.

Regarding Claim 10:

Step 2A Prong 1: See the rejection of Claim 8 above, on which Claim 10 depends.
Step 2A Prong 2 & Step 2B:

wherein an activation status of each of the neurons at the time t=i is speculated based on the activation tendency of each of the neurons included in the deep learning model up to the time t=(i-1) (Adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely using a computer as a tool to perform an abstract idea - see MPEP 2106.05(f) – Examiner’s note: high-level recitation of using the neurons included in the deep learning model without significantly more)

Accordingly, under Step 2A Prong 2 and Step 2B, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea, as discussed above in the rejection of claim 8. The claim does not include additional elements, considered individually and in combination, that are sufficient to amount to significantly more than the judicial exception.

Regarding Claim 11:

Step 2A Prong 1: See the rejection of Claim 7 above, on which Claim 11 depends.
Step 2A Prong 2 & Step 2B:

wherein, in a training process for a second task performed after training for a first task is completed, the update operation is performed based on an activation tendency of […] (Field of Use – limitations that amount to merely indicating a field of use or technological environment in which to apply a judicial exception do not amount to significantly more than the exception itself and cannot integrate a judicial exception into a practical application; in this case, specifying that, in a training process for a second task performed after training for a first task is completed, the update operation is performed based on an activation tendency does not integrate the exception into a practical application nor amount to significantly more – see MPEP 2106.05(h))

[…] each of neurons included in the deep learning model in a process in which training for the first task proceeds (Adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely using a computer as a tool to perform an abstract idea - see MPEP 2106.05(f) – Examiner’s note: high-level recitation of using the neurons included in the deep learning model without significantly more)

Accordingly, under Step 2A Prong 2 and Step 2B, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea, as discussed above in the rejection of claim 7. The claim does not include additional elements, considered individually and in combination, that are sufficient to amount to significantly more than the judicial exception.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C.
103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-6 and 10-11 are rejected under 35 U.S.C. 103 as being unpatentable over Butvinik et al. (hereinafter Butvinik) (US 20220261633), in view of Naumov et al. (hereinafter Naumov) (US 20190188569), and further in view of Park et al. (hereinafter Park), a non-patent literature reference titled “Continual Learning with Speculative Backpropagation and Activation History.”

Regarding Claim 1, Butvinik teaches a continual learning method of a deep learning model (Butvinik, Par. [0010], “method for training a machine learning model using incremental learning without forgetting”, thus a continual learning method is disclosed) performed by a computing device (Butvinik, Par. [0149], “Computing device”, thus a computing device is disclosed) comprising at least a processor (Butvinik, Par. [0151], “One or more processor(s)”, thus a processor is disclosed), for continual learning for a second task and an nth task for the deep learning model trained for a first task (Butvinik, Par. [0004], “when a neural network is used to learn a sequence of tasks, the learning of the later tasks may degrade the performance of the models learned for the earlier tasks. OIL algorithms try to achieve this same ability for neural networks and to solve the catastrophic forgetting problem.
Thus, in essence, continual learning performs incremental learning of new tasks”, thus continual learning for a second task and an nth task for the deep learning model trained for a first task is disclosed because Butvinik explains that neural networks are trained on a sequence of tasks, where later tasks are learned after earlier ones. Thus, a model trained on an initial task can subsequently learn additional tasks (e.g., a second task and further tasks up to an nth task) in a continual-learning manner without forgetting previous knowledge), the continual learning method comprising: […] in a process in which training for the first task proceeds (Butvinik, Par. [0028], “The model is characterized by parameters p that change upon training each new task i. The model is sequentially trained for each task i using data X and associated labels y to form a decision boundary f(x; θi−1), where θ represents, e.g., parameters or weights of the neural network of FIG. 1. The decision boundaries may be applied to data to yield a decision. When training incrementally, after the first task i=1 is trained to generate an initial decision boundary 100, training a subsequent second task i=2 changes the decision boundary 102, causing the model to forget the initial task training (e.g., data that previously fell on one side of the decision boundary is updated to fall on a different side of the decision boundary). Thus, the model can drastically lose its accuracy to predict on an initial task after being trained on a new task.
This usually means a new task will likely override many of the weights that have been learned in the past, and thus degrade the model performance for the past tasks”, thus in a process in which training for the first task proceeds is disclosed because Butvinik teaches that the neural network model is trained sequentially on tasks and that, during training for a task (e.g., the first task i=1), the model parameters or weights are adjusted using training data to form an initial decision boundary. This corresponds to training for the first task proceeding through updates to the model parameters based on the training data for that task)

Butvinik does not explicitly teach a forward propagation operation of performing a forward propagation, a backward propagation operation of performing a backward propagation, and a weight update operation of performing a weight update, wherein the forward propagation operation, the backward propagation operation, and the weight update operation are repeatedly performed, and the update operation is performed. However, Naumov teaches these limitations.

a forward propagation operation of performing a forward propagation (Naumov, Par. [0003], “Training a neural network model that includes multiple layers is accomplished by propagating input training data forward through each layer to produce output data”, & Par. [0029], “As previously explained, forward and backward propagation are used in training of neural networks. The training is an optimization procedure that minimizes the loss function over data samples (x*,z*) in a data set D. The loss function measures on average the error ε(.,.)
between the computed output of the neural network, z(l) and the correct solution z*, e.g. cross entropy error function”, & Par. [0030], “The forward propagation starts with an input x*. An affine function θk is applied followed by a component-wise application of a non-linear activation function ƒk to obtain an output of a layer. Propagation proceeds sequentially through a composition of layers k=1, . . . , l defining the neural network ϕ. As a result, an output z(l) is computed at the final layer, thus a forward propagation operation of performing a forward propagation is disclosed because Naumov teaches propagating input training data forward through each layer to produce output data, which corresponds to performing the forward propagation operation, and further teaches that forward propagation starts with an input and proceeds sequentially through layers of the neural network to compute an output at the final layer, which corresponds to the forward propagation operation of the deep learning model) a backward propagation operation of performing a backward propagation (Naumov, Par. [0003], “The error data is then back propagated through each layer, starting at the last layer, to update parameter values associated with each layer”, & Par. [0029], “As previously explained, forward and backward propagation are used in training of neural networks. The training is an optimization procedure that minimizes the loss function over data samples (x*,z*) in a data set D. The loss function measures on average the error ε(.,.) between the computed output of the neural network, z(l) and the correct solution z*, e.g. cross entropy error function”, & Par. [0031], “The backward propagation proceeds sequentially backwards through a composition of layers k=l, . . . , 1 defining the neural network ϕ. 
As a result, the backward propagation computes errors v(k) = ∇ εk · f′k at all layers”, thus a backward propagation operation of performing a backward propagation is disclosed because Naumov teaches that error data is back propagated through each layer starting at the last layer to update parameter values associated with each layer, which corresponds to performing the backward propagation operation, and further teaches that backward propagation proceeds sequentially backward through the layers of the neural network to compute errors at all layers, which corresponds to the backward propagation operation of the deep learning model) and a weight update operation of performing a weight update, wherein the forward propagation operation, the backward propagation operation, and the weight update operation are repeatedly performed, and the update operation is performed […] (Naumov, Par. [0027], “In an embodiment, the parameters include weights and the odd error gradient and even error gradients are used to compute delta (difference) values that are combined with the weights to update the weights for each layer. In an embodiment, the parameters include bias delta values. In an embodiment, the error gradients and consequently the updated parameters for the layers of the parallel neural networks equal the updated parameters for the corresponding layer of the original neural network model 110 . In other words, the values propagated forward and backward through the layers of the parallel neural networks are mathematically equivalent to the values propagated forward and backward for the original neural network model”, Par. [0032], “The errors v(k) can then be used to update coefficients of functions θk . 
In particular, in equation (4) these coefficients are the weights W(k) and bias b(k), where the following can be written [equation not reproduced]”, thus a weight update operation of performing a weight update is disclosed because Naumov teaches that error gradients produced during training are used to compute delta values that are combined with the weights to update the weights for each layer, which corresponds to performing the weight update operation, and further teaches that the computed errors are used to update the coefficients of the neural network functions, including the weights W(k) and bias b(k), which corresponds to performing the weight update operation during neural network training)

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the continual learning method of Butvinik with the neural network training operations and parallel neural network architecture of Naumov because Naumov teaches performing neural network operations using parallel neural networks to improve processing performance (Naumov, Par. [0061], “Steps 210 and 220 may be performed in parallel. In an embodiment, the odd input data is processed by the odd neural network model 120 according to the updated odd parameter values simultaneously with processing of the even input data by the even neural network model 130 according to the updated even parameter values to produce intermediate odd data and intermediate even data at each layer and the odd and even outputs at the last layers. In contrast, when the input is processed by the neural network model 110, the parallel processing across layers is not possible. Therefore, the processing performance is increased, possibly doubled, for the parallel neural network system 100 compared with the neural network model 110.
In an embodiment, the even and odd outputs are computed by the even neural network model 130 and the odd neural network model 120 according to equations (31) and (30), respectively.”, thus Naumov teaches that neural network operations, including forward propagation, backward propagation, and parameter update operations, may be executed using parallel neural network models that process data simultaneously, thereby increasing processing performance compared with a single sequential neural network, and therefore Naumov’s parallel neural network processing techniques may be incorporated into the continual learning method of Butvinik so that the neural network training operations are performed using parallel neural networks, thereby increasing processing performance)

Butvinik combined with Naumov does not explicitly teach […] based on an activation tendency of each of neurons included in the deep learning model […]. However, Park teaches this limitation.

[…] based on an activation tendency of each of neurons included in the deep learning model […] (Park, Page 5 – Section IV, “A variable called activation history (ah_j^l) stores the tendency of neuron’s activation while training for the previous tasks. After training for each task, ah_j^l is adjusted with the biased ReLU computed in Algorithm 1.
r specifies the degree of weight update and directly affects the knowledge preservation performance”, thus updating based on an activation tendency of each neuron included in the deep learning model is disclosed because Park teaches storing an activation history variable for each neuron that records the tendency of the neuron’s activation during training of previous tasks, and further teaches adjusting the weight update using this activation history value and a parameter r that controls the degree of weight update, which corresponds to performing the update operation based on the activation tendency of neurons in the deep learning model)

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the continual learning method of Butvinik with the neural network training operations and parallel neural network architecture of Naumov, and to further combine the technique taught by Park of using an activation history of neurons to guide weight updates in continual learning systems to mitigate catastrophic forgetting. Park explains that the activation history isolates weights that are important for previous tasks so that those weights are updated in a controlled manner during subsequent training. Park further teaches that this technique improves knowledge preservation and reduces training time compared with existing continual learning approaches. Incorporating Park’s activation-history-based update techniques into the combined Butvinik and Naumov system allows the continual learning model to perform weight updates based on neuron activation tendencies while improving knowledge preservation and training efficiency during continual learning (Park, Page 1 – Abstract, “Speculative Backpropagation (SB) and Activation History (AH). The SB enables performing backpropagation based on past knowledge. The AH enables isolating important weights for the previous task.
We evaluated the performance of our scheme in terms of accuracy and training time. The experiment results show a 4.4% improvement in knowledge preservation and a 31% reduction in training time, compared to the state-of-the-arts (EWC and SI)”, thus Park teaches that using an activation history of neurons to guide weight updates during continual learning improves knowledge preservation and reduces training time, thereby enabling the continual learning model to update weights based on neuron activation tendencies while preserving knowledge learned from previous tasks and improving the efficiency of the training process) Regarding Claim 2, Butvinik combined with Naumov and further combined with Park teaches all of the limitations of claim 1 as cited above and Naumov further teaches: wherein the forward propagation operation, the backward propagation operation, and the weight update operation are repeatedly performed until weights of the deep learning model converge to a predetermined range (Naumov, Par. [0029], “As previously explained, forward and backward propagation are used in training of neural networks. The training is an optimization procedure that minimizes the loss function over data samples (x*,z*) in a data set D. The loss function measures on average the error ε(.,.) between the computed output of the neural network, z(l) and the correct solution z*, e.g. cross entropy error function”, & Par.
[0033], “Therefore, the stochastic gradient descent and its variants, often average updates of the coefficients across the mini-batch”, thus repeatedly performing forward propagation, backward propagation, and weight update operations until the weights of the deep learning model converge to a predetermined range is disclosed because Naumov teaches that neural network training is an optimization procedure that minimizes a loss function over training data and further teaches updating model coefficients using stochastic gradient descent and its variants across training iterations, which corresponds to repeatedly performing propagation and weight update operations until the model parameters converge to a stable or optimized range defined by the training process) Regarding Claim 3, Butvinik combined with Naumov and further combined with Park teaches all of the limitations of claim 1 as cited above and Park further teaches: wherein the weight update operation limits weight update for a neuron of which activation tendency has a value greater than a predetermined value in the process in which training for the first task proceeds (Park, Page 5 – Section IV, “A variable called activation history (ahl_j) stores the tendency of neuron’s activation while training for the previous tasks. After training for each task, ahl_j is adjusted with the biased ReLU computed in Algorithm 1. r specifies the degree of weight update and directly affects the knowledge preservation performance.
In proportion to r, the weights connected to activated neurons in the past are updated less (line# 3-4), when training for a new task”, & Page 5 – Section IV, Algorithm 3, “3: if ahl_j > 0.5 do (if the neuron is activated with high probability for the previous task) 4: g ← g · r (reduce the gradients of the weights) 5: w ← w − η · g (update the weight)”, thus limiting the weight update for a neuron whose activation tendency has a value greater than a predetermined value is disclosed because Park teaches storing an activation history value for each neuron that represents the tendency of the neuron’s activation during previous tasks, and further teaches that when the activation history value exceeds a threshold (e.g., ahl_j > 0.5), the gradient used for updating the weight is reduced by multiplying it by a factor r, which results in the weights connected to previously activated neurons being updated less during training of a new task, corresponding to limiting the weight update for neurons whose activation tendency exceeds a predetermined value) Regarding Claim 4, Butvinik combined with Naumov and further combined with Park teaches all of the limitations of claim 1 as cited above and Park further teaches: obtaining a gradient (Park, Page 5 – Section IV, Algorithm 3, “1: while weight not converged do 2: g ← ∇_w f(w) (get gradients with objective function) 3: if ahl_j > 0.5 do (if the neuron is activated with high probability for the previous task) 4: g ← g · r (reduce the gradients of the weights)”, thus obtaining a gradient is disclosed because Park teaches computing gradients during the training process, where the algorithm explicitly calculates the gradient g ← ∇_w f(w) using the objective function before applying any modification to the gradient for weight updates, which corresponds to obtaining a gradient for use in updating the weights of the neural network) for a neuron of which activation tendency has a value greater than a predetermined value in the process
in which training for the first task proceeds, reducing a weight by multiplying the obtained gradient by a predetermined constant (r) (Park, Page 5 – Section IV, “A variable called activation history (ahl_j) stores the tendency of neuron’s activation while training for the previous tasks. After training for each task, ahl_j is adjusted with the biased ReLU computed in Algorithm 1. r specifies the degree of weight update and directly affects the knowledge preservation performance. In proportion to r, the weights connected to activated neurons in the past are updated less (line# 3-4), when training for a new task”, & Page 5 – Section IV, Algorithm 3, “1: while weight not converged do 2: g ← ∇_w f(w) (get gradients with objective function) 3: if ahl_j > 0.5 do (if the neuron is activated with high probability for the previous task) 4: g ← g · r (reduce the gradients of the weights) 5: w ← w − η · g (update the weight) 6: end while”, thus reducing a weight for a neuron whose activation tendency exceeds a predetermined value by multiplying the obtained gradient by a predetermined constant r is disclosed because Park teaches storing an activation history value that represents the tendency of a neuron’s activation during previous tasks and further teaches that when this value exceeds a threshold (ahl_j > 0.5), the gradient used for updating the weight is multiplied by a constant r according to g ← g · r, which reduces the gradient and results in the weights connected to previously activated neurons being updated less during subsequent training) and updating the weight using the reduced weight (Park, Page 5 – Section IV, Algorithm 3, “1: while weight not converged do 2: g ← ∇_w f(w) (get gradients with objective function) 3: if ahl_j > 0.5 do (if the neuron is activated with high probability for the previous task) 4: g ← g · r (reduce the gradients of the weights) 5: w ← w − η · g (update the weight) 6: end while”, thus updating the weight using the reduced weight is
disclosed because Park teaches that after reducing the gradient by multiplying it by the constant r when the activation history exceeds the threshold, the weight is updated according to w ← w − η · g, which corresponds to updating the weight using the reduced gradient during the weight update step of the training process) Regarding Claim 5, Butvinik combined with Naumov and further combined with Park teaches all of the limitations of claim 1 as cited above and Naumov further teaches: wherein, in a training process for the second task, the forward propagation operation and the backward propagation operation that are repeatedly performed proceed in parallel at least once (Naumov, Par. [0016], “Computations may be performed in parallel during forward and/or backward propagation through the parallel neural networks. Parallel computations may accelerate the forward and backward propagation operations with an increase in memory consumption, but no significant change in accuracy. Furthermore, restructuring a single neural network into two or more parallel neural networks reduces the total time needed for training”, & Par. [0021], “The odd error gradient is propagated backwards through each successive odd processing layer in the odd neural network model 120 to update odd parameter values for each of the odd processing layers. The even error gradient is propagated backwards through each successive even processing layer in the even neural network model 130 to update even parameter values for each of the even processing layers.
The exact backward propagation may be performed simultaneously through the odd neural network model 120 and the even neural network model 130, resulting in faster training”, thus the forward propagation operation and the backward propagation operation proceeding in parallel at least once during training is disclosed because Naumov teaches that computations may be performed in parallel during forward and backward propagation through parallel neural networks, and further teaches that backward propagation may be performed simultaneously through separate neural network models (e.g., odd and even neural network models), which corresponds to the forward and backward propagation operations being executed in parallel during the neural network training process) Regarding Claim 6, Butvinik combined with Naumov and further combined with Park teaches all of the limitations of claim 1 as cited above and Naumov further teaches: wherein the backward propagation operation that proceeds in parallel speculates a forward propagation outcome at a current time based on a result of the forward propagation operation of at least a previous time and performs backward propagation based on the speculated result (Naumov, Par. [0039], “Notice that in forward propagation the points y(k) and fk are not known ahead of time. However, the points y(k) and fk can potentially be estimated from a previous pass over the data. The approximation can be effective in the later stages of training, when the training is close to the solution.
Therefore, the forward propagation can be approximated by solving equation”, thus the backward propagation operation that proceeds in parallel speculates a forward propagation outcome at a current time based on a result of the forward propagation operation of at least a previous time and performs backward propagation based on the speculated result is disclosed because Naumov teaches that the values used in forward propagation, such as the intermediate points y(k) and fk, may be estimated using results from a previous pass over the data, and that such approximations can be used during training, which corresponds to speculating a forward propagation outcome based on results from a previous iteration and performing subsequent computations, including backward propagation, using the estimated values) Regarding Claim 10, Butvinik combined with Naumov teaches all of the limitations of claim 8 as cited above. Butvinik combined with Naumov does not explicitly teach wherein an activation status of each of the neurons at the time t=i is speculated based on the activation tendency of each of the neurons included in the deep learning model up to the time t=i-1. However, Park teaches wherein an activation status of each of the neurons at the time t=i is speculated based on the activation tendency of each of the neurons included in the deep learning model up to the time t=i-1. wherein an activation status of each of the neurons at the time t=i is speculated based on the activation tendency of each of the neurons included in the deep learning model up to the time t=i-1 (Park, Page 4 – Section IV, “The biased ReLU is adjusted by the current activation outcome whenever the forward propagation is finished. When the neurons are deactivated, the biased ReLU becomes closer to 0. When the neurons are activated, it gets closer to 1. Algorithm 2 shows a method of speculating the neuron’s activation based on the biased ReLU.
When the biased ReLU is smaller than 0.5, it predicts the neuron will be deactivated. Otherwise, it speculates the neuron will be activated”, thus speculating the activation status of a neuron based on its activation tendency is disclosed because Park teaches maintaining a biased ReLU value that reflects the historical activation behavior of each neuron and further teaches predicting whether the neuron will be activated or deactivated based on that accumulated activation tendency) It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the continual learning method of Butvinik with the neural network training operations and parallel neural network architecture of Naumov and further combine the technique taught by Park of using an activation history of neurons to guide weight updates in continual learning systems to mitigate catastrophic forgetting. Park explains that activation history isolates weights that are important for previous tasks so that those weights are updated in a controlled manner during subsequent training. Park further teaches that this technique improves knowledge preservation and reduces training time compared with existing continual learning approaches. Incorporating Park’s activation history based update techniques into the combined Butvinik and Naumov system allows the continual learning model to perform weight updates based on neuron activation tendencies while improving knowledge preservation and training efficiency during continual learning (Park, Page 1 – Abstract, “Speculative Backpropagation (SB) and Activation History (AH). The SB enables performing backpropagation based on past knowledge. The AH enables isolating important weights for the previous task. We evaluated the performance of our scheme in terms of accuracy and training time. 
The experiment results show a 4.4% improvement in knowledge preservation and a 31% reduction in training time, compared to the state-of-the-arts (EWC and SI)”, thus Park teaches that using an activation history of neurons to guide weight updates during continual learning improves knowledge preservation and reduces training time, thereby enabling the continual learning model to update weights based on neuron activation tendencies while preserving knowledge learned from previous tasks and improving the efficiency of the training process) Regarding Claim 11, Butvinik combined with Naumov teaches all of the limitations of claim 7 as cited above, and Butvinik further teaches: wherein, in a training process for a second task performed after training for a first task is completed, the update operation is performed […] (Butvinik, Par. [0004], “when a neural network is used to learn a sequence of tasks, the learning of the later tasks may degrade the performance of the models learned for the earlier tasks. OIL algorithms try to achieve this same ability for neural networks and to solve the catastrophic forgetting problem. Thus, in essence, continual learning performs incremental learning of new tasks”, & Par. [0010], “The machine learning model may be trained in a sequence of a plurality of sequential training iterations respectively associated with the sequence of a plurality of training tasks.
In each of the plurality of sequential training iterations the machine learning model is trained by generating the task-specific parameters for the current training iteration by applying a propagator to the one or more training samples associated with the current training task”, thus performing the update operation during training for a second task after training for a first task is disclosed because Butvinik teaches training a machine learning model across a sequence of training tasks and further teaches that in each sequential training iteration the model is trained using samples associated with the current task, which corresponds to performing training operations for a subsequent task after completion of training for a previous task) Butvinik combined with Naumov does not explicitly teach […] based on an activation tendency of each of neurons. However, Park teaches […] based on an activation tendency of each of neurons. […] based on an activation tendency of each of neurons (Park, Page 5 – Section IV, “A variable called activation history