DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Specification
The title of the invention is not descriptive. A new title is required that is clearly indicative of the invention to which the claims are directed.
Claim Rejections - 35 USC § 101
According to the first part of the analysis, in the instant case, claims 1-9 are directed to an apparatus and claims 10-15 are directed to a method. Each of these claims falls within one of the four statutory categories (i.e., process, machine, manufacture, or composition of matter).
For claim 1,
Step 2A Prong One
obtain a first loss function based on output data, obtained by inputting the learning data to the neural network model, and a label corresponding to the learning data,
(This step for obtaining a loss function based on data and a label is considered a mental process)
obtain a size of a weight change amount of each of a plurality of layers included in the neural network model based on the first loss function,
(This step for obtaining a size of a weight change is considered a mental process)
Step 2A Prong Two
An electronic device comprising: a memory storing a pre-trained neural network model and learning data; and a processor configured to:
(This step for executing the mental processes using a generic computing device is considered mere instructions to apply an exception. See MPEP § 2106.05(f))
and train the neural network model by updating a weight of at least one layer, among the plurality of layers, for which a size of the weight change amount exceeds a first threshold value, wherein at least one other layer, among the plurality of layers, for which a size of the weight change amount does not exceed the first threshold value is not updated.
(This step for training a neural network based on the weight change threshold is considered extra solution activity. See MPEP § 2106.05(g))
Step 2B
The claim recites mental processes such as obtaining a loss function and a weight change amount. However, the additional element of training the neural network is specific enough that it is not considered well-understood, routine, and conventional activity, and it therefore integrates the abstract ideas into a practical application. Claim 1 is therefore not rejected under 101.
For claim 2
Step 2A Prong One
The electronic device of claim 1, wherein the processor is further configured to: in a direction from an output layer to an input layer of the neural network model, identify an initial first layer for which the size of the weight change amount is less than the first threshold value,
(This step for identifying an initial first layer is considered a mental process)
Step 2A Prong Two
and train the neural network model by updating a weight of at least one layer previous to the identified first layer in the direction from the output layer to the input layer.
(This step for training a neural network by updating weights of specified layers is considered extra solution activity. See MPEP § 2106.05(g))
Step 2B
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception because, when considered individually and in combination, they do not add significantly more (also known as an inventive concept) to the exception. The claim recites mental processes such as identifying a first layer while the additional element of generically updating weights of specified layers of a neural network at a high level of generality is a well-understood, routine, and conventional activity, as recognized by the court decisions listed in MPEP § 2106.05(d).
For claim 4,
Step 2A Prong One
obtain, for each layer of the plurality of layers, a difference between a size of weight change amount of the layer obtained in an i+1th training of the neural network model and the stored size of weight change amount of the layer,
(This step for obtaining a difference in weight change sizes is considered a mental process)
Step 2A Prong Two
The electronic device of claim 1, wherein the processor is further configured to: store, in the memory, a size of weight change amount of each layer of the plurality of layers obtained based on the neural network model being trained i times,
(This step for storing data is considered extra-solution activity. See MPEP § 2106.05(g))
and train the neural network model by updating the weight of at least one layer for which the obtained difference is greater than or equal to a third threshold value.
(This step for training specified layers of a neural network is considered extra solution activity)
Step 2B
The claim recites mental processes such as obtaining a difference in weight changes. However, the additional element of training the neural network by updating a weight for a layer when the obtained difference of weight change amounts is greater than or equal to a threshold is not well-understood, routine, and conventional activity. Therefore, claim 4 is not rejected under 101.
For claim 6,
Step 2A Prong One
obtain a second loss function based on output data, obtained by inputting the learning data to a neural network model into which the third layer is inserted, and the label corresponding to the learning data,
(This step for obtaining a loss function based on data and a label is considered a mental process)
Step 2A Prong Two
The electronic device of claim 1, wherein the processor is further configured to: insert a third layer into a region of at least one of the plurality of layers,
(This step for inserting a layer into a neural network is considered extra-solution activity. See MPEP § 2106.05(g))
and train the neural network model by updating a weight of the third layer based on the second loss function.
(This step for training a neural network by updating weights is considered extra-solution activity. See MPEP § 2106.05(g))
Step 2B
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception because, when considered individually and in combination, they do not add significantly more (also known as an inventive concept) to the exception. The claim recites mental processes while the additional elements of inserting layers into a neural network and updating weights of a neural network at a high level of generality are well-understood, routine, and conventional activities, as recognized by the court decisions listed in MPEP § 2106.05(d).
For claim 8,
Step 2A Prong One
The electronic device of claim 1, wherein the processor is further configured to: set a window to include at least a fourth layer consecutively connected among the plurality of layers and data related to the fourth layer,
(This step for setting a window of layers is considered a mental process)
and following completion of the operation: slide the window by a preset unit relative to the plurality of layers, such that the fourth layer is newly excluded from the window and a fifth layer is newly included in the window,
(This step for excluding and including layers in the set window is considered a mental process)
Step 2A Prong Two
perform an operation by loading each layer and data included in the window,
(This step for loading data is considered extra-solution activity. See MPEP § 2106.05(g))
unload the fourth layer and data related solely to the fourth layer, and load the fifth layer and data related to the fifth layer.
(This step for loading and unloading data is considered extra-solution activity. See MPEP § 2106.05(g))
Step 2B
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception because, when considered individually and in combination, they do not add significantly more (also known as an inventive concept) to the exception. The claim recites mental processes while the additional elements of loading and unloading data at a high level of generality are well-understood, routine, and conventional activities, as recognized by the court decisions listed in MPEP § 2106.05(d).
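For reference only, the window operations recited in claim 8 (set a window, operate on its contents, then slide it so that one layer is unloaded and another is loaded) can be pictured with the following minimal sketch. The function name, the layer names, and the window size are hypothetical and do not come from the application or the cited art.

```python
def sliding_window_schedule(layer_names, window_size, step=1):
    """Return, for each window position, which layers are loaded and unloaded.

    Hypothetical illustration: layers are plain names, and "loading" is only
    recorded, not performed.
    """
    schedule = []
    previous = set()
    for start in range(0, len(layer_names) - window_size + 1, step):
        window = layer_names[start:start + window_size]
        current = set(window)
        schedule.append({
            "window": window,
            # Layers newly included in the window must be loaded.
            "load": [n for n in window if n not in previous],
            # Layers newly excluded from the window can be unloaded.
            "unload": [n for n in layer_names if n in previous and n not in current],
        })
        previous = current
    return schedule


schedule = sliding_window_schedule(["L1", "L2", "L3", "L4"], window_size=2)
```

In this sketch, each slide by one unit unloads exactly one layer and loads exactly one layer, which mirrors the "fourth layer" and "fifth layer" language of the claim.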
For claims 10 through 15;
Claims 10 through 15 are method claims directly corresponding to claims 1 through 6, respectively, and are therefore rejected for the same reasoning.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
NOTE: Under BRI, a “weight change amount” can be read as any quantity that represents, determines, or directly drives how a layer’s weights should be changed (gradient, delta weight, update step, norm/magnitude of any of these, etc.), using the following excerpt from the applicant spec:
“[0040] The weight change amount of each of a plurality of layers means the number to be changed so that weights of each of the plurality of layers can minimize the first loss function value. The size of the weight change amount may be expressed as weight loss, size of return derivative, or size of differential (e.g., L2 norm of derivative).”
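As an illustration of the "size of differential (e.g., L2 norm of derivative)" reading quoted above, the following minimal sketch computes one size value per layer. The layer names and gradient values are hypothetical, and a flat list of floats stands in for a layer's derivative.

```python
import math


def weight_change_size(gradient):
    """Return the L2 norm of one layer's gradient (a flat list of floats)."""
    return math.sqrt(sum(g * g for g in gradient))


# Hypothetical per-layer derivatives of the loss with respect to the weights.
layer_gradients = {
    "layer_1": [0.0, 0.0],
    "layer_2": [3.0, 4.0],
}

# One scalar "size of the weight change amount" per layer.
sizes = {name: weight_change_size(g) for name, g in layer_gradients.items()}
```

Under this reading, each layer is reduced to a single non-negative number that can be compared against a threshold.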
Claim(s) 1, 5, 10, 14 is/are rejected under 35 U.S.C. 103 as being unpatentable over Yang Lin et al. (hereinafter Yang) (US 20200302276 A1, 2020-09-24) in view of Kai Yutaka et al. (hereinafter Kai) (US 12505379 B2, 2025-12-23).
Regarding claim 1, Yang teaches;
An electronic device comprising: a memory storing a pre-trained neural network model and learning data;
([0022] In some examples, the AI chip in the AI system 114 may include an embedded cellular neural network that has memory containing the multiple parameters in the CNN. In some scenarios, the memory in an AI chip may be a one-time-programmable (OTP) memory that allows a user to load a CNN model into the physical AI chip once.)
NOTE: Teaches a memory storing a pre-trained neural network model
([0024] In some scenarios, training data may reside in a memory in a host device.)
NOTE: Teaches the memory storing learning data.
and a processor configured to:
([0022] In other examples, the AI chip may include a subset of the convolutional, Pooling, and ReLU layers in a CNN model. In such case, the AI chip may perform certain computations in an AI task, leaving the remaining computations in the AI task performed in a CPU/GPU or other host processors outside the AI chip.)
NOTE: Teaches a processor for performing the methods of the disclosure.
obtain a first loss function based on output data, obtained by inputting the learning data to the neural network model, and a label corresponding to the learning data,
[Image: media_image1.png]
NOTE: Teaches obtaining a first loss function based on output data (prediction of the network), obtained by inputting the learning data to the neural network model (output of the CNN on the ith training instance), and a label corresponding to the learning data (the CNN output includes two image labels).
obtain a size of a weight change amount of each of a plurality of layers included in the neural network model based on the first loss function,
([0047] Returning to FIG. 4, in some examples, the operations 408 {determining a change of weights} and 410 {updating the weights} may also be performed in a layer by layer fashion in a backward propagation, in which a change of weights is determined for each layer in a CNN from the last layer to the first layer (or a subset of the convolution layers in the CNN)… and the changes of weights may be determined based on the loss function. This is further explained.)
NOTE: Teaches obtaining a size of a weight change amount of each of a plurality of layers included in the neural network model (change of weights is determined for each layer in the CNN) based on the first loss function (the change of weights is determined based on the loss function).
Yang fails to teach but Kai teaches;
and train the neural network model by updating a weight of at least one layer, among the plurality of layers, for which a size of the weight change amount exceeds a first threshold value,
wherein at least one other layer, among the plurality of layers, for which a size of the weight change amount does not exceed the first threshold value is not updated.
([col. 3, lines 38-51] Specifically, for example, the reference technique detects a layer in which a learning rate indicating a progress of learning is deteriorated and omits learning with respect to the layer so as to shorten the learning time. For example, in each layer in which a difference between an error gradient at the time of current iteration and an error gradient at the time of previous iteration is equal to or more than a threshold, learning is performed as usual at the time of next iteration. In each layer in which the difference is less than the threshold, learning skip is performed at the time of next iteration. In other words, for example, in the layer in which the learning rate is deteriorated, the subsequent machine learning processing for calculating an error gradient or the like is suppressed.)
NOTE: Discloses performing the learning processes for layers having a size of the weight change amount (difference in error gradient between iterations) greater than a threshold, and skipping the learning processing for layers having a weight change amount less than the threshold.
([col. 4, lines 5-20] Here, an example of the learning skip used in the first embodiment will be described. FIG. 3 is a diagram for explaining machine learning of the information processing device 10 according to the first embodiment. As illustrated in FIG. 3, in deep learning of a machine learning model, machine learning (calculation processing) through forward propagation and processing for updating weights or the like through backward propagation are executed. Therefore, the information processing device 10 stops update of weight information from an iteration in which learning is progressed to some extent at the time of updating through the backward propagation. At this time, the update in an input-side layer is stopped first. This is because, although there is a case where learning accuracy does not reach target accuracy when stopping an output side, the effect on the accuracy on the input side is low.)
NOTE: The aforementioned learning processes include updating weights of participating layers (the aforementioned layers having a gradient difference greater than the specified threshold) through backward propagation, and skipped layers (the aforementioned layers having a gradient difference less than the specified threshold) are not updated. This therefore teaches training the neural network model by updating a weight of at least one layer, among the plurality of layers, for which a size of the weight change amount exceeds a first threshold value (layers exceeding gradient difference threshold), wherein at least one other layer, among the plurality of layers, for which a size of the weight change amount does not exceed the first threshold value is not updated (layers not exceeding gradient difference threshold).
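The selective update described in the note above can be sketched as follows. This is a minimal illustration that follows the claim language as mapped in the note, not Kai's implementation; the function name, the plain gradient-descent update, and the learning rate are assumptions.

```python
def train_step(weights, weight_changes, sizes, threshold, learning_rate=0.5):
    """Update only layers whose weight-change size exceeds the threshold.

    weights and weight_changes map a layer name to a flat list of floats;
    sizes maps a layer name to a scalar size of its weight change amount.
    """
    updated = {}
    for name, layer_weights in weights.items():
        if sizes[name] > threshold:
            # Layer exceeds the threshold: apply the update.
            updated[name] = [
                w - learning_rate * d
                for w, d in zip(layer_weights, weight_changes[name])
            ]
        else:
            # Layer does not exceed the threshold: leave weights unchanged.
            updated[name] = list(layer_weights)
    return updated


new_weights = train_step(
    weights={"a": [1.0, 1.0], "b": [1.0, 1.0]},
    weight_changes={"a": [2.0, 0.0], "b": [0.01, 0.0]},
    sizes={"a": 2.0, "b": 0.01},
    threshold=0.5,
)
```

Here layer "a" is updated and layer "b" is left as it was, which corresponds to the "is not updated" limitation of claim 1.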
OBVIOUSNESS TO COMBINE YANG WITH KAI:
Yang and Kai are analogous art to each other and the present disclosure as they all pertain to neural networks. Specifically, Yang pertains to training and compressing a convolutional neural network while Kai pertains to a method for improving the process of a learning skip in a machine learning model.
Additionally, Yang discloses a method for obtaining a size of weight change amount for each layer of a neural network, while Kai uses a threshold based on a size of weight change amount to determine which layers to skip in the learning process. Both of these elements perform the same functions in combination as they do separately. Therefore, obtaining the size of weight change amount as taught by Yang to compare layers with the size of weight change amount threshold disclosed by Kai would be combining prior art elements according to known methods to yield predictable results.
Kai further states;
([col. 3, lines 38-41] Specifically, for example, the reference technique detects a layer in which a learning rate indicating a progress of learning is deteriorated and omits learning with respect to the layer so as to shorten the learning time.)
NOTE: This excerpt details that by skipping updating of layers using the method disclosed by Kai, the learning time is shortened, which improves the efficiency of the system.
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to use the change of weights calculation disclosed by Yang in the learning skip determination disclosed by Kai to improve the efficiency of the neural network during learning.
Regarding claim 5, Yang in view of Kai teaches;
The electronic device of claim 1, wherein the processor is further configured to:
(Using the same reasoning as claim 1)
Yang teaches;
based on connection between a first layer of which
([0055] In some examples, the process 400 may combine the re-training with variable compression schemes. For example, for a given convolution layer in the CNN, if the quantization bits exceed a threshold (high quantization bits), the process 400 may skip updating the weights for that given layer... In the example above, the convolution layers whose weights are quantized at higher bits (lower compression ratio) are still participating in the re-training process, except no weights for those layers are updated. This results in the speedup of the training process. --- [0037] In the example in FIG. 3B, the change of weights ΔW.sub.A for layer A (306) may be determined based on the change of weights ΔW.sub.A+1; the change of weights ΔW.sub.A+1 for layer A+1 (308) may be determined based on the change of weights ΔW.sub.A+2, so on and so forth; ... In the example in FIG. 3B, if the weights of a layer is to be re-trained, the updated weights at time t+1 are determined based on the change of weights for that layer.)
NOTE: Teaches based on connection between a first layer (layer A+1 for example) of which an amount exceeds a threshold value and a second layer (layer A for example) in a skip connection structure (certain layers [A+1 for example] may be skipped if their quantization bits exceed a threshold), transmit the size of the weight change amount of the first layer to the second layer (even if the first layer, A+1, is skipped, it still participates in the training process [it just doesn't get updated]. Additionally, a change of weights for layer A is based on the change of weights of layer A+1, meaning the size of weight change amount [change of weights] of the first layer [A+1] is transmitted to the second layer [A]) and update the weight of the second layer (updated weights determined for each layer during re-training).
Reasoning as to why it would be obvious for the quantization threshold of Yang to be the first threshold disclosed in claim 1 will be explained further below.
Yang fails to teach but Kai teaches;
a first layer of which the size of the weight change amount exceeds the first threshold value
([col. 3, lines 38-51] Specifically, for example, the reference technique detects a layer in which a learning rate indicating a progress of learning is deteriorated and omits learning with respect to the layer so as to shorten the learning time. For example, in each layer in which a difference between an error gradient at the time of current iteration and an error gradient at the time of previous iteration is equal to or more than a threshold, learning is performed as usual at the time of next iteration. In each layer in which the difference is less than the threshold, learning skip is performed at the time of next iteration. In other words, for example, in the layer in which the learning rate is deteriorated, the subsequent machine learning processing for calculating an error gradient or the like is suppressed.)
NOTE: Kai discloses a method for skipping layers where layers having a size of weight change amount (difference of error gradient) greater than a threshold are not skipped, and the remaining layers are skipped.
OBVIOUSNESS:
([col. 4, lines 5-15] As illustrated in FIG. 3, in deep learning of a machine learning model, machine learning (calculation processing) through forward propagation and processing for updating weights or the like through backward propagation are executed.)
NOTE: Kai additionally states a method of backward propagation for updating weights, which would transmit gradient values between layers, similarly.
This process of selecting layers of a neural network to skip based on a threshold value is significantly similar to the skipping process disclosed by Yang, and the substitution of the threshold values (substituting the quantization threshold of Yang with the size of weight change threshold of Kai) would therefore be a simple substitution of one known element for another to yield predictable results.
Further, the objective of both of these processes is to shorten the learning time of the model to improve efficiency.
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to perform the process of claim 5 as taught by Yang, swapping the threshold utilized by Yang with the threshold utilized by Kai, to skip designated layers to improve efficiency of the system during learning.
Regarding claim 10,
Claim 10 is a method directly corresponding to claim 1, and is rejected using the same reasoning.
Regarding claim 14,
Claim 14 is a method directly corresponding to claim 5, and is rejected using the same reasoning.
Claim(s) 2, 11 is/are rejected under 35 U.S.C. 103 as being unpatentable over Yang (US 20200302276 A1, 2020-09-24) in view of Kai (US 12505379 B2, 2025-12-23) further in view of Ayush Manish Agrawal et al. (hereinafter Agrawal) (“Investigating Learning in Deep Neural Networks using Layer-Wise Weight Change”, 2020-12-01).
Regarding claim 2,
Yang in view of Kai teaches;
The electronic device of claim 1, wherein the processor is further configured to:
(Using the same reasoning from claim 1)
Yang teaches;
and train the neural network model by updating a weight of at least one layer previous to
([0047] Returning to FIG. 4, in some examples, the operations 408 and 410 may also be performed in a layer by layer fashion in a backward propagation, in which a change of weights is determined for each layer in a CNN from the last layer to the first layer (or a subset of the convolution layers in the CNN), and the weights in each layer are updated based on the change of weights.)
NOTE: Teaches updating the weights of every layer in a direction from the output layer to the input layer. This would include updating every layer previous to some other identified layer.
Yang fails to teach but Kai teaches;
identify an initial first layer for which the size of the weight change amount is less than the first threshold value,
([col. 3, lines 38-51] Specifically, for example, the reference technique detects a layer in which a learning rate indicating a progress of learning is deteriorated and omits learning with respect to the layer so as to shorten the learning time. For example, in each layer in which a difference between an error gradient at the time of current iteration and an error gradient at the time of previous iteration is equal to or more than a threshold, learning is performed as usual at the time of next iteration. In each layer in which the difference is less than the threshold, learning skip is performed at the time of next iteration. In other words, for example, in the layer in which the learning rate is deteriorated, the subsequent machine learning processing for calculating an error gradient or the like is suppressed.)
NOTE: Teaches identifying an initial first layer (identifying each layer in which the difference is less than the threshold includes at least an initial first layer) for which the size of the weight change amount is less than the first threshold value.
Yang and Kai fail to teach but Agrawal teaches;
in a direction from an output layer to an input layer of the neural network model, identify an initial first layer for which the size of the weight change amount is less
([pg. 7] In general, we see that relative weight change increases in later layers as compared to earlier ones across the different convolutional architectures, both deep and shallow, and across the different classification tasks.)
NOTE: Teaches in a direction from an output layer to an input layer of the neural network model, identifying layers (which would include a first layer) for which the size of the weight change amount is less (they identify that the weight changes for successive layers in a direction from the output layer to the input layer (later layers to earlier layers) decreases, i.e. the weight change of earlier layers is less than that of later layers).
From this identification, it would then be obvious to use the threshold value taught by Kai to identify the claimed initial first layer in a direction from the output layer to the input layer, as further explained below.
OBVIOUSNESS TO COMBINE AGRAWAL WITH YANG AND KAI:
Agrawal is analogous art to Yang, Kai, and the present disclosure as they all pertain to neural networks. Specifically, Agrawal pertains to investigating learning in deep neural networks using layer-wise weight changes.
Agrawal states;
([pg. 7] In general, we see that relative weight change increases in later layers as compared to earlier ones across the different convolutional architectures, both deep and shallow, and across the different classification tasks.)
NOTE: Agrawal discloses that layers closer to the output tend to exhibit greater relative weight change than layers closer to the input.
Kai teaches using a threshold on a weight-change amount to determine whether a layer should continue to be updated. In view of Agrawal’s teaching that later layers tend to have larger weight changes, a person of ordinary skill in the art would have understood that layers closer to the output are more likely to satisfy or exceed Kai’s threshold, while progressively earlier layers are more likely to fall below that threshold.
Accordingly, if one were applying Kai’s threshold criterion to the layer-wise weight changes described by Agrawal in order to identify a boundary between layers that should continue to be updated and layers that do not need to be updated, it would have been obvious to examine the layers in the direction from the output layer toward the input layer. In that traversal, the first layer encountered whose weight-change amount is less than the threshold would naturally identify the cutoff point.
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to perform the identification of the initial first identified layer in a direction from the output layer to the input layer, in order to simplify finding the cutoff point for which layers of the neural network need to be updated.
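The output-to-input traversal reasoned above can be sketched as follows. This is a hypothetical illustration of the cutoff search only; the function name and the sample sizes are assumptions and are not taken from Yang, Kai, or Agrawal.

```python
def layers_before_cutoff(sizes_input_to_output, threshold):
    """Scan from the output layer toward the input layer.

    Stops at the first layer whose weight-change size is below the threshold
    and returns the indices of the layers visited before that cutoff.
    Indices refer to positions in the input-to-output ordering.
    """
    selected = []
    for index in range(len(sizes_input_to_output) - 1, -1, -1):
        if sizes_input_to_output[index] < threshold:
            break  # First layer below the threshold marks the cutoff.
        selected.append(index)
    return selected


# Hypothetical sizes that grow toward the output layer, in line with the
# trend the Office action cites from Agrawal.
to_update = layers_before_cutoff([0.01, 0.02, 0.5, 0.9], threshold=0.1)
```

With these sample values the scan visits the two layers nearest the output and stops at the third, so only those two layers would be updated.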
Regarding claim 11,
Claim 11 is a method directly corresponding to claim 2, and is rejected using the same reasoning.
Claim(s) 3, 12 is/are rejected under 35 U.S.C. 103 as being unpatentable over Yang (US 20200302276 A1, 2020-09-24) in view of Kai (US 12505379 B2, 2025-12-23) further in view of Sannidhi P Kumar et al. (hereinafter Kumar) (“Meta-Cognition-Based Simple and Effective Approach to Object Detection”, 2020-12-02).
Regarding claim 3,
Yang in view of Kai teaches;
The electronic device of claim 1,
(Using the same reasoning from claim 1)
Yang and Kai fail to teach but Kumar teaches;
wherein the processor is further configured to: based on a learning number of the neural network model exceeding a preset value, update the first threshold value to a second threshold value, wherein the second threshold value is a value smaller than the first threshold value.
[pg. 2]
[Image: media_image2.png]
NOTE: Teaches based on a learning number of the neural network model exceeding a preset value (one epoch), update the first threshold value to a second threshold value (each epoch the threshold is updated), wherein the second threshold value is a value smaller than the first threshold value (the threshold value decays each iteration, i.e. the second threshold is smaller than the previous).
OBVIOUSNESS TO COMBINE KUMAR WITH YANG AND KAI:
Kumar is analogous art to Yang, Kai, and the present disclosure as they all pertain to machine learning. Specifically, Kumar pertains to improving deep-learning-based object detection models.
Additionally, Kumar states;
[Image: media_image2.png]
NOTE: Kumar teaches that a threshold may be decreased as training progresses because loss is higher in the initial phase and decreases over time. Thus, one would have been motivated to update the first threshold value of claim 1 to a smaller threshold value after further training so that the threshold tracks the changing training state.
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to update the first threshold value of claim 1 to a second, smaller threshold value after further training, so that the threshold tracks the changing training state.
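The threshold update recited in claim 3 can be pictured with the following minimal sketch. It follows the claim language only; it does not reproduce Kumar's decay schedule, and the function name and sample values are hypothetical.

```python
def current_threshold(learning_count, preset_count, first_threshold, second_threshold):
    """Return the threshold to use at a given learning count.

    The smaller second threshold applies once the learning count exceeds
    the preset count; otherwise the first threshold applies.
    """
    if second_threshold >= first_threshold:
        raise ValueError("second_threshold must be smaller than first_threshold")
    if learning_count > preset_count:
        return second_threshold
    return first_threshold


early = current_threshold(5, preset_count=10, first_threshold=0.5, second_threshold=0.1)
late = current_threshold(11, preset_count=10, first_threshold=0.5, second_threshold=0.1)
```

A lower threshold later in training keeps more layers eligible for updating as the weight change amounts themselves shrink.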
Regarding claim 12,
Claim 12 is a method directly corresponding to claim 3, and is rejected using the same reasoning.
Claim(s) 4, 13 is/are rejected under 35 U.S.C. 103 as being unpatentable over Yang (US 20200302276 A1, 2020-09-24) in view of Kai (US 12505379 B2, 2025-12-23) further in view of Yasushi Hara (hereinafter Hara) (US 20210397948 A1, 2021-12-23).
Regarding claim 4, Yang in view of Kai teaches;
The electronic device of claim 1, wherein the processor is further configured to:
(Using the same reasoning from claim 1)
Yang teaches;
store, in the memory, a size of weight change amount of each layer of the plurality of layers obtained based on the neural network model being trained i times
([0036] FIG. 3B illustrates a diagram of an example process of backward propagation in re-training weights of a neural network in accordance with various examples described herein. In some examples, the retraining process (e.g., 204, 206 in FIG. 2) may be implemented in a backward propagation network 320. In FIG. 3B, in the backward propagation network 320, each of the convolution layers of the CNN model may be updated based on a change of weights. For example, the change of weights for each layer may be determined based on the change of weights in the proceeding layer in the backward propagation)
NOTE: Teaches storing, in the memory, a size of weight change amount of each layer of the plurality of layers obtained (change of weights for each layer) based on the neural network model being trained i times (the change of weights is determined during backpropagation, which is part of the neural network model training process, and being trained ‘i’ times could be a single time if ‘i’ == 1)
Yang and Kai fail to teach but Hara teaches;
obtain, for each layer of the plurality of layers, a difference between a size of weight change amount of the layer obtained in an i+1th training of the neural network model and the stored size of weight change amount of the layer,
([0103] The following description will be made on a case where the GPU 104-1 determines whether to specify the layer n as a skip layer in an iteration m. In an iteration m−1, the GPU 104-1 records an error gradient Δw.sub.n,m-1 of the layer n. In the iteration m, the GPU 104-1 calculates an error gradient Δw.sub.n,m of the layer n and calculates an error gradient difference ΔA.sub.n,m=Δw.sub.n,m-1−Δw.sub.n,m by subtracting the error gradient in the iteration m from the error gradient in the iteration m−1.)
NOTE: Teaches obtaining, for each layer of the plurality of layers (calculating a gradient difference for each layer), a difference between a size of weight change amount (under BRI, the error gradient is a weight-update-related quantity used to determine the change to a layer’s weights, and is therefore considered to be a weight change amount) of the layer obtained in an i+1th training (iteration m of the above disclosed training process) of the neural network model and the stored size of weight change amount of the layer (iteration m-1).
and train the neural network model by updating the weight of at least one layer for which the obtained difference is greater than or equal to a third threshold value.
([0105] The GPU 104-1 determines whether the error gradient difference ΔA.sub.n,m is less than the threshold. If the error gradient difference ΔA.sub.n,m is equal to or more than the threshold, the GPU 104-1 performs the BACKWARD phase, the COMMUNICATE phase, and the UPDATE phase on the layer n in an iteration m+1, without specifying the layer n as a skip layer.)
NOTE: Layers having the obtained difference (the aforementioned gradient difference) greater than or equal to a third threshold (error gradient difference greater than a threshold) are not specified as a skip layer during the backward phase, meaning they do participate in the backward phase.
([0076] In the BACKWARD phase, the GPU 104-1 calculates the error gradients of the weights of the individual edges in the backward order from the output layer to the input layer of the multilayer neural network 310… These error gradients are used to update the weights of the edges in such a manner that the error is reduced.)
NOTE: Teaches training the neural network model by updating the weight (in the backward phase, the error gradients are used to update the weights) of at least one layer for which the obtained difference is greater than or equal to a third threshold value (the aforementioned non-skip layers participate in the backward phase, where the non-skip layers are the layers for which the obtained difference is greater than or equal to the aforementioned third threshold value).
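NOTE (illustrative only): The selection logic mapped above can be sketched as follows. This is a minimal sketch and is not code from Hara or from the application. The function name `select_layers_to_update`, the use of one scalar "size" per layer, and the use of an absolute difference are all assumptions made for illustration.

```python
def select_layers_to_update(stored_sizes, new_sizes, threshold):
    """Return indices of layers whose change between iterations meets the threshold.

    stored_sizes: per-layer sizes recorded in the earlier iteration (assumed scalars)
    new_sizes:    per-layer sizes obtained in the next iteration
    threshold:    layers with a difference >= threshold are kept for updating
    """
    selected = []
    for layer, (previous, current) in enumerate(zip(stored_sizes, new_sizes)):
        # Assumption: the magnitude of the difference is compared to the threshold.
        difference = abs(previous - current)
        if difference >= threshold:
            selected.append(layer)  # not a skip layer; its weights are updated
    return selected


if __name__ == "__main__":
    stored = [0.50, 0.20, 0.10]   # hypothetical values from iteration m-1
    new = [0.20, 0.19, 0.10]      # hypothetical values from iteration m
    print(select_layers_to_update(stored, new, threshold=0.05))
```

Under these assumptions, layers not returned by the function correspond to the skip layers of Hara, and the returned layers correspond to the layers whose weights are updated.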
OBVIOUSNESS TO COMBINE HARA WITH YANG AND KAI:
Hara is analogous art to Yang and Kai as they all pertain to machine learning. Specifically, Hara pertains to a learning method utilizing a calculated gradient of the error for each layer in a machine learning model.
Additionally, Hara states;
([0039] The information processing apparatus 10 according to the first embodiment calculates the difference 17 between the error gradient 17a of the layer 13b in the iteration 16a and the error gradient 17b of the layer 13b in the iteration 16b. If the difference 17 is less than the threshold 18, the calculation of the error gradient of the layer 13b and the updating of the parameter 14b are skipped in the subsequent iteration 16c. In this way, unnecessary parameter update processing is skipped for a layer whose parameter optimization has converged earlier than the other layers and whose parameter will not improve. Thus, since less unnecessary processing is performed in the machine learning, the calculation amount is reduced. In addition, the execution time of the machine learning for generating the model 13 is consequently shortened.)
NOTE: This excerpt discloses that the process of updating only layers of which the obtained difference is greater than the threshold allows for less unnecessary processing and improved execution time of the model.
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to obtain a difference for each layer and to update only the weights of the layers having a difference exceeding a threshold as taught by Hara, in order to improve the efficiency of the system of claim 1.
Regarding claim 13,
Claim 13 is a method directly corresponding to claim 4, and is rejected using the same reasoning.
Claim(s) 6, 15 is/are rejected under 35 U.S.C. 103 as being unpatentable over Yang (US 20200302276 A1, 2020-09-24) in view of Kai (US 12505379 B2, 2025-12-23) further in view of Krishnamoorthy, Madhusudhanan (hereinafter Krishnamoorthy) (US 20210042623 A1, 2021-02-11).
Regarding claim 6, Yang in view of Kai teaches;
The electronic device of claim 1, wherein the processor is further configured to:
(Using the same reasoning as the claim 1 rejection)
Yang and Kai fail to teach but Krishnamoorthy teaches;
insert a third layer into a region of at least one of the plurality of layers,
[fig. 3]
NOTE: Teaches at least 2 other layers, where an inserted layer would then be considered a third layer.
([0130] In some embodiments, for mapping the mutations in hyper parameters, the system may determine that the plurality of first convolutional neural network layers of the first image processing model is associated with a first number of convolutional neural network layers. The system may further determine that the plurality of second convolutional neural network layers of the second image processing model is associated with a second number of convolutional neural network layers. In response, the system may determine at least one mutation based on determining that the second number of convolutional neural network layers is different from the first number of convolutional neural network layers, e.g., based on identifying that one or more layers have been inserted or removed.)
NOTE: Teaches a neural network having one or more inserted layers (such as a third layer) in the plurality of layers.
obtain a second loss function based on output data, obtained by inputting the learning data to a neural network model into which the third layer is inserted, and the label corresponding to the learning data,
([0128] Next, the system may determine a second plurality of hyper parameters associated with the second image processing model, at block 708. The second plurality of hyper parameters of the second image processing model may be similar to those described above, and may comprise... (iv) second loss function, ...)
NOTE: Teaches an obtained second loss function for the neural network.
([0124] Moreover, each image processing model typically comprises one or more loss functions. In machine-learning image processing and classification, minimized objective functions or loss functions may represent how well the program predicts the expected outcome in comparison with the ground truth, i.e., the cost/value of inaccuracy of predictions (problems of identifying which category a particular image belongs to))
NOTE: Teaches the second loss function being based on output data, obtained by inputting the learning data (images) to the neural network model into which the third layer is inserted (the loss function represents how well the program predicts the expected outcome in comparison with the ground truth), and the label corresponding to the learning data (ground truth).
and train the neural network model by updating a weight of the third layer based on the second loss function.
([0125] The learning rate is a hyperparameter that controls how much to change the model in response to the estimated error each time the model weights are updated. It determines to what extent newly acquired information overrides old information, i.e., indicates learning rate decay or momentum. In some embodiments, the learning rate component is a configurable hyperparameter used in the training of neural networks that has a small positive value, typically in the range between 0.0 and 1.0. In other words, the amount that the weights of the model are updated during training is referred to as the step size or the learning rate component. Each image processing model may further comprise one or more optimization functions. Optimization functions are structured to minimize (or maximize) an objective function, i.e., an Error function of the model.)
NOTE: Teaches training the neural network model by updating a weight of the third layer (weights of the model are updated during training, which would include the weights of the inserted third layer) based on the second loss function (the amount that the weights are updated is influenced by the estimated error of the model, which is determined using the loss function).
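NOTE (illustrative only): The relationship described in [0125], where the learning rate scales how far the weights move in response to the estimated error, can be sketched as a plain gradient-descent step. This is a minimal sketch under stated assumptions and is not code from Krishnamoorthy. The function name `sgd_step` and the example values are hypothetical, and the gradients are assumed to have already been derived from the loss function.

```python
def sgd_step(weights, gradients, learning_rate):
    """Apply one gradient-descent update: w <- w - learning_rate * gradient."""
    return [w - learning_rate * g for w, g in zip(weights, gradients)]


if __name__ == "__main__":
    # Hypothetical weights of an inserted layer and gradients of a loss
    # with respect to those weights.
    inserted_layer_weights = [1.0, 2.0]
    loss_gradients = [0.5, -1.0]
    print(sgd_step(inserted_layer_weights, loss_gradients, learning_rate=0.5))
```

A smaller learning rate produces a proportionally smaller weight change for the same gradients.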
OBVIOUSNESS TO COMBINE KRISHNAMOORTHY WITH YANG AND KAI:
Krishnamoorthy is analogous art to Yang and Kai and the present disclosure as they all pertain to neural networks. Specifically, Krishnamoorthy pertains to a management process for image processing models.
Additionally, inserting a third layer into a neural network, obtaining a new loss function, and training the neural network containing the third layer performs the same function separately as it does in combination with the system of claim 1. One of ordinary skill in the art would have recognized that the results of this combination are predictable.
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to insert a third layer into the claimed neural network, obtain a second loss function, and train the resulting neural network, as performing these processes on the neural network of claim 1 is a combination of prior art elements according to known methods to yield predictable results.
Regarding claim 15,
Claim 15 is a method directly corresponding to claim 6, and is rejected using the same reasoning.
Claim(s) 7 is/are rejected under 35 U.S.C. 103 as being unpatentable over Yang (US 20200302276 A1, 2020-09-24) in view of Kai (US 12505379 B2, 2025-12-23) further in view of Zhang Weihua (hereinafter Weihua) (KR 20200100558 A, 2020-08-26).
Regarding claim 7, Yang in view of Kai teaches;
The electronic device of claim 1, wherein the processor is further configured to:
(Using the same reasoning as in claim 1)
Yang and Kai fail to teach but Weihua teaches;
reduce a size of feature data extracted through the learning data by a predetermined size,
([pg. 19] In order to control the ratio of global features, the results of the pooling are each processed by a corresponding convolutional layer to reduce the size of feature maps 1 to 4 to obtain feature maps 1' to 4' of reduced dimensions.)
NOTE: Teaches reducing a size of feature data extracted through the learning data by a predetermined size (feature maps are the output of CNNs on their inputs, and the excerpt discloses reducing the size of feature maps to obtain feature maps of reduced dimensions).
insert a deconvolution layer into the neural network model, and train the neural network model in which the deconvolution layer is inserted.
([pg. 22] As can be seen from the above description, compared to the convolutional layer, the residual network of FIG. 17B, that is, the residual connection block, has a better data processing effect, and the amount of its parameters can be further reduced. In addition, since the convolution operation can reduce the size of the original input image, the network structure in this example adds a deconvolution layer to achieve the effect of upsampling, thereby reducing the size of the input image after convolution. Restored feature map, which improves the image processing effect and better meets the actual needs.)
NOTE: Teaches inserting a deconvolution layer into the neural network model (adds a deconvolution layer).
([pg. 18] Here, the object segmentation model is obtained by training a neural network.)
NOTE: Teaches training the neural network model in which the deconvolution layer is inserted.
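NOTE (illustrative only): The upsampling effect attributed to the deconvolution layer can be sketched with a one-dimensional transposed convolution. This is a minimal sketch and is not code from Weihua. The function name `transposed_conv1d`, the one-dimensional input, and the stride of 2 are assumptions chosen to keep the example short.

```python
def transposed_conv1d(x, kernel, stride=2):
    """Upsample a 1-D signal by scattering each input value through the kernel.

    The output length is (len(x) - 1) * stride + len(kernel), so the result
    is larger than the input whenever the stride is greater than 1.
    """
    out_len = (len(x) - 1) * stride + len(kernel)
    out = [0.0] * out_len
    for i, value in enumerate(x):
        for k, weight in enumerate(kernel):
            # Each input element contributes to a strided window of the output.
            out[i * stride + k] += value * weight
    return out


if __name__ == "__main__":
    reduced_features = [1.0, 2.0]  # hypothetical reduced feature data
    print(transposed_conv1d(reduced_features, kernel=[1.0, 1.0], stride=2))
```

In a trained network the kernel values would be learned weights; here they are fixed so that the size restoration is easy to see.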
OBVIOUSNESS TO COMBINE WEIHUA WITH YANG AND KAI:
Weihua is analogous art to Yang, Kai, and the present disclosure as they all pertain to neural networks. Specifically, Weihua pertains to an image processing method and device, using convolutional neural networks.
Additionally, Weihua states;
([pg. 22] As can be seen from the above description, compared to the convolutional layer, the residual network of FIG. 17B, that is, the residual connection block, has a better data processing effect, and the amount of its parameters can be further reduced. In addition, since the convolution operation can reduce the size of the original input image, the network structure in this example adds a deconvolution layer to achieve the effect of upsampling, thereby reducing the size of the input image after convolution. Restored feature map, which improves the image processing effect and better meets the actual needs.)
NOTE: This excerpt details that the benefit of the deconvolutional layer is to up sample or restore the feature map, thereby improving the processing of the output, and better satisfying practical needs of the model.
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to reduce the size of feature data and insert a deconvolution layer (as taught by Weihua) into the system of claim 1 to reconstruct reduced feature data and improve the processing of the model output.
Claim(s) 8 is/are rejected under 35 U.S.C. 103 as being unpatentable over Yang (US 20200302276 A1, 2020-09-24) in view of Kai (US 12505379 B2, 2025-12-23) further in view of Akio Hayakawa et al. (hereinafter Hayakawa) (“OUT-OF-CORE TRAINING FOR EXTREMELY LARGE-SCALE NEURAL NETWORKS WITH ADAPTIVE WINDOW-BASED SCHEDULING”, 2020-10-27).
Regarding claim 8,
Yang in view of Kai teaches;
The electronic device of claim 1, wherein the processor is further configured to:
(Using the same reasoning from claim 1)
Yang and Kai fail to teach but Hayakawa teaches;
set a window to include at least a fourth layer consecutively connected among the plurality of layers and data related to the fourth layer,
perform an operation by loading each layer and data included in the window, and following completion of the operation:
([pg. 1-2] One possible way to address the limitations on GPU memory size is “out-of-core execution”. This method utilizes CPU memory as a temporary cache for the GPU computation. Since neural networks, especially feed-forward networks, can be executed layer by layer sequentially, we can transfer data from GPU to CPU memory when the variables are not necessary at the current computation. In fact, the CPU memory size is much larger than GPU memory, e.g., larger than 1TB. Thus, using CPU memory as a cache for GPU memory, we can virtually extend the size of GPU memory, as if it has memory larger than 1TB. As a naive strategy to realize out-of-core execution, we can transfer memory between GPU and CPU before and after every layer execution. While this approach can execute the maximum size of model on limited memory budget, this approach puts GPU computation on hold at every layer until the end of corresponding memory transfers. On the other hand, if we place too many variables on GPU to accelerate computation, we can execute only models with limited size. Therefore, it is necessary to find a better memory transfer algorithm that enables execution of larger models without sacrificing computational time.)
[pg. 4]
NOTE: Teaches setting a window to include at least a fourth layer (f_i) consecutively connected among the plurality of layers, and performing an operation by loading each layer and data included in the window (insert variables corresponding to the layers within the window into the swap-in queue).
slide the window by a preset unit relative to the plurality of layers, such that the fourth layer is newly excluded from the window and a fifth layer is newly included in the window,
unload the fourth layer and data related solely to the fourth layer, and load the fifth layer and data related to the fifth layer.
[pg. 4]
NOTE: Teaches following completion of the aforementioned operation, sliding the window by a preset unit relative to the plurality of layers (slide window to f_i+1), such that the fourth layer is newly excluded from the window (f_i excluded) and a fifth layer is newly included in the window (f_i+4 is included), and unload the fourth layer and data related solely to the fourth layer (move variables of f_i from swap-in to swap-out queue), and load the fifth layer and data related to the fifth layer (as shown in step (a) above, the newly included layer f_i+4 and the corresponding data will be inserted into the swap-in queue).
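NOTE (illustrative only): The window behavior mapped above (f_i leaves the window, f_i+4 enters it) can be sketched as a simple sliding-window schedule over layer indices. This is a minimal sketch and is not Hayakawa's scheduling algorithm. The function name `slide_window`, the window size of 4, and the representation of layers as integers are assumptions, and the swap-in and swap-out queues are reduced to plain lists of loaded and unloaded layers.

```python
from collections import deque


def slide_window(num_layers, window_size, step=1):
    """Yield (window, unloaded, loaded) for each window position.

    window:   layer indices currently resident (for example, in GPU memory)
    unloaded: layers newly excluded from the window at this position
    loaded:   layers newly included in the window at this position
    """
    window = deque(range(min(window_size, num_layers)))
    # Initial position: every layer in the window must be loaded.
    yield list(window), [], list(window)
    next_layer = len(window)
    while next_layer < num_layers:
        unloaded, loaded = [], []
        for _ in range(step):
            if next_layer >= num_layers:
                break
            unloaded.append(window.popleft())  # oldest layer leaves the window
            window.append(next_layer)          # next consecutive layer enters
            loaded.append(next_layer)
            next_layer += 1
        yield list(window), unloaded, loaded


if __name__ == "__main__":
    for window, unloaded, loaded in slide_window(num_layers=6, window_size=4):
        print("window:", window, "unload:", unloaded, "load:", loaded)
```

With a window of four layers and a step of one, each slide unloads exactly one layer and loads exactly one layer, which mirrors the fourth-layer and fifth-layer language of the claim.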
OBVIOUSNESS TO COMBINE HAYAKAWA WITH YANG AND KAI:
Hayakawa is analogous art to Yang, Kai, and the present disclosure as they all pertain to neural networks. Specifically, Hayakawa pertains to an out of core algorithm to enable faster training of large-scale neural networks with sizes larger than allotted GPU memory.
Additionally, Hayakawa states;
([pg. 1] One possible way to address the limitations on GPU memory size is “out-of-core execution”. This method utilizes CPU memory as a temporary cache for the GPU computation. Since neural networks, especially feed-forward networks, can be executed layer by layer sequentially, we can transfer data from GPU to CPU memory when the variables are not necessary at the current computation. In fact, the CPU memory size is much larger than GPU memory, e.g., larger than 1TB. Thus, using CPU memory as a cache for GPU memory, we can virtually extend the size of GPU memory, as if it has memory larger than 1TB.)
NOTE: Teaches that the out-of-core methods proposed by Hayakawa allow for virtually extending the size of GPU memory of the system by a large amount.
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to use the methods of Hayakawa within the system of the present disclosure to allow processing of large neural networks which require a large amount of memory.
Claim(s) 9 is/are rejected under 35 U.S.C. 103 as being unpatentable over Yang (US 20200302276 A1, 2020-09-24) in view of Kai (US 12505379 B2, 2025-12-23) further in view of Sameer D. Hyderabad Manikfan (hereinafter Manikfan) (US 20190319477 A1, 2019-10-17).
Regarding claim 9,
Yang in view of Kai teaches;
The electronic device of claim 1,
(Using the same reasoning from claim 1)
Yang and Kai fail to teach but Manikfan teaches;
wherein the processor is further configured to: fix a layer, other than an output layer, among the plurality of layers,
([0052] The echo state networks 445 and 420 are neural networks with memory. The weights of internal and input layers are fixed and initialized at random. Only the weights of the output neurons are updated and hence these networks are best for realization in an FPGA taking advantage of fast computation of weights only at the output layer and not at the internal and input layers.)
NOTE: Teaches fixing every layer of the neural network except the output layer.
update a weight of the output layer based on the learning data, and train the neural network model including the trained output layer.
([0086] In operation 845, the ESC system 400 determines if the training is completed when the control error is within a threshold in operation 840. If the training has not completed, the training data set is read again in operation 815.)
NOTE: This excerpt indicates that training is performed using learning data (training data set used in training operation).
([0066] The weights 623, 643 of output layers 626, 646 are trained using the echo state network learning algorithm. Having to train only the weights 622, 642 of output layers 626, 646 results in faster convergence as compared to conventional recurrent neural networks.)
NOTE: Teaches updating a weight of the output layer (train only weights of output layers) based on the learning data (the aforementioned learning data used in the training process), and train the neural network model including the trained output layer (training the output layers of the neural network is additionally considered training the neural network).
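NOTE (illustrative only): Training only the output layer while the other layers stay fixed can be sketched with a two-weight model. This is a minimal sketch and is not the echo state network learning algorithm of Manikfan. The function name `train_output_only`, the scalar weights, the squared-error loss, and the gradient-descent update are all assumptions made for illustration.

```python
def train_output_only(hidden_w, output_w, samples, lr=0.1, epochs=50):
    """Fit output_w by gradient descent while hidden_w is never modified.

    samples: iterable of (input, label) pairs (hypothetical learning data)
    """
    for _ in range(epochs):
        for x, y in samples:
            hidden = hidden_w * x                     # fixed layer, read only
            prediction = output_w * hidden            # output layer
            # Gradient of (prediction - y) ** 2 with respect to output_w.
            gradient = 2.0 * (prediction - y) * hidden
            output_w -= lr * gradient                 # only this weight changes
    return hidden_w, output_w


if __name__ == "__main__":
    fixed, trained = train_output_only(hidden_w=2.0, output_w=0.0,
                                       samples=[(1.0, 4.0)])
    print("fixed layer weight:", fixed)
    print("trained output weight:", trained)
```

Because the fixed weight is never written, the only parameter that is fit to the learning data is the output weight, which is the division of labor the cited paragraphs describe.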
OBVIOUSNESS TO COMBINE MANIKFAN WITH YANG AND KAI:
Manikfan is analogous art to Yang and Kai as it pertains to methods utilizing machine learning processes. Specifically, Manikfan pertains to an energy storage controller utilizing a neural network to model the system.
Additionally, Manikfan states;
([0066] The weights 623, 643 of output layers 626, 646 are trained using the echo state network learning algorithm. Having to train only the weights 622, 642 of output layers 626, 646 results in faster convergence as compared to conventional recurrent neural networks.)
NOTE: This excerpt indicates that fixing the layers other than the output layers, while leaving the output layers unfixed, allows for faster convergence of the neural network, which would allow for less training time, lower compute costs, and lower energy use.
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to fix layers other than the output layer as taught by Manikfan, in order to improve the overall efficiency of the neural network.
CONCLUSION
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Matthew Alan Cady whose telephone number is (571) 272-7229. The examiner can normally be reached Monday - Friday, 7:30 am - 5:00 pm ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Cesar Paula can be reached on (571)272-4128. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/MATTHEW ALAN CADY/ Examiner, Art Unit 2145
/CESAR B PAULA/ Supervisory Patent Examiner, Art Unit 2145