DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 2/6/26 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Claim Objections
In view of the amendment to claim 13, the objection has been removed.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
XU, WU, and CHOUKROUN
Claims 1, 5-17, and 22-25 are rejected under 35 U.S.C. 103 as being unpatentable over US 20220067527 A1, referenced herein as XU, and “Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation” by Hao Wu et al., as included with the Non-Final Office Action mailed September 25, 2024, referenced herein as WU, and CN 115280375 A, referenced herein as CHOUKROUN.
(Please see the attached copy of CHOUKROUN that numbers paragraphs in the same format as that used in this Action).
Claim 1
XU teaches “A method, comprising: generating an intermediate trained neural network by at least changing precision of one or more first weights and one or more first activation values of a portion of a neural network” (XU [0030]: “During the same training iteration (e.g., the forward and backward pass of training iteration) quantization of weights, gradients, and activations may be performed”; Examiner’s Note (EN): The examiner notes that, as written and in light of the instant specification, the BRI of a change in precision encompasses quantization, which changes the encoding width, or precision, of a value in a computational setting. Paragraph [0063] of the instant specification states “The first trained model 110 may also be referred to herein as an intermediate trained model, a QAT trained model, QAT quantized model, or a QAT model”, but does not appear to explicitly define a first trained model or a model. As such, as written and in light of the instant specification, a person of reasonable skill in the art will appreciate that training a machine learning model generates at least one trained model. XU explicitly discloses an iterative training process in which each iteration of training generates a set of parameters that define a model, and “these updated weights may then be used in a subsequent training iteration” (XU [0042]). Thus, by disclosing a method for quantization during an iteration of training, XU teaches changing the precision of weight and activation values of a portion of a neural network, thereby generating an intermediate trained neural network; see also Fig. 6 & [0054-0055], where the neural network is trained by determining activations for each layer based on sparsified-quantized weights and quantizing the activation values); “generating a final trained neural network by at least: recalibrating one or more range/scale factors for the one or more first activation values to produce one or more second activation values; changing precision of the one or more weights of the intermediate trained neural network to produce one or more second weights; and generating the final trained neural network using the one or more second weights and the one or more second activation values” (see Fig. 6 & [0054-0055], where in 635, if it is not the last layer in the set of neural network layers, the steps to sparsify weights in a layer of the NN (615), quantize the sparsified weights (620), determine activations for the layer based on the sparsified-quantized weights (625), and quantize the activation values (630) are repeated, and if it is the last layer (635) and the last training iteration (640), a compressed version of the NN model incorporating the sparsified-quantized weights is generated).
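As a purely illustrative sketch (not part of XU’s disclosure, the instant specification, or the claims), the precision change that the examiner maps to quantization can be shown with symmetric int8 quantization of a weight tensor; the function names and example values below are hypothetical:

```python
import numpy as np

def quantize_int8(x):
    # Symmetric int8 quantization: represent float values as 8-bit
    # integers plus one float scale factor, reducing the encoding width
    # (precision) of each value.
    max_abs = np.max(np.abs(x))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover approximate float values from the int8 encoding.
    return q.astype(np.float32) * scale

weights = np.array([0.5, -1.27, 0.03, 1.0], dtype=np.float32)
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)  # close to the original, within scale/2
```

The dequantized values differ from the originals by at most half the scale factor, which is the sense in which quantization trades precision for a narrower bit width.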
XU does not appear to explicitly teach “wherein the portion of the neural network comprises a subset of layers of the neural network that excludes at least one layer of the neural network”.
However, in the same field, analogous art WU provides this additional functionality by teaching “wherein the portion of the neural network comprises a subset of layers of the neural network that excludes at least one layer of the neural network” (page 2, section 2, paragraph 2, WU: “In many cases, the first and last layer were not quantized, or quantized to a higher bit-width, as they are more sensitive to quantization [59], [45], [18]”; (EN): Examiner notes that excluding at least one layer encompasses excluding the first layer and excluding the last layer).
XU and WU are analogous art because they are from the same field of endeavor as the claimed invention, namely optimization of machine learning. XU teaches a method for quantizing a model to generate a trained machine learning model. WU provides the additional functionality by disclosing a method for quantization which excludes the final layer or quantizes the final layer to a different bit width. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have improved upon the machine learning system of XU with WU’s method of layer-aware quantization in order to “reduce the size of Deep Neural Networks and improve inference latency and throughput by taking advantage of high throughput integer instructions” (page 1, ABSTRACT, WU), as suggested by WU.
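The layer selection WU describes (leaving the first and last layers unquantized, or at a higher bit-width) can be illustrated with a short hypothetical sketch; the layer names and helper function are assumptions for illustration only, not WU’s code:

```python
def layers_to_quantize(layer_names):
    # Per WU's observation, skip the first and last layers (more
    # sensitive to quantization); quantize only the interior layers.
    if len(layer_names) <= 2:
        return []
    return layer_names[1:-1]

layers = ["stem_conv", "block1", "block2", "fc_head"]
selected = layers_to_quantize(layers)  # "stem_conv" and "fc_head" excluded
```

The returned subset of layers excludes at least one layer of the network, matching the scope of the limitation as mapped above.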
XU and WU do not appear to explicitly show the precision of the one or more first activation values being changed based at least in part on corresponding activation values computed over multiple training iterations. However, in the same field, analogous art CHOUKROUN provides this additional functionality by teaching “the precision of the one or more first activation values being changed based at least in part on corresponding activation values computed over multiple training iterations” (paragraphs 84, 121-122, and 194 show that computing the activation values over a plurality of training iterations increases the precision of the activation values).
XU and CHOUKROUN are analogous art because they are from the same field of endeavor as the claimed invention, namely quantization of a neural network in which the precision of weights and activation values is changed. XU teaches a method for quantizing a model to generate a trained machine learning model. CHOUKROUN provides the additional functionality by disclosing a method for quantization in which the precision of the one or more first activation values is changed based at least in part on corresponding activation values computed over multiple training iterations. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have improved upon the machine learning system of XU and WU with this feature of CHOUKROUN, because it would provide an efficient way by which to increase the precision of the activation values while training a neural network, as suggested by CHOUKROUN.
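The notion of activation statistics gathered across multiple training iterations can be sketched, purely for illustration, as an exponential-moving-average range observer; the class, its momentum value, and the sample batches are hypothetical and are not drawn from CHOUKROUN:

```python
class EmaRangeObserver:
    # Track the activation range across training iterations with an
    # exponential moving average, then derive an int8 scale factor
    # from the accumulated statistics.
    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.max_abs = None

    def update(self, activations):
        batch_max = max(abs(a) for a in activations)
        if self.max_abs is None:
            self.max_abs = batch_max
        else:
            self.max_abs = (self.momentum * self.max_abs
                            + (1.0 - self.momentum) * batch_max)
        return self.max_abs

    def scale(self, bits=8):
        # Largest representable magnitude for signed `bits`-wide integers.
        return self.max_abs / (2 ** (bits - 1) - 1)

obs = EmaRangeObserver()
for batch in ([0.5, -1.0], [2.0, 0.1], [1.5, -0.3]):
    obs.update(batch)
```

Because the range estimate blends observations from several iterations, the resulting scale factor reflects the activation distribution rather than any single batch.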
Claim 5
The combination of XU, WU, and CHOUKROUN teaches “The method of claim 1”, as discussed above.
XU further teaches “wherein the intermediate trained neural network is a mixed-precision deep neural network (DNN)” ([0054], XU: “FIG. 6 is a simplified flow diagram 600 illustrating an example technique for generating a compressed version of a neural network model (e.g., a convolutional neural network or another deep neural network)”; (EN): Paragraph [0095] of the instant specification states “In the first step, QAT training (as mentioned as a part of process 502) is performed by adding quantization operations to all weights and activations of the network except for the last layer. This results in a mixed-precision DNN model”, but does not appear to explicitly define a mixed-precision DNN model. As such, as written and in light of the instant specification, the BRI of a mixed-precision DNN model includes a DNN model which contains at least two values with different precision levels, otherwise known as bit width encodings in the computational space. By teaching a method for training a model which retains “the full-precision weights of a neural network (e.g., in full-precision weight data 245) along with the new compressed sparse-quantized weight data 250 that is determined through the compression performed (during training)” ([0031], XU), XU teaches a mixed-precision model).
Claim 6
The combination of XU, WU, and CHOUKROUN teaches “The method of claim 1”, as discussed above.
XU further teaches “wherein the final trained neural network comprises 8-bit integer numbers representing the one or more first weights and the one or more first activation values” ([0024], XU: “utilized to reduce the number of bits required to represent each weight (or activation value)… may be used to quantize weights to 8-bit or higher precision from a pre-trained full precision network with and without fine-tuning”; (EN): With reference to joint optimization methods, XU discloses that “such an approach may realize 17x compression for ResNet50 where both weights and activations are quantized” ([0025], XU)).
Claim 7
The combination of XU, WU, and CHOUKROUN teaches “The method of claim 1”, as discussed above.
WU further teaches “wherein the at least one layer of the neural network includes weights and activation values having finer granularity in representation than 8-bit quantization” (page 2, section 2, paragraph 2, WU: “In many cases, the first and last layer were not quantized, or quantized to a higher bit-width, as they are more sensitive to quantization [59], [45], [18]”; (EN): Paragraph [0070] of the instant specification states “QAT 108 may be applied by quantizing all weights and activations of the neural network except for layers that require finer granularity in representation than the 8-bit quantization can provide (e.g., regression layers)”, but does not appear to explicitly define a finer granularity. As such, as written and in light of the instant specification, the BRI of having a finer granularity in representation encompasses greater precision and a wider bit encoding. A layer being unquantized, or quantized to a higher bit-width, is reasonably understood to be encompassed by a layer including weights and activations which have a finer granularity in representation than 8-bit).
Claim 8
The combination of XU, WU, and CHOUKROUN teaches “The method of claim 1”, as discussed above.
XU further teaches “further comprising using the final trained neural network to perform object detection or image classification” ([0019], XU: “devices may utilize neural network models in connection with detecting persons, animals, or objects within their respective environments and/or conditions, characteristics, and events within these environments based on sensor data”).
Claim 9
The combination of XU, WU, and CHOUKROUN teaches “The method of claim 1”, as discussed above.
XU further teaches “wherein changing precision of the one or more first weights to generate the intermediate trained neural network is performed by applying quantization aware training (QAT)” ([0024], XU: “Training with quantization for low-precision networks may be used to train CNNs that have low-precision weights and activations using low-precision parameter gradients”; (EN): With reference to FIG. 2, paragraphs [0072]-[0086] of the instant specification discuss embodiments and non-limiting examples of QAT, including paragraph [0077] of the instant specification which outlines “example pseudocode to perform as such”, but does not appear to explicitly define QAT. As such, as written and in light of the instant specification, the BRI of QAT encompasses training with quantization).
Although the combination of this embodiment of XU and WU substantially discloses the claimed invention, it does not appear to explicitly teach “and wherein changing precision of the one or more weights and the one or more activation values to generate the second trained model is performed by applying post-training quantization (PTQ)”.
However, in the same field, the analogous art found in XU’s following second embodiment provides this additional functionality by teaching “and changing precision of the one or more first weights to produce the one or more second weights is performed by applying post-training quantization (PTQ)” ([0024], XU: “Network quantization is another popular compression technique, which is utilized to reduce the number of bits required to represent each weight (or activation value) in a network… For instance, post-training quantization may be used to quantize weights to 8-bit or higher precision from a pre-trained full-precision network with and without fine-tuning”; (EN): Paragraph [0071] states “In at least one embodiment, GPU 116 then applies PTQ 112 on the first trained model 110. In this second step, as the quantization of activations prescribed by QAT 108 are ignored, PTQ 112 is then performed on the first trained model 110. This way, the first trained model’s 110 weights are quantized again, and range/scale factors for activations are calibrated again using the PTQ 112 process. While the activations were quantized using the running-statistics of their distribution during QAT training, the activations may be re-quantized by calculating their statistics again against the calibration dataset which is expected during deployment. By applying PTQ 112 on a QAT quantized model 110, all layers of the neural network can be quantized”, but does not appear to explicitly define PTQ or the process through which PTQ is performed).
The embodiments of XU are analogous art because they are from the same field of endeavor as the claimed invention, namely optimization of machine learning. XU teaches multiple methods for quantizing a model to generate a trained machine learning model, but the first embodiment does not appear to explicitly disclose utilizing PTQ to generate the second trained model. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have improved upon the machine learning system of XU with XU’s associated methods of optimization because “compression techniques may be combined to enhance the degree of compression applied to the neural network model” ([0024], XU), as suggested by XU.
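The two-step flow described in the specification’s paragraph [0071] — re-quantizing trained weights and recalibrating activation scale factors against a calibration dataset — can be sketched as follows; this is an illustrative reading with hypothetical names and data, not the applicant’s or XU’s implementation:

```python
import numpy as np

def ptq_pass(weights_by_layer, calibration_activations, bits=8):
    # Re-quantize trained weights per layer, and recompute the activation
    # scale from statistics gathered over a calibration dataset (min/max
    # calibration here) rather than from training-time running statistics.
    qmax = 2 ** (bits - 1) - 1
    quantized = {}
    for name, w in weights_by_layer.items():
        w_scale = np.max(np.abs(w)) / qmax
        quantized[name] = (np.round(w / w_scale).astype(np.int8), w_scale)
    samples = np.concatenate([np.ravel(a) for a in calibration_activations])
    act_scale = np.max(np.abs(samples)) / qmax
    return quantized, act_scale

weights = {"fc": np.array([1.27, -0.5], dtype=np.float32)}
calib = [np.array([2.54, -1.0], dtype=np.float32)]
quantized, act_scale = ptq_pass(weights, calib)
```

The weight scale comes from the weights themselves, while the activation scale is derived from the calibration samples expected at deployment, mirroring the recalibration step quoted above.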
Claim 10
Regarding claim 10, the claim recites similar limitation as corresponding claim 1 and is rejected for similar reasons as claim 1 using similar teachings and rationale.
XU also teaches “One or more processors, comprising circuitry to:” ([0034], XU: “In one example, machine learning system 260 may include one or more processor devices (e.g., 265) adapted for performing computations and functions to implement machine learning models and inference. For instance, machine learning processors (e.g., 268) may include graphics processing units (GPUs)…”; (EN): With reference to FIG. 16, paragraph [0307] of the instant specification states “In at least one embodiment, integrated circuit 1600 includes one or more application processor(s) 1605 (e.g., CPUs), at least one graphics processor 1610”, but does not appear to explicitly define a circuit or an integrated circuit. As such, as written and in light of the instant specification, the BRI of one or more circuits includes a graphics processor, which encompasses a graphics processing unit, or GPU).
Claim 11
The combination of XU, WU, and CHOUKROUN teaches “The processor of claim 10”, as discussed above.
XU further teaches “wherein the circuitry is to change precision of the one or more first weights and the one or more first activation values using quantization aware training (QAT)”, ([0024], XU: “Training with quantization for low-precision networks may be used to train CNNs that have low-precision weights and activations using low-precision parameter gradients”; (EN): With reference to FIG. 2, paragraphs [0072]-[0086] of the instant specification discuss embodiments and non-limiting examples of QAT, including paragraph [0077] of the instant specification which outlines “example pseudocode to perform as such”, but does not appear to explicitly define QAT. As such, as written and in light of the instant specification, the BRI of QAT encompasses training with quantization).
WU further teaches “the QAT comprises quantizing the one or more first weights and the one or more first activation values for all layers of the neural network except for a last layer of the neural network” (page 2, section 2, paragraph 2, WU: “Much of the earlier research in this area focused on very low bit quantization [7], [13], [59]… and activations [45], [18]… In many cases, the first and last layer were not quantized, or quantized to a higher bit-width, as they are more sensitive to quantization [59], [45], [18]”).
Claim 12
The combination of XU, WU, and CHOUKROUN teaches “The processor of claim 11”, as described above.
XU further teaches “wherein: the circuitry is further to change precision of the one or more first weights and recalibrate the one or more range/scale factors for the one or more first activation values of the intermediate trained neural network using post-training quantization (PTQ)” ([0024], XU: “For instance, post-training quantization may be used to quantize weights to 8-bit or higher precision from a pre-trained full-precision network with and without fine-tuning”; (EN): Paragraph [0071] states “In at least one embodiment, GPU 116 then applies PTQ 112 on the first trained model 110. In this second step, as the quantization of activations prescribed by QAT 108 are ignored, PTQ 112 is then performed on the first trained model 110. This way, the first trained model’s 110 weights are quantized again, and range/scale factors for activations are calibrated again using the PTQ 112 process. While the activations were quantized using the running-statistics of their distribution during QAT training, the activations may be re-quantized by calculating their statistics again against the calibration dataset which is expected during deployment. By applying PTQ 112 on a QAT quantized model 110, all layers of the neural network can be quantized”, but does not appear to explicitly define PTQ or the process through which PTQ is performed. As such, as written and in light of the instant specification, by teaching quantizing weights from a pre-trained full-precision network, XU teaches applying PTQ).
XU further teaches “the PTQ comprises: re-quantizing the one or more first weights for all layers of the neural network in addition to the last layer of the neural network” ([0024], XU: “Network quantization is another popular compression technique, which is utilized to reduce the number of bits required to represent each weight (or activation value) in a network… For instance, post-training quantization may be used to quantize weights to 8-bit or higher precision from a pre-trained full-precision network with and without fine-tuning”; (EN): Paragraph [0071] states “In at least one embodiment, GPU 116 then applies PTQ 112 on the first trained model 110. In this second step, as the quantization of activations prescribed by QAT 108 are ignored, PTQ 112 is then performed on the first trained model 110. This way, the first trained model’s 110 weights are quantized again, and range/scale factors for activations are calibrated again using the PTQ 112 process. While the activations were quantized using the running-statistics of their distribution during QAT training, the activations may be re-quantized by calculating their statistics again against the calibration dataset which is expected during deployment. By applying PTQ 112 on a QAT quantized model 110, all layers of the neural network can be quantized”, but does not appear to explicitly define PTQ or the process through which PTQ is performed; see also XU Fig. 6 & [0054-0055]);
“recalibrating the one or more range/scale factors for the one or more first activation values for all layers of the neural network in addition to the last layer of the neural network” (see Fig. 6 & [0054-0055], where in 635, if it is not the last layer in the set of neural network layers, the steps to sparsify weights in a layer of the NN (615), quantize the sparsified weights (620), determine activations for the layer based on the sparsified-quantized weights (625), and quantize the activation values (630) are repeated, and if it is the last layer (635) and the last training iteration (640), a compressed version of the NN model incorporating the sparsified-quantized weights is generated).
Claim 13
The combination of XU, WU, and CHOUKROUN teaches “The one or more processors of claim 12”, as discussed above.
XU further teaches “wherein the PTQ re-quantizes on the one or more first weights and recalibrates the one or more range/scale factors for the one or more first activation values of the intermediate trained neural network by: ignoring the one or more first activation values of the intermediate trained neural network; and re-quantizing the one or more first weights and recalibrating the one or more second activation values” ([0040], XU: “Accordingly, a network compression engine 205 may generate a variety of different compressed versions (e.g., 305') of the same source neural network (e.g., 305), based on the particular combination of parameters input to the network compression engine 205, among other example features”; (EN): Examiner notes that, as written and in light of the instant specification, the BRI of this limitation is directed to quantizing the initial or previous weight and activation values).
XU further teaches “based, at least in part, on statistics against a calibration dataset” ([0044], XU: “the full-precision weights of a subject network are sparsified based on a layer-wise threshold that is computed from the statistics of the full-precision weights in each layer. The non-zero elements of the sparsified weights may then be quantized, for instance with a min-max uniform quantization function”; (EN): The statistics of the full-precision weights in each layer is encompassed by the BRI of statistics against a calibration dataset, as written and in light of the instant specification).
Claim 14
The combination of XU, WU, and CHOUKROUN teaches “The one or more processors of claim 13”, as discussed above.
WU further teaches “wherein recalibrating the one or more range/scale factors involves recalibrating at least one of: one or more ranges of the one or more first activation values, or one or more scale factors for the one or more first activation values” (page 2, section 2, paragraph 1, WU: “They also set the quantization range based on a percentile of activations sampled from the training set. Instead of using fixed ranges, Choi et al. [6] proposed PACT which learns the activation ranges during training”; (EN): With reference to FIG. 2, paragraphs [0081]-[0088] of the instant specification discuss examples of calibration, but the instant specification does not appear to explicitly define calibration or recalibration. Learning the activation ranges during training is encompassed by the BRI of recalibrating ranges of activation values, as written and in light of the instant specification).
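WU’s mention of setting the quantization range from a percentile of sampled activations can be illustrated with a short sketch; the function and toy data are hypothetical and are not WU’s code:

```python
import numpy as np

def percentile_range(activations, pct=99.0):
    # Set the clipping range from a percentile of |activations| rather
    # than the absolute max, so rare outliers do not inflate the scale.
    hi = float(np.percentile(np.abs(activations), pct))
    scale = hi / 127.0  # int8 scale factor for the chosen range
    return hi, scale

acts = np.linspace(0.0, 1.0, 1001)  # toy activation sample
hi, scale = percentile_range(acts, pct=99.0)
```

Recomputing `hi` on fresh samples is one concrete form of recalibrating a range, and dividing by the integer maximum is one concrete form of recalibrating a scale factor.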
Claim 15
The combination of XU, WU, and CHOUKROUN teaches “The one or more processors of claim 10”, as discussed above.
XU further teaches “wherein the one or more first weights and the one or more first activation values of the intermediate trained neural network are represented using fewer bits than: single-precision floating point representation, double-precision floating point representation, or half-precision floating point representation” ([0024], XU: “utilized to reduce the number of bits required to represent each weight (or activation value)… to 8-bit or higher precision from a full-precision network”; (EN): As written and in light of the instant specification, the BRI of half precision floating point representations includes representations encoded with a bit width of 16. Similarly, as written and in light of the instant specification, the BRI of double-precision includes an encoding with a bit width of 64 and the BRI of single-precision includes an encoding with a bit width of 32).
Claim 16
XU teaches “A system, comprising: one or more processors to:” ([0034], XU: “In one example, machine learning system 260 may include one or more processor devices (e.g., 265) adapted for performing computations and functions to implement machine learning models and inference. For instance, machine learning processors (e.g., 268) may include graphics processing units (GPUs)…”; (EN): As written and in light of the instant specification, the BRI of a processor includes a GPU).
XU further teaches “initiate a training of a machine-learning model with one or more parameters for training by causing the one or more processors to:” ([0039], XU: “train the neural network 305 (e.g., using training data 240, such a proprietary or open source training data set)”; (EN): Paragraph [0061] of the instant specification states “parameters (e.g., weights and activations)”, but does not appear to explicitly define parameters. XU discloses weights and activations for each model (e.g., “weight (or activation value) in a network”, [0024], XU)).
XU further teaches “quantize a subset of the one or more parameters of one or more layers of a neural network to produce a first set of quantized parameters” ([0024], XU: “Network quantization is another popular compression technique, which is utilized to reduce the number of bits required to represent each weight (or activation)”; (EN): As discussed above, as written and in light of the instant specification, the BRI of parameters of the layer(s) of the neural network includes the weight and activation values).
XU further teaches “generate a first trained machine-learning model using the first set of quantized parameters; re-quantize the first set of quantized parameters […] to produce a second set of quantized parameters; and generate a second trained machine learning model using the second set of quantized parameters” ([0020], XU: “The neural network models may be developed and trained on corresponding computing systems”; (EN): Paragraph [0063] of the instant specification states “The first trained model 110 may also be referred to herein as an intermediate trained model, a QAT trained model, QAT quantized model, or a QAT model”, but does not appear to explicitly define a first trained model or a model; see also Fig. 6 & [0054-0055] as outlined in claim 1 above; [0040], XU: “Accordingly, a network compression engine 205 may generate a variety of different compressed versions (e.g., 305') of the same source neural network (e.g., 305), based on the particular combination of parameters input to the network compression engine 205, among other example features”).
XU does not appear to explicitly teach “the one or more layers excluding at least one layer of the neural network”.
However, in the same field, analogous art WU provides this additional functionality by teaching “the one or more layers excluding at least one layer of the neural network” (page 2, section 2, paragraph 2, WU: “In many cases, the first and last layer were not quantized, or quantized to a higher bit-width, as they are more sensitive to quantization [59], [45], [18]”; (EN): Examiner notes that excluding at least one layer encompasses excluding the first layer and excluding the last layer).
XU does not appear to explicitly teach “and at least one additional parameter outside of the subset”.
Analogous art WU provides this additional functionality by teaching “and at least one additional parameter outside of the subset” (page 2, section 2, paragraph 2, WU: “In many cases, the first and last layer were not quantized, or quantized to a higher bit-width, as they are more sensitive to quantization [59], [45], [18]”; (EN): As discussed above, the BRI of parameters, as written and in light of the instant specification, encompasses the weight and activation values of a model, which includes the set of weight and activation values associated with each layer of the model).
XU and WU are analogous art because they are from the same field of endeavor as the claimed invention, namely optimization of machine learning. XU teaches a method for quantizing a model to generate a trained machine learning model, but does not appear to distinctly disclose changing the precision of the layer which was excluded from the subset of layers whose precision of activation and weight values was changed to generate the first trained model. WU provides the additional functionality by disclosing a method for quantization which excludes the final layer or quantizes the final layer to a different bit width. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have improved upon the machine learning system of XU with WU’s method of layer-aware quantization in order to “reduce the size of Deep Neural Networks and improve inference latency and throughput by taking advantage of high throughput integer instructions” (page 1, ABSTRACT, WU), as suggested by WU.
XU and WU do not appear to explicitly show the precision of the one or more first activation values being changed based at least in part on corresponding activation values computed over multiple training iterations. However, in the same field, analogous art CHOUKROUN provides this additional functionality by teaching “the precision of the one or more first activation values being changed based at least in part on corresponding activation values computed over multiple training iterations” (paragraphs 84, 121-122, and 194 show that computing the activation values over a plurality of training iterations increases the precision of the activation values).
XU and CHOUKROUN are analogous art because they are from the same field of endeavor as the claimed invention, namely quantization of a neural network in which the precision of weights and activation values is changed. XU teaches a method for quantizing a model to generate a trained machine learning model. CHOUKROUN provides the additional functionality by disclosing a method for quantization in which the precision of the one or more first activation values is changed based at least in part on corresponding activation values computed over multiple training iterations. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have improved upon the machine learning system of XU and WU with this feature of CHOUKROUN, because it would provide an efficient way by which to increase the precision of the activation values while training a neural network, as suggested by CHOUKROUN.
Claim 17
The combination of XU, WU, and CHOUKROUN teaches “The system of claim 16”, as discussed above.
XU further teaches “wherein the one or more parameters comprise one or more weights and one or more activation values calculated during training” ([0031], XU: “the new compressed sparse-quantized weight data 250 that is determined through the compression performed (during training) by network compression engine 205. In some implementations, during back-propagation, the loss function may be based on the sparse-quantized weights (and resulting activation values derived by convolving the sparse-quantized weights with the activation values of the preceding neural network layer)”).
Claim 22
The combination of XU, WU, and CHOUKROUN teaches “The system of claim 16”, as discussed above.
XU further teaches “wherein parameters of the second set of quantized parameters are represented using fewer bits than full precision floating point representation” ([0024], XU: “Network quantization is another popular compression technique, which is utilized to reduce the number of bits required to represent each weight (or activation value) in a network… to quantize weights to 8-bit or higher precision from a pre-trained full-precision network… that have low-precision weights and activations using low-precision parameter gradients”).
Claim 23
The combination of XU, WU, and CHOUKROUN teaches “The system of claim 16”, as discussed above.
XU further teaches “wherein the first trained machine-learning model is a mixed-precision deep neural network (DNN) model” ([0054], XU: “FIG. 6 is a simplified flow diagram 600 illustrating an example technique for generating a compressed version of a neural network model (e.g., a convolutional neural network or another deep neural network)”; (EN): Paragraph [0095] of the instant specification states “In the first step, QAT training (as mentioned as a part of process 502) is performed by adding quantization operations to all weights and activations of the network except for the last layer. This results in a mixed-precision DNN model”, but does not appear to explicitly define a mixed-precision DNN model. As such, as written and in light of the instant specification, the BRI of a mixed-precision DNN model includes a DNN model which contains at least two values with different precision levels, otherwise known as bit width encodings in the computational space. By teaching a method for training a model which retains “the full-precision weights of a neural network (e.g., in full-precision weight data 245) along with the new compressed sparse-quantized weight data 250 that is determined through the compression performed (during training)” ([0031], XU), XU teaches a mixed-precision model).
WU further teaches “and the second trained machine learning model is an 8-bit DNN model for implementation on one or more deep learning accelerators (DLAs)” (pages 1-2, section 1, paragraphs 1-2, WU: “It is becoming commonplace to train neural networks in 16-bit floating-point formats, either IEEE fp16 [35] or bfloat16 [57], supported by most DL accelerators… and a number of emerging accelerator designs also provide significant acceleration for int8 operations”).
Claim 24
The combination of XU, WU, and CHOUKROUN teaches “The system of claim 23”, as discussed above.
XU further teaches “wherein the one or more processors are further to: use the second trained machine learning model to perform, on the one or more DLAs: object detection, image classification, speech recognition, instance segmentation, or semantic segmentation” ([0019], XU: “devices may utilize neural network models in connection with detecting persons, animals, or objects within their respective environments and/or conditions, characteristics, and events within these environments based on sensor data”).
Claim 25
The combination of XU, WU, and CHOUKROUN teaches “The system of claim 16”, as discussed above.
XU further teaches “wherein the one or more processors are further to deploy the second trained machine learning model, over a network, to one or more computing devices to perform inferencing” ([0020], XU: “in some implementations, wireless network connections (e.g., facilitated by network access points and gateway devices (e.g., 115, 120)) may be utilized to transfer neural network models onto devices (e.g., 120, 125, 130, 135”; (EN): The examiner notes that XU discusses the utilization of these models for object detection (“devices may utilize neural network models in connection with detecting persons, animals, or objects within their respective environments and/or conditions, characteristics, and events within these environments based on sensor data”, [0019], XU), which, as written and in light of the instant specification, is encompassed by inferencing).
XU, WU, CHOUKROUN, and BANNER
Claims 2-4 and 18-21 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of XU and WU and CHOUKROUN, in view of “Scalable Methods for 8-bit Training of Neural Networks” by Ron Banner et al., as included with the Non-Final Office Action mailed September 25, 2024, referenced herein as BANNER.
Claim 2
The combination of XU, WU, and CHOUKROUN teaches “The method of claim 1”, as discussed above.
XU further teaches “wherein the precision of the one or more first weights and the one or more first activation values are changed while training the portion of the neural network during a forward pass by using an absolute maximum value of the one or more first weights” ([0044], XU: “such as shown in FIG. 4, in each forward pass of training… the sparsified weights may then be quantized, for instance with a min-max uniform quantization function (e.g., the minimum and maximum values of the non-zero weights)”; (EN): The examiner notes that using a maximum value encompasses cases which utilize the maximum value in addition to other value(s)). CHOUKROUN further teaches an average minimum activation value (para. 148).
Although the combination of XU and WU and CHOUKROUN substantially discloses the claimed invention, the combination of XU and WU and CHOUKROUN does not appear to explicitly teach “and a running average of absolute maximum values of the one or more first activation values over the training”.
However, in the same field, analogous art BANNER provides this additional functionality by teaching “and a running average of absolute maximum values of the one or more first activation values over the training” (page 10, APPENDIX, Section B: “Quantization methods”, BANNER: “to be the average of the absolute maximum and minimum values”; (EN): As discussed above, the examiner notes that using maximum values encompasses cases which utilize maximum values in addition to other value(s). The examiner notes that BANNER defines the clamping values utilized to compute a quantized output for a specific set of activation values with this average value, as discussed in (page 10, APPENDIX, Section B: “Quantization methods”, BANNER)).
XU and BANNER are analogous art because they are from the same field of endeavor as the claimed invention, namely optimizing machine learning models. The combination of XU, WU, and CHOUKROUN teaches a method for quantizing layers of machine learning models to generate trained machine learning models, but does not appear to distinctly disclose utilizing a running average of absolute maximum values of one or more activation values over the training. BANNER provides the additional functionality by disclosing a method for quantization which utilizes an average of the absolute maximum and minimum values for activation quantization. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have improved upon the machine learning system of the combination of XU, WU, and CHOUKROUN with BANNER’s method of activation quantization because, “[s]ince the activations can have a high dynamic range which can be aggressively clamped as shown by [28] we defined its clamping values to be the average of absolute maximum and minimum values of K chunks. This reduces the dynamic range variance and allows smaller quantization steps” (page 10, APPENDIX, Section B: “Quantization methods”, BANNER), and given this method “has significantly higher tolerance to quantization noise and improved computational complexity” (page 1, ABSTRACT, BANNER), as suggested by BANNER.
Claim 3
The combination of XU, WU, CHOUKROUN, and BANNER teaches “The method of claim 2”, as discussed above.
XU further teaches “further comprising calculating gradients for the one or more first weights and the one or more first activation values using a straight-through estimation (STE)” ([0045], XU: “the gradients calculation… may be approximated, for instance, with the straight-through estimator (STE) technique”; (EN): The examiner notes that XU discusses this technique with reference to “weights and activations” ([0045], XU)).
Claim 4
The combination of XU, WU, CHOUKROUN, and BANNER teaches “The method of claim 3”, as discussed above.
XU further teaches “further comprising updating the portion of the neural network, during the training, by using the gradients in a backward-propagation pass” ([0042], XU: “during backpropagation within the same training iteration, a loss function 450 may be applied to the quantized activation values 445 to generate gradient values 455… During the training iteration, the full precision weights of the particular layer (L) may be maintained and the gradient applied to the full precision weights (e.g., 410) to update these full precision weight values. These updated weights may then be used in a subsequent training iteration”).
Claim 18
The combination of XU, WU, and CHOUKROUN teaches “The system of claim 17”, as discussed above.
XU further teaches “wherein the one or more processors are further to: quantize the one or more parameters of the one or more layers of the neural network during a forward pass by using an absolute maximum value of the one or more weights” ([0044], XU: “such as shown in FIG. 4, in each forward pass of training… the sparsified weights may then be quantized, for instance with a min-max uniform quantization function (e.g., the minimum and maximum values of the non-zero weights)”; (EN): The examiner notes that using a maximum value encompasses cases which utilize the maximum value in addition to other value(s)). CHOUKROUN further teaches an average minimum activation value (para. 148).
Although the combination of XU and WU and CHOUKROUN substantially discloses the claimed invention, the combination of XU and WU and CHOUKROUN does not appear to explicitly teach “and a running average of absolute maximum values of the one or more activation values over the training”.
However, in the same field, analogous art BANNER provides this additional functionality by teaching “and a running average of absolute maximum values of the one or more activation values over the training” (page 10, APPENDIX, Section B: “Quantization methods”, BANNER: “to be the average of the absolute maximum and minimum values”; (EN): As discussed above, the examiner notes that using maximum values encompasses cases which utilize maximum values in addition to other value(s). The examiner notes that BANNER defines the clamping values utilized to compute a quantized output for a specific set of activation values with this average value, as discussed in (page 10, APPENDIX, Section B: “Quantization methods”, BANNER)).
XU and BANNER are analogous art because they are from the same field of endeavor as the claimed invention, namely optimizing machine learning models. The combination of XU, WU, and CHOUKROUN teaches a method for quantizing layers of machine learning models to generate trained machine learning models, but does not appear to distinctly disclose utilizing a running average of absolute maximum values of one or more activation values over the training. BANNER provides the additional functionality by disclosing a method for quantization which utilizes an average of the absolute maximum and minimum values for activation quantization. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have improved upon the machine learning system of the combination of XU, WU, and CHOUKROUN with BANNER’s method of activation quantization because, “[s]ince the activations can have a high dynamic range which can be aggressively clamped as shown by [28] we defined its clamping values to be the average of absolute maximum and minimum values of K chunks. This reduces the dynamic range variance and allows smaller quantization steps” (page 10, APPENDIX, Section B: “Quantization methods”, BANNER), and given this method “has significantly higher tolerance to quantization noise and improved computational complexity” (page 1, ABSTRACT, BANNER), as suggested by BANNER.
Claim 19
The combination of XU, WU, CHOUKROUN, and BANNER teaches “The system of claim 18”, as discussed above.
XU further teaches “wherein the one or more processors are further to calculate gradients for the one or more weights and the one or more activation values by determining the gradients based, at least in part, on a threshold function” ([0044], XU: “the full-precision weights of a subject network are sparsified based on a layer-wise threshold that is computed from the statistics of the full-precision weights in each layer”; (EN): XU specifies that the gradient values are “determined from a corresponding loss function” ([0030], XU) where “the loss function may be based on the sparse-quantized weights (and resulting activation values derived by convolving the sparse-quantized weights with activation values of the preceding neural network layer)” ([0031], XU)).
Claim 20
The combination of XU, WU, CHOUKROUN, and BANNER teaches “The system of claim 19”, as discussed above.
XU further teaches “wherein the one or more processors are further to update the neural network, during the training, by using the gradients in a backward-propagation pass” ([0042], XU: “during backpropagation within the same training iteration, a loss function 450 may be applied to the quantized activation values 445 to generate gradient values 455… During the training iteration, the full precision weights of the particular layer (L) may be maintained and the gradient applied to the full precision weights (e.g., 410) to update these full-precision weight values. These updated weights may then be used in a subsequent training iteration”).
Claim 21
The combination of XU, WU, CHOUKROUN, and BANNER teaches “The system of claim 19”, as discussed above.
XU further teaches “wherein the one or more weights or the one or more activation values have finer granularity in representation than 8-bit quantization” ([0040], XU: “a set of parameters (e.g., 310) may be defined and provided as inputs to the network compression engine 205 to specify operation of the network compression engine 205. For instance, such parameters 310 may include… a quantization level (e.g., 2-bit, 4-bit, or another quantization level)”; (EN): As discussed above, XU discloses methods “to quantize weights, gradients, and/or activation values” ([0028], XU)).
Response to Arguments
Applicant’s arguments filed 12/8/25 with regard to the 103 rejections have been fully considered but are not persuasive. Applicant argues that the cited prior art of the previous Action fails to teach the newly amended limitation of the independent claims, namely the precision of the one or more first activation values being changed based at least in part on corresponding activation values computed over multiple training iterations. However, CHOUKROUN is relied upon to teach this feature, as discussed above.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
a) “Iteration (machine learning)” by Radiopaedia.org demonstrates that a person of reasonable skill in the art will appreciate that each iteration of machine learning generates a set of parameters, where each set of parameters defines a machine learning model.
b) Brick (WO 2020131968 A1) shows effective initialization of quantization of a neural network in which a plurality of training iterations of the quantization forward transfer convolution method are applied to a model weight value set.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Steven Sax whose telephone number is (571)272-4072. The examiner can normally be reached Monday through Friday from 9am to 5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Usmaan Saeed, can be reached at 571-272-4046. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/STEVEN P SAX/Primary Examiner, Art Unit 2146