DETAILED ACTION
Remarks
Claims 1, 3-11 and 13-22 have been examined and rejected. This Office Action is responsive to the amendment filed on 10/21/2025, which has been entered in the above-identified application.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claims 1, 3-11 and 13-22 are presented for examination.
Response to Amendment
Applicant’s amendment filed on 10/21/2025 has been entered. Claims 1, 3, 10, 11, 13, 19 and 20 are amended. Claims 1, 3-11 and 13-22 are pending in the application.
Claim Objections
Claims 21 and 22 are objected to because of the following informalities:
Claim 21 [line 4] and claim 22 [line 5]: “optimize” should be “optimizes”
Appropriate corrections are required.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 3-8 and 10, 11, 13-22 are rejected under 35 U.S.C. 103 as being unpatentable over Benoit et al (“Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference”) hereafter Benoit, further in view of Ruihao et al (“Differentiable Soft Quantization: Bridging Full-Precision and Low-Bit Neural Networks”) hereafter Ruihao, and further in view of Nagel et al (US 20200302299 A1) hereafter Nagel.
Benoit was cited in the IDS filed on 03/22/2022.
Ruihao was cited in the IDS filed on 03/22/2022.
With respect to claim 1, Benoit teaches a computer-implemented method of training a neural network that comprises a plurality of computational blocks, the neural network having been previously configured for an inference task and including a set of real-valued weights, the method (the method involves integer-arithmetic-only inference of a convolution layer and training with simulated quantization of the convolution layer. The quantization scheme, quantized inference framework and quantized training framework may be applied to efficient classification and detection systems based on MobileNets, yielding significant improvements in latency-vs-accuracy tradeoffs in ImageNet classification, object detection and other inference tasks [Figure 1.1. Integer-arithmetic-only quantization]), comprising:
performing a plurality of training iterations, each training iteration comprising:
for each computational block, applying a respective quantization function to the set of respective real-valued weights of the computational block to generate a respective set of quantized weights that are scaled based on a respective scaling factor to fall within a respective quantization range that is symmetrically centered at zero and comprises a defined number of uniform quantization levels corresponding to integer multiples of the respective scaling factor (the training includes floating-point training followed by quantization of the resulting weights. Weights are quantized before they are convolved with the input. For each layer, quantization is parameterized by the number of quantization levels and the quantization range, where r is the real-valued weight, [a;b] is the quantization range, and n is the number of quantization levels. Weight and activation ranges are treated differently: the quantized weights range over [-127, 127], which is centered at 0. The learned quantization parameters map to the scale S and zero-point Z through S = s(a,b,n) and Z = z(a,b,n). Scaling factors such as the quantization scales S1, S2 and S3 are taken into account when computing the quantized weights [pages 4, 5; 2.4. Implementation of a typical fused layer, 3. Training with simulated quantization]);
for each computational block, computing a set of respective output activations for the computational block based on a respective set of input activations and the respective set of quantized weights (computing the quantized bias-addition includes computing the product of the weights and the input activations based on the quantization scales. The output activations are computed based on the quantized weights and the input activations, wherein the three steps in the training process include scaling down to the final scale used by the 8-bit output activations, casting down to uint8 and applying the activation function to yield the final 8-bit output activation [page 4, 2.4. Implementation of a typical fused layer]); and
when performing the plurality of training iterations, incrementally reducing a smoothness of the respective quantization functions applied by the computational blocks for multiple training iterations of the plurality of training iterations (for activations, ranges depend on the input. To estimate the ranges, we collect [a;b] ranges seen on activations during training and then aggregate them via exponential moving averages with the smoothing parameter being close to 1, so that observed ranges are smoothed across thousands of training steps. This concept may be applied to the quantization function for multiple training steps [page 5, 3.1. Learning quantization ranges]); and
storing, for each of the computational blocks, a quantized weights version of the adjusted set of respective real-valued weights that optimizes performance of the neural network on the inference task, at a completion of the plurality of training iterations (one approach of simulated quantization is to train in floating point and then quantize the resulting weights, sometimes with additional post-quantization training for fine-tuning. Failure modes of this approach include large differences in the ranges of weights for different output channels, and outlier weight values that make all remaining weights less precise after quantization. The forward propagation pass simulates quantized inference as it will happen in the inference engine by implementing in floating-point arithmetic the rounding behavior of the quantization scheme. The system of Benoit represents the preparation (simulation) and the finalization (storage) of a model specifically designed for high-performance inference tasks. The “simulated quantized inference” during training is the mechanism used to achieve the “optimized performance” of the final “quantized weights” [page 5, 3. Training with simulated quantization]).
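For illustration only, the weight quantization limitation mapped above (a range symmetric about zero whose levels are integer multiples of a scaling factor, with 8-bit weights in [-127, 127]) may be sketched as follows. The function name `quantize_symmetric`, the sample weights, and the choice of deriving the scale from the largest absolute weight are assumptions made for this sketch; they are not taken from Benoit or from the application.

```python
def quantize_symmetric(weights, num_bits=8):
    """Quantize real-valued weights onto a grid that is symmetric about zero.

    Every quantized value is an integer multiple of `scale`, and the integer
    codes are clamped to [-q_max, q_max] (for 8 bits: [-127, 127]).
    """
    q_max = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    # Assumption: the scale is chosen so the largest weight maps to q_max.
    scale = max_abs / q_max if max_abs > 0 else 1.0
    codes = [max(-q_max, min(q_max, round(w / scale))) for w in weights]
    dequantized = [c * scale for c in codes]
    return codes, scale, dequantized


weights = [-0.50, -0.12, 0.0, 0.31, 1.27]
codes, scale, dequantized = quantize_symmetric(weights)
print(codes)
```

The integer codes are what a deployed model would store; multiplying a code by the scale recovers the approximate real value.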
However, Benoit does not disclose computing a cost for the training iteration based on the respective output activations of the computational blocks and relative alignments of the respective quantized weights of the computational blocks with the uniform quantization levels of the respective quantization ranges, wherein computing the cost comprises applying a scaling factor regularization function to output regularization cost values based on the respective quantized weights and the respective scaling factors, the scaling factor regularization function being configured to generate a regularization cost value that decreases the closer that the respective quantized weights each align with one of the uniform quantization levels; and for each computational block, adjusting the set of respective real-valued weights and the respective scaling factor with an objective of reducing the computed cost in one or more following training iterations.
In the same field of endeavor, Ruihao teaches computing a cost for the training iteration based on the respective output activations of the computational blocks and relative alignments of the respective quantized weights of the computational blocks with the uniform quantization levels of the respective quantization ranges (the differentiable soft quantization (DSQ) function is proposed to optimize the DSQ training and network parameters. Algorithm 1 of the training includes input activation a, parameters weight w and similarity factor, and output activation o. The computational costs of SADDW and MLA are quite different: SADDW incurs more computational cost than MLA. The DSQ function is used to approximately model the uniform quantizer, wherein k is the coefficient that determines the shape of the function, and k is aligned with the quantization levels [page 3, 3.2. Quantization Function; page 5, 3.5. Training and Deploying and Figure 4]); wherein computing the cost comprises applying a scaling factor regularization function to output regularization cost values based on the respective quantized weights and the respective scaling factors, the scaling factor regularization function being configured to generate a regularization cost value that decreases the closer that the respective quantized weights each align with one of the uniform quantization levels (the differentiable soft quantization (DSQ) function is proposed to optimize the DSQ training and network parameters. Algorithm 1 of the training includes input activation a, parameters weight w and similarity factor, and output activation o. The computational costs of SADDW and MLA are quite different: SADDW incurs more computational cost than MLA. The DSQ function is used to approximately model the uniform quantizer, wherein k is the coefficient that determines the shape of the function, and k is aligned with the quantization levels [page 3, 3.2. Quantization Function; page 5, 3.5. Training and Deploying and Figure 4]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have incorporated the concept of applying differentiable soft quantization to bridge the gap between full-precision and low-bit networks as suggested by Ruihao into the concept of applying a quantization scheme using integer-only arithmetic as suggested by Benoit because both systems address the process of applying quantization to the weights and the input activations to obtain the output activations based on the scaling factor. Doing so would be desirable because the concept of Benoit would be more efficient, reducing computational cost and memory use by employing a more promising network compression solution such as DSQ to reduce the network storage and accelerate the inference speed using different types of quantizers (Ruihao, [page 1, 1. Introduction]).
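For illustration only, the behavior recited for the scaling factor regularization function (a cost value that decreases the closer each weight aligns with one of the uniform quantization levels, i.e. an integer multiple of the scaling factor) may be sketched as follows. The function name `alignment_cost` and the sine-squared form are hypothetical choices for this sketch; they are not taken from the application, Benoit, Ruihao or Nagel, and other functions with the same property are possible.

```python
import math


def alignment_cost(weights, scale):
    """Hypothetical regularization cost.

    Each term is zero when a weight is an exact integer multiple of `scale`
    and rises to one when the weight sits halfway between two levels.
    """
    return sum(math.sin(math.pi * w / scale) ** 2 for w in weights) / len(weights)


on_grid = alignment_cost([0.0, 0.5, -1.0], scale=0.5)   # all weights on a level
near_grid = alignment_cost([0.45], scale=0.5)           # close to the level 0.5
off_grid = alignment_cost([0.30], scale=0.5)            # further from any level
print(on_grid, near_grid, off_grid)
```

Because the cost depends on both the weights and the scale, a gradient-based optimizer could in principle adjust either to reduce it.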
However, the combination of Benoit and Ruihao does not teach for each computational block, adjusting the set of respective real-valued weights and the respective scaling factor with an objective of reducing the computed cost in one or more following training iterations.
In the same field of endeavor, Nagel teaches for each computational block, adjusting the set of respective real-valued weights and the respective scaling factor with an objective of reducing the computed cost in one or more following training iterations (a neural network quantization method for performing cross layer is illustrated. The processor may scale each of the output channel weights by a corresponding scaling factor. The processor may change or adjust the weights within the layer by scaling the output channel weights by a corresponding scaling factor [par. 0063-0065 and FIG. 4]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have incorporated the concept of scaling each output channel weight of a first layer of the neural network by a corresponding scaling factor as suggested by Nagel into the combination of Benoit and Ruihao because all of the systems address the process of performing quantization in the neural network. Doing so would be desirable because the combination of Benoit and Ruihao would be more efficient by scaling each weight of each layer of the neural network by a corresponding scaling factor, as well as determining the scaling factor to equalize ranges of weight tensors or output channel weights to minimize the quantization error (Nagel, [par. 0003-0004]).
With respect to claim 3, the combination of Benoit, Ruihao and Nagel teaches wherein the neural network comprises an input block prior to the plurality of computational blocks, and an output block following the plurality of computational blocks, the input block, plurality of computational blocks, and output block arranged as respective layers of the neural network to collectively process input feature tensors received at the input block representing objects and output, from the output block, respective predictions for the objects (Benoit, input block, output block and a plurality of computational blocks are described in the training of integer-arithmetic-only inference and training with simulated quantization. All input variables and computations are carried out using 32-bit floating-point arithmetic [page 2, 1. Introduction and Figure 1.1]), and
wherein, for each training iteration, the respective set of input activations for each of the plurality of computational blocks following a first computational block is the set of output activations computed by a preceding computation block of the plurality of computational blocks, and each training iteration comprises, for each computational block, applying a respective activation quantization function to the respective set of input activations to generate a respective set of quantized activations (Benoit, for each layer, the quantization function is applied with a number of quantization levels and a quantization range, where r is the real-valued number to be quantized and n is the number of quantization levels, which is fixed for all layers [page 5, 3. Training with simulated quantization]), wherein for each computational block, computing the set of respective output activations for the computational block is based on a matrix multiplication of the respective set of quantized activations and the respective set of quantized weights for the computational block (Benoit, real numbers are used in the multiplication of two square NxN matrices, and the quantization parameters are applied to perform the matrix multiplication. The quantization scales S1, S2, S3 are also used in the computation to generate the output activations [page 3, 2.2. Integer-arithmetic-only matrix multiplication]);
wherein, for each training iteration, computing the cost comprises computing an error between the respective predictions for the objects and expected values for the objects (Ruihao, clipping and rounding together cause the quantization error. When the quantizer clips more, the clipping error increases and the rounding error decreases. When clipping error dominates the whole quantization error, the outlier’s gradients will be large and thus serve as the major power for weight updating [page 4, 3.4. Balancing clipping error and rounding error]).
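For illustration only, the matrix multiplication of two square NxN quantized matrices using the quantization scales S1, S2 and S3, as mapped to Benoit above, may be sketched as follows. The sketch assumes the affine form r = S(q - Z) with zero-points Z1, Z2, Z3 and 8-bit unsigned codes; the function name, the sample values, and the use of a floating-point multiplier (rather than a fixed-point one) are simplifying assumptions for this sketch.

```python
def quantized_matmul(q1, q2, s1, z1, s2, z2, s3, z3):
    """Multiply two N x N matrices given as integer codes.

    Real values are assumed to follow r = S * (q - Z). The accumulation is
    done on integers only; the single real-valued factor is M = S1 * S2 / S3.
    """
    n = len(q1)
    m = s1 * s2 / s3
    out = []
    for i in range(n):
        row = []
        for k in range(n):
            # Integer accumulator over zero-point-corrected codes.
            acc = sum((q1[i][j] - z1) * (q2[j][k] - z2) for j in range(n))
            q3 = z3 + round(m * acc)
            row.append(max(0, min(255, q3)))  # clamp to the uint8 range
        out.append(row)
    return out


# Codes representing [[1, 0], [-1, 2]] (scale 0.5) and [[1, -1], [0, 2]] (scale 1.0).
q1 = [[130, 128], [126, 132]]
q2 = [[129, 127], [128, 130]]
q3 = quantized_matmul(q1, q2, s1=0.5, z1=128, s2=1.0, z2=128, s3=0.25, z3=128)
real = [[0.25 * (v - 128) for v in row] for row in q3]
print(q3, real)
```

Decoding the result with the output scale and zero-point recovers the real-valued product of the two input matrices.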
With respect to claim 4, the combination of Benoit, Ruihao and Nagel teaches wherein, for each computational block, applying the respective activation quantization function generates the respective set of quantized activations scaled within a respective activation quantization range that is symmetrically centered at zero and comprises a defined number of uniform activation quantization levels (Ruihao, the differentiable soft quantization (DSQ) function is proposed to optimize the DSQ training and network parameters. Algorithm 1 of the training includes input activation a, parameters weight w and similarity factor, and output activation o. The computational costs of SADDW and MLA are quite different: SADDW incurs more computational cost than MLA. The DSQ function is used to approximately model the uniform quantizer, wherein k is the coefficient that determines the shape of the function, and k is aligned with the quantization levels [page 3, 3.2. Quantization Function; page 5, 3.5. Training and Deploying and Figure 4]).
With respect to claim 5, the combination of Benoit, Ruihao and Nagel teaches wherein the computational blocks include at least one computational block that implements one of a fully connected neural network layer or a convolution neural network layer (Benoit, activations are quantized at points where they would be during inference after the activation function is applied to a convolutional or fully connected layer’s output [page 5, 3. Training with simulated quantization]).
With respect to claim 6, the combination of Benoit, Ruihao and Nagel teaches wherein for each computational block, the respective quantization function is a piecewise function comprising a plurality of repeated, shifted functions that each correspond to a respective uniform quantization level, and wherein incrementally reducing the smoothness of the respective quantization functions comprises incrementally increasing a slope of the function.
With respect to claim 7, the combination of Benoit, Ruihao and Nagel teaches wherein adjusting the set of respective real-valued weights and the respective scaling factor for each computational block is performed using a derivative of a corresponding one of the plurality of repeated, shifted functions for at least some of the plurality of training iterations.
With respect to claim 8, the combination of Benoit, Ruihao and Nagel teaches wherein incrementally reducing the smoothness of the respective quantization functions is performed in a linear manner across at least a first group of the plurality of training iterations and is suspended when a predetermined criteria is reached, following which a quantization function of constant smoothness is used as the respective quantization functions for a remainder of the plurality of training iterations.
With respect to claim 10, the combination of Benoit, Ruihao and Nagel teaches deploying a trained version of the neural network that includes the quantized weights version for each of the computational blocks, the trained version of the neural network representing a compressed neural network (Ruihao, DSQ is implemented using Pytorch as a flexible module that can be easily inserted into binary or uniform quantization models. Two strategies are used: moving average statistics and optimization by backward propagation. ARM NEON 8-bit GEMM kernels are also used to accelerate inference even when using extremely low bit widths [pages 5-8, 4. Experiments]).
With respect to claim 11, it recites a processing unit corresponding to the method of claim 1. Therefore, it is rejected for the same reasons set forth for claim 1 above.
With respect to claim 13, it recites a processing unit corresponding to the method of claim 3. Therefore, it is rejected for the same reasons set forth for claim 3 above.
With respect to claim 14, it recites a processing unit corresponding to the method of claim 4. Therefore, it is rejected for the same reasons set forth for claim 4 above.
With respect to claim 15, it recites a processing unit corresponding to the method of claim 5. Therefore, it is rejected for the same reasons set forth for claim 5 above.
With respect to claim 16, it recites a processing unit corresponding to the method of claim 6. Therefore, it is rejected for the same reasons set forth for claim 6 above.
With respect to claim 17, it recites a processing unit corresponding to the method of claim 7. Therefore, it is rejected for the same reasons set forth for claim 7 above.
With respect to claim 18, it recites a processing unit corresponding to the method of claim 8. Therefore, it is rejected for the same reasons set forth for claim 8 above.
With respect to claim 19, it recites a processing unit corresponding to the method of claim 10. Therefore, it is rejected for the same reasons set forth for claim 10 above.
With respect to claim 20, it recites a non-transitory computer readable storage corresponding to the method of claim 1. Therefore, it is rejected for the same reasons set forth for claim 1 above.
With respect to claim 21, the combination of Benoit, Ruihao and Nagel teaches deploying a trained version of the neural network that includes the quantized weights version for each of the computational blocks on a computationally constrained hardware device, wherein the quantized weights version of the adjusted set of respective real-valued weights optimize performance of the neural network on the inference task when the neural network is deployed on the computationally constrained hardware device (Ruihao, for deployment on devices with limited computing resources, low-bit computation kernels are implemented to accelerate inference on the ARM architecture. In convolution networks, multiply and accumulate are the core operations of General Matrix Multiply, which can be efficiently completed by the MLA instruction on ARM NEON. DSQ is implemented using Pytorch as a flexible module that can be easily inserted into binary or uniform quantization models. Two strategies are used: moving average statistics and optimization by backward propagation [pages 5-8, 3.5. Training and Deploying & 4. Experiments]).
With respect to claim 22, it recites a processing unit corresponding to the method of claim 21. Therefore, it is rejected for the same reasons set forth for claim 21 above.
Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Benoit et al (“Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference”) hereafter Benoit, further in view of Ruihao et al (“Differentiable Soft Quantization: Bridging Full-Precision and Low-Bit Neural Networks”) hereafter Ruihao, and further in view of Nagel et al (US 20200302299 A1) hereafter Nagel, as applied to claim 1 above, and further in view of Bialkowski et al (US 20100008594 A1) hereafter Bialkowski.
Benoit was cited in the IDS filed on 03/22/2022.
Ruihao was cited in the IDS filed on 03/22/2022.
With respect to claim 9, the combination of Benoit, Ruihao and Nagel teaches all limitations of claim 1, as set forth above.
However, the combination of Benoit, Ruihao and Nagel does not disclose wherein the defined number of uniform quantization levels is 15.
In the same field of endeavor, Bialkowski teaches wherein the defined number of uniform quantization levels is 15 (an image quality can be seen using quantization level using one or two quantization when used in a video coding method. With a quantization level of 15 the amplitudes from 0-14 or from 15-29 are each summarized to a reconstruction value [par. 0008]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have incorporated the concept of transcoding the quantized digital signals as suggested by Bialkowski into the combination of Benoit, Ruihao and Nagel because all of the systems address the process of performing quantization in the neural network. Doing so would be desirable because the combination of Benoit, Ruihao and Nagel would be more efficient by including a quantization level of 15 when performing quantization to enhance the image quality (Bialkowski, [par. 0008]).
Response to Arguments
The examiner respectfully acknowledges the applicant’s amendments to claims 1, 3, 10, 11, 13, 19 and 20.
Applicant’s amendments filed on 10/21/2025 regarding the claim objections to claims 10 and 25 have been considered, and those objections are consequently withdrawn. New matter has been added regarding newly added claims 21 and 22.
Applicant’s arguments filed on 10/21/2025 regarding the claim rejections of claims 1-20 under 35 USC 101 have been considered, and those rejections are consequently withdrawn.
Applicant’s arguments filed on 10/21/2025 regarding claim rejections to claims 1-20 under 35 USC 103 have been fully considered but are not persuasive.
Applicant argued that “Ruihao is directed toward a method of differentiable soft quantization (DSQ). DSQ approximates a quantization function using a differentiable asymptotic function p(x), for example, using tanh with an additional coefficient k that adjusts the shape of the asymptotic function (e.g., the steepness of the slope) and a scaling parameter s for ensuring that tanh functions of p for adjacent intervals can be connected. The Applicant submits that neither the coefficient k, the scaling parameter s or the similarity factor a of Ruihao can be considered to be equivalent to the scaling factor as recited in claim 1, at least because the uniform quantization levels of Ruihao do not correspond to integer multiples of either the coefficient k, the scaling parameter s or the similarity factor a, as recited in amended claim 1. Accordingly, the Applicant submits that Ruihao fails to disclose any "scaling factor regularization function" as recited in amended claim 1.”
Examiner respectfully disagrees.
In the teaching of Ruihao, Differentiable Soft Quantization (DSQ) does not specifically use the term “scaling factor regularization function”. However, Ruihao teaches several closely related concepts that fulfill a similar role in optimization:
Learnable Clipping Range: DSQ treats the upper and lower bounds (l,u) of the quantization range as parameters that are optimized alongside the network’s weights. These bounds directly determine the scaling factor used during the quantization process [page 3, 3.1. Preliminaries and page 4, 3.4. Balancing clipping error and rounding error].
Soft Function: a function based on the hyperbolic tangent is used that gradually becomes more like a standard step function during training. This acts as a form of implicit regularization, smoothing the optimization landscape early in training to help the scaling factors and weights converge more easily [page 6, 4.2.3. Evolution].
Balanced Loss: the clipping values (l,u) and a similarity factor 𝛼 are optimized simultaneously to balance two types of error: clipping error (data loss due to values falling outside the range) and rounding error (data loss due to precision limits). This keeps the scaling factors from becoming extreme or unstable [page 4, 3.4. Balancing clipping error and rounding error].
The entire DSQ framework is designed to automatically learn and stabilize the scaling factor through a differentiable, evolutionary training process.
The parameter 𝛼, the clipping bounds (l,u) and the coefficient k are directly related to the scaling factor concept. The parameter 𝛼 is a characteristic variable that controls how closely the differentiable soft function mimics a standard step function. It acts as a control parameter for the soft quantization curve. The clipping bounds are learnable parameters. Instead of just using a fixed “scaling factor regularization function”, the DSQ framework optimizes these clipping values with the task loss to automatically minimize the combined error of clipping and rounding. Finally, the variable k acts as a coefficient within the hyperbolic tangent function tanh(k,x) that is used to define the steepness of the approximation. k and 𝛼 are mathematically related, wherein k is the internal mathematical lever that allows the scaling of the gradients during backpropagation.
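For illustration only, the relationship among the clipping bounds (l,u), the coefficient k, the scaling parameter s and the parameter 𝛼 discussed above may be sketched as follows. The sketch assumes a tanh-based soft quantizer in which each interval of width delta between adjacent levels carries its own shifted tanh segment, with k = ln(2/𝛼 - 1)/delta and s = 1/(1 - 𝛼) so that adjacent segments connect. These formulas and the function name `dsq` are assumptions made for this sketch and should be checked against the reference before being relied upon.

```python
import math


def dsq(x, lower, upper, num_levels, alpha):
    """Soft quantizer sketch: one shifted tanh segment per interval.

    A smaller `alpha` gives a steeper segment, so the output moves closer to
    hard rounding onto the uniform levels between `lower` and `upper`.
    """
    delta = (upper - lower) / (num_levels - 1)   # spacing between levels
    if x <= lower:
        return lower
    if x >= upper:
        return upper
    k = math.log(2.0 / alpha - 1.0) / delta      # steepness of the segment
    s = 1.0 / (1.0 - alpha)                      # makes segments meet at levels
    i = min(int((x - lower) // delta), num_levels - 2)
    mid = lower + (i + 0.5) * delta              # midpoint of interval i
    phi = s * math.tanh(k * (x - mid))
    return lower + delta * (i + 0.5 * (phi + 1.0))


soft = dsq(0.3, lower=-1.0, upper=1.0, num_levels=3, alpha=0.5)
hard = dsq(0.3, lower=-1.0, upper=1.0, num_levels=3, alpha=1e-6)
print(soft, hard)
```

With levels at -1, 0 and 1, the input 0.3 stays near 0.28 for a large 𝛼 and is pulled toward the level 0 as 𝛼 shrinks, which is the evolution toward a step function described above.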
Therefore, independent claims 1, 11 and 20 are not patentable for at least the reasons above. Dependent claims 3-10, 13-19 and 21-22, which depend either directly or indirectly on claims 1, 11 and 20, are also not patentable for the same reasons.
Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Quoc Phung whose telephone number is (703) 756 1330. The examiner can normally be reached on Monday through Friday from 9am to 5pm PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jennifer Welch, can be reached on 571-272-7212. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Q.L.P./Examiner, Art Unit 2143
/JENNIFER N WELCH/Supervisory Patent Examiner, Art Unit 2143