DETAILED ACTION
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR
1.17(e), was filed in this application after final rejection. Since this application is eligible for continued
examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the
finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's
submission filed on 17 February 2026 has been entered.
Response to Amendment
The amendment filed on 17 February 2026 has been entered.
Claims 1-19 were previously pending.
Claims 1, 6, 10, 12, 15, and 19 have been amended.
Claims 2 and 11 have been cancelled.
Claims 20-23 are new.
Claims 1, 3-10, and 12-23 are currently pending.
Response to Arguments
Applicant’s arguments, filed 17 February 2026, regarding the rejections of Claims 1-19 under 35 USC 101 have been fully considered and are persuasive. The rejections of Claims 1-19 under 35 USC 101 have been withdrawn.
Applicant’s remarks, regarding the rejections of claims under 35 USC 103, have been fully considered.
Applicant respectfully disagrees with the rejection of Claim 1 but has amended the claim to expedite allowance. As amended, Claim 1 recites in part:
wherein:
the reference layer is structurally subsequent to the quantization target layer in the neural network's forward path;
the statistical information represents a distribution range of the layer parameters related to the reference layer; and
the determining step determines the quantization range for the layer parameters related to the quantization target layer such that at least part of a distribution range of the layer parameters related to the quantization target layer is excluded from the quantization range for the layer parameters related to the quantization target layer, the excluded part of the distribution range of the layer parameters related to the quantization target layer matching a region lying outside the distribution range of the layer parameters related to the reference layer.
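For illustration only, the following minimal Python sketch shows one way such a range determination and selective quantization could operate; all names are hypothetical, and treating the claimed "distribution range" as a simple [min, max] envelope is an assumption, not a construction of the claim.

```python
import numpy as np

def determine_quantization_range(target_params, reference_params):
    # Assumption: each "distribution range" is the [min, max] envelope
    # of the layer parameters in question.
    ref_lo, ref_hi = reference_params.min(), reference_params.max()
    tgt_lo, tgt_hi = target_params.min(), target_params.max()
    # Exclude the part of the target layer's distribution that lies
    # outside the reference layer's distribution range.
    return max(tgt_lo, ref_lo), min(tgt_hi, ref_hi)

def quantize_in_range(params, q_lo, q_hi, bits=8):
    # Only parameters inside the quantization range are selected and
    # quantized down to the lower bit width.
    selected = (params >= q_lo) & (params <= q_hi)
    scale = (q_hi - q_lo) / (2 ** bits - 1)
    codes = np.round((params[selected] - q_lo) / scale).astype(np.uint8)
    return codes, selected
```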
Claim 1 has been amended to incorporate features similar to those recited in Claim 2, as previously presented, together with additional amendments.
Applicant respectfully asserts that Liu and Ha, alone or in combination, do not appear to explicitly teach each and every feature of amended Claim 1.
Applicant’s arguments have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 21 and 23 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Claim 21 and analogous claim 23 recites the limitation "the activation layer" in line 2. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, the term "the activation layer" has been construed to be “an activation layer”.
Claim 21 and analogous claim 23 recites the limitation "the layer parameters" in lines 3, 5. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, the term "the layer parameters" has been construed to be “the layer parameters related to the quantization target layer” of claim 1 and claim 10, respectively.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1, 6-10, and 15-19 are rejected under 35 U.S.C. 103 as being unpatentable over Liu et al. (U.S. Pre-Grant Publication No. 20210374510, hereinafter 'Liu'), in view of Ha et al. (U.S. Pre-Grant Publication No. 20210201117, hereinafter 'Ha'), and further in view of Mathew et al. (U.S. Pre-Grant Publication No. 20210224658, hereinafter 'Mathew').
Regarding claim 1 and analogous claims 10 and 19, Liu teaches A computer-implemented method of quantizing a neural network that comprises sequential layers, each of the sequential layers having weights and being configured to output, using the weights, features to a subsequent one of the sequential layers or another device, the sequential layers including a quantization target layer and a reference layer other than the quantization target layer, the method comprising ([0039] The step S11 includes: determining a plurality of pieces of data to be the sequential layers including a quantization target layer quantized in target data of a layer to be quantized, where each piece of data to be quantized is a subset of the target data, the target data is any kind of data to be operated and quantized in the layer to be quantized, and the data to be operated includes at least one of an input neuron, a weight, a bias, and a gradient.; [0040] A layer to be quantized in the neural network may be any layer in the neural network. Some or all of the layers in the neural network may be determined as the layers to be quantized according to requirements. When the neural network a reference layer other than the quantization target layer includes a plurality of layers to be quantized, each layer to be quantized may be continuous or discontinuous. Different neural networks may have different types of layers to be quantized, for example, the layer to be quantized may be a convolution layer, a fully connected layer, etc. The present disclosure does not limit the quantity and types of layers to be quantized.; [0041] In a possible implementation manner, the data to be operated includes at least one of an input neuron, a weight, a bias, and a gradient. The at least one of an input neuron, each of the sequential layers having weights a weight, a bias, and a gradient in the layer to be quantized may be quantized according to requirements. The target data is any kind of data to be operated. For example, when the data to be operated is a neuron, a weight, and a bias, and the neuron and the weight to be quantized, then the neuron is target data 1, and the weight is target data 2.; [0055] The quantization parameter may be determined by looking up the correspondence between the data feature and the quantization parameter configured to output, using the weights, features to a subsequent one of the sequential layers or another device according to the data feature of each data to be quantized.):
retrieving, with a processor from the reference layer, statistical information on layer parameters related to the reference layer, the layer parameters including at least one of the features and weights of the reference layer, each of the layer parameters being comprised of predetermined bits ([0052] The method of determining the quantization parameter corresponding to the data to be quantized may further include: directly retrieving, with a processor from the reference layer, statistical information on layer parameters related to the reference layer determining the quantization parameter corresponding to each piece of data to be quantized. The target data may not have a corresponding quantization parameter, or the target data may have a corresponding quantization parameter but the corresponding quantization parameter is not adopted by the data to be quantized. The corresponding quantization parameter may be directly set for each piece of data to be quantized, or the corresponding quantization parameter may be obtained by computing according to the data to be quantized.; [0055] The quantization parameter may be determined by looking up the correspondence between the data feature and the quantization parameter according to the data feature of each data to be quantized.);
quantizing, with a processor, selected layer parameters in the layer parameters related to the quantization target layer, the selected layer parameters being within the quantization range to lower the number of bits of each of the selected layer parameters ([0057] The step S13 includes: obtaining a quantization result of the target data according to the piece of quantized data corresponding to each piece of data to be quantized, so that an quantizing selected layer parameters in the layer parameters related to the quantization target layer operation may be performed in the layer to be quantized according to the quantization result of the target data.; [0058] The set quantization algorithm may be used to quantize the data to be quantized selected layer parameters being within the quantization range according to the quantization parameter to obtain the quantized data.).
Liu fails to teach retrieving, with a processor from the reference layer, statistical information on layer parameters related to the reference layer, the layer parameters including at least one of the features and weights of the reference layer, each of the layer parameters being comprised of predetermined bits; determining, with a processor based on the statistical information, a quantization range for the layer parameters related to the quantization target layer, the quantization range defining that extracted parameters within the quantization range are those that are subject to quantization; and quantizing, with a processor, selected layer parameters in the layer parameters related to the quantization target layer, the selected layer parameters being within the quantization range to lower the number of bits of each of the selected layer parameters,
wherein: the reference layer is structurally subsequent to the quantization target layer in the neural network's forward path; the statistical information represents a distribution range of the layer parameters related to the reference layer; and the determining step determines the quantization range for the layer parameters related to the quantization target layer such that at least part of a distribution range of the layer parameters related to the quantization target layer is excluded from the quantization range for the layer parameters related to the quantization target layer, the excluded part of the distribution range of the layer parameters related to the quantization target layer matching a region lying outside the distribution range of the layer parameters related to the reference layer.
Ha teaches retrieving, with a processor from the reference layer, statistical information on layer parameters related to the reference layer, the layer parameters including at least one of the features and weights of the reference layer, each of the layer parameters being comprised of predetermined bits ([0075] The processor 110 may generate a trained neural network by repeatedly or iteratively training an initial neural network. For example, as the initial neural network is trained, a pre-trained neural network may be generated. In this case, to secure accurate processing by a neural network, the each of the layer parameters being comprised of predetermined bits initial neural network may have floating-point parameters, for example, parameters of 32-bit floating-point precision. The layer parameters including at least one of the features and weights of the reference layer parameters may include various types of data input to and output from the neural network, for example, input activations, output activations, weights, and biases of the neural network. As the training of the neural network is repeated, the floating-point parameters of the neural network may be tuned to calculate a more accurate output for a given input.);
determining, with a processor based on the statistical information, a quantization range for the layer parameters related to the quantization target layer, the quantization range defining that extracted parameters within the quantization range are those that are subject to quantization ([0009] The determining of the a quantization range for the layer parameters related to the quantization target layer corresponding second quantization range may include, for each channel: performing the quantization range defining that extracted parameters within the quantization range are those that are subject to quantization quantization of the parameter values for each piece of data of a data set, based on the determined PDF; from a result of the quantization, determining, with a processor based on the statistical information obtaining a statistical point of the parameter values where the SQNR is maximum, for each piece of the data of the data set; calculating a critical point, based on a normalized weighted sum operation of the statistical points; and determining the second quantization range, based on the critical point.); and
quantizing, with a processor, selected layer parameters in the layer parameters related to the quantization target layer, the selected layer parameters being within the quantization range to lower the number of bits of each of the selected layer parameters ([0051] To efficiently process an operation related to a neural network, one or more embodiments of the present disclosure may convert a floating-point parameter value into a fixed-point parameter value.).
Liu and Ha are considered to be analogous to the claimed invention because they are in the same field of machine learning. In view of the teachings of Liu, it would have been obvious for a person of ordinary skill in the art to apply the teachings of Ha to Liu before the effective filing date of the claimed invention in order to efficiently process an operation related to a neural network (cf. Ha, [0051] To efficiently process an operation related to a neural network, one or more embodiments of the present disclosure may convert a floating-point parameter value into a fixed-point parameter value.).
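As a non-limiting aside on Ha's SQNR-based range selection ([0009]), the following Python sketch illustrates picking, from candidate clipping ranges, the one whose quantization maximizes the signal-to-quantization-noise ratio; the uniform quantizer and the candidate-search framing are assumptions, since Ha's per-channel PDF and critical-point computation are not reproduced here.

```python
import numpy as np

def sqnr_db(x, x_q):
    # Signal-to-quantization-noise ratio in decibels.
    noise = x - x_q
    return 10.0 * np.log10(np.sum(x ** 2) / (np.sum(noise ** 2) + 1e-12))

def best_range_by_sqnr(x, candidate_ranges, bits=8):
    # Assumption: uniform quantization over each candidate range.
    best_range, best_sqnr = None, -np.inf
    for lo, hi in candidate_ranges:
        scale = (hi - lo) / (2 ** bits - 1)
        x_q = np.clip(np.round((x - lo) / scale), 0, 2 ** bits - 1) * scale + lo
        s = sqnr_db(x, x_q)
        if s > best_sqnr:
            best_range, best_sqnr = (lo, hi), s
    return best_range
```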
Mathew teaches wherein: the reference layer is structurally subsequent to the quantization target layer in the neural network's forward path ([0041] FIG. 2 is a flow diagram of an example of bias calibration for a convolution layer using post training quantization with calibration. This can be done for all layers with bias parameters, not only convolution. In this example, a previous layer 201, a current layer 202, and a next layer 203 are represented.);
the statistical information represents a distribution range of the layer parameters related to the reference layer ([0023] In described examples, parametric activation functions with power-of-2 ranges (PACT2) is used to clip the activation feature map ranges. This PACT2 function estimates the range of incoming activations and clips them to a suitable power-of-2 number in a way that the clipping does not introduce too much degradation.; [0030] However, it is possible to the statistical information statistically estimate these clipping values. At 101, a set of data elements is received, such as a matrix of data that represents one layer of a portion of an image. A histogram is used to discard 0.01 percentile (1e-4 in fraction) from the tail(s) of the feature map distribution to estimate the maximum value. PACTu has only one tail and PACTs has two tails. At 102, exponential moving average (EMA) is used to smooth the maximum value over several batches using historical values 110 to find a smoothed maximum value of the distribution. At 103, this smoothed value is then expanded to the next power-of-2 to find a clipping value cc. In PACTs a single common value is used as the magnitude of ∝ since symmetric quantization is being used. Thus, in PACTs, the represents a distribution range of the layer parameters related to the reference layer maximum of the magnitude of both sides is used as ∝a.; [0031] At 104, clipping is performed as defined by expression (3) using the clipping values l, h determined at 103.; [0032] At 105, activation quantization is performed, as described in more detail below. In post training quantization as well as in quantization aware training, a histogram and EMA is used to estimate a suitable value of ∝a.); and
the determining step determines the quantization range for the layer parameters related to the quantization target layer such that at least part of a distribution range of the layer parameters related to the quantization target layer is excluded from the quantization range for the layer parameters related to the quantization target layer, the excluded part of the distribution range of the layer parameters related to the quantization target layer matching a region lying outside the distribution range of the layer parameters related to the reference layer ([0033] During calibration, the clipping thresholds for activations is estimated, as well as weights and biases for convolution and inner product layers. These clipping thresholds are herein referred to as “∝a,” “∝w,” and “∝b,” for the activations, weights and biases respectively. These values are computed separately for each layer.; [0038] such that at least part of a distribution range of the layer parameters related to the quantization target layer is excluded from the quantization range for the layer parameters related to the quantization target layer Clipping values of weights and biases are directly obtained from the weights and biases themselves. Given these definitions, the only information that needs to be conveyed to the inference engine that does quantized inference is the clipping values ∝a of the activations. These clipping values can be directly indicated in the network model as a clip operation on the activations and such a model is referred to as the calibrated model. Thus, to convey quantization information in the model, it is sufficient to insert PACT2 operations in the model wherever the activations need to be constrained.; [0024] Inserting the PACT2 function serves two purposes. First, it the excluded part of the distribution range of the layer parameters related to the quantization target layer matching a region lying outside the distribution range of the layer parameters related to the reference layer clips the activation feature map so that it is contained within a certain range. Second, the PACT2 clip function becomes part of the model graph (for example, the ONNX model; ONNX is an open format for machine learning (ML) models, allowing models to be interchanged between various ML frameworks and tools) and it is easy for the inference engine to understand what the clipping range is. Once these clipping values are known, the the determining step determines the quantization range for the layer parameters related to the quantization target layer quantization scale factors can be derived easily.).
Liu, Ha, and Mathew are considered to be analogous to the claimed invention because they are in the same field of machine learning. In view of the teachings of Liu and Ha, it would have been obvious for a person of ordinary skill in the art to apply the teachings of Mathew to the combination of Liu and Ha before the effective filing date of the claimed invention in order to clip the activation feature map so that it is contained within a certain range (cf. Mathew, [0024] Inserting the PACT2 function serves two purposes. First, it clips the activation feature map so that it is contained within a certain range. Second, the PACT2 clip function becomes part of the model graph (for example, the ONNX model; ONNX is an open format for machine learning (ML) models, allowing models to be interchanged between various ML frameworks and tools) and it is easy for the inference engine to understand what the clipping range is. Once these clipping values are known, the quantization scale factors can be derived easily.).
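Mathew's [0030] describes a concrete statistical clipping estimate: discard a 0.01-percentile tail via a histogram, smooth the maximum with an exponential moving average, and expand to the next power of two. A minimal Python sketch of that scheme follows, with hypothetical names and a quantile standing in for an explicit histogram:

```python
import numpy as np

def pact2_clip_value(x, ema_max, percentile=1e-4, decay=0.9):
    # Discard the extreme 0.01-percentile tail (1e-4 in fraction) when
    # estimating the maximum magnitude of the feature map.
    raw_max = np.quantile(np.abs(x), 1.0 - percentile)
    # Smooth the maximum over batches with an exponential moving average.
    ema_max = decay * ema_max + (1.0 - decay) * raw_max
    # Expand the smoothed value to the next power of two to obtain the
    # clipping value alpha.
    alpha = 2.0 ** np.ceil(np.log2(ema_max))
    return alpha, ema_max
```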
Regarding claim 6, Liu, as modified by Ha and Mathew, teaches The method of claim 1.
Liu teaches the statistical information represents an indicator indicative of a level of optimization of the reference layer; and the determining step determines the quantization range for the layer parameters related to the quantization target layer to thereby maximize the indicator ([0036] In a possible implementation manner, the processors mentioned in the present disclosure may include a plurality of processing units, and each processing unit may independently execute various assigned tasks, such as convolution operation task, pooling task, or fully connected task, etc. The present disclosure does not limit the processing unit and the tasks executed by the processing unit.; [0132]
The step S15 includes: statistical information represents an indicator indicative of a level of optimization of the reference layer according to the quantization error and an error threshold corresponding to each piece of data to be quantized, determining step determines the quantization range for the layer parameters related to the quantization target layer to thereby maximize the indicator adjusting the data bit width of each piece of data to be quantized to obtain an adjusted bit width corresponding to each piece of data to be quantized.; [0133] The error threshold may be determined based on the empirical value, and the error threshold may be used to indicate an expected value of the quantization error. When the quantization error is greater than or less than the error threshold, the data bit width corresponding to the data to be quantized may be adjusted to obtain the adjusted bit width corresponding to the data to be quantized. The data bit width may be adjusted to a longer bit width or a shorter bit width to increase or decrease the quantization precision.).
Liu, Ha, and Mathew are combinable for the same rationale as set forth above with respect to claim 1.
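For reference, Liu's [0132]-[0134] error-threshold scheme admits a simple reading, sketched below in Python with hypothetical names; the mean-absolute quantization error is an assumption, as Liu does not fix the error metric:

```python
import numpy as np

def adjust_bit_width(x, bits, err_hi, err_lo, step=2):
    # Uniformly quantize at the current bit width and measure the error.
    scale = (x.max() - x.min()) / (2 ** bits - 1)
    x_q = np.round((x - x.min()) / scale) * scale + x.min()
    err = np.mean(np.abs(x - x_q))
    if err > err_hi:
        bits += step                 # precision too low: lengthen the bit width
    elif err < err_lo:
        bits = max(2, bits - step)   # precision excessive: shorten it
    return bits
```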
Regarding claim 7, Liu, as modified by Ha and Mathew, teaches The method of claim 1.
Liu teaches wherein: the quantizing step includes: a step of determining first and second clip thresholds based on the quantization range ([0132] The step S15 includes: according to the quantization error and an error threshold corresponding to each piece of data to be quantized, adjusting the data bit width of each piece of data to be quantized to obtain an adjusted bit width corresponding to each piece of data to be quantized.; [0134] The step of determining first and second clip thresholds based on the quantization range error threshold may be determined according to the maximum acceptable error. When the quantization error is greater than the error threshold, it means that the quantization precision cannot meet expectations, and the data bit width needs to be adjusted to a longer bit width. A smaller error threshold may be determined based on a higher quantization precision. When the quantization error is less than the error threshold, it means that the quantization precision is high, which may affect the operation efficiency of the neural network, in this case, the data bit width may be adjusted to a shorter bit width to appropriately decrease the quantization precision and improve the operation efficiency of the neural network.); and
a step of clipping at least one of the quantized layer parameters ([0133] The error threshold may be determined based on the empirical value, and the error threshold may be used to indicate an expected value of the quantization error. When the quantization error is greater than or less than the error threshold, the data bit width corresponding to the data to be quantized may be adjusted to obtain the adjusted bit width corresponding to the data to be quantized. The step of clipping at least one of the quantized layer parameters data bit width may be adjusted to a longer bit width or a shorter bit width to increase or decrease the quantization precision.),
the at least one of the quantized layer parameters lying outside a range defined between the first and second clip thresholds; and an indicator is an error due to at least one of the quantizing step and the clipping step ([0135] The data bit width may be adjusted according to a stride of fixed bits, or the data bit width may be adjusted according to a variable adjustment stride based on the indicator is an error due to at least one of the quantizing step and the clipping step difference between the quantization error and the error threshold, which is not limited in the present disclosure.; [0136] The step S16 includes: the at least one of the quantized layer parameters lying outside a range defined between the first and second clip thresholds updating the data bit width corresponding to each piece of data to be quantized to the corresponding adjusted bit width, and computing a corresponding adjusted quantization parameter according to each piece of data to be quantized and the corresponding adjusted bit width to quantize each piece of data to be quantized according to the corresponding adjusted quantization parameter.).
Liu, Ha, and Mathew are combinable for the same rationale as set forth above with respect to claim 1.
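Illustratively, the two-threshold clipping recited in claim 7 can be sketched as follows (hypothetical Python; the mean-absolute clipping error standing in for the claimed "indicator" is an assumption):

```python
import numpy as np

def clip_quantized(params_q, t_low, t_high):
    # Clip quantized layer parameters lying outside the range defined
    # between the first and second clip thresholds.
    clipped = np.clip(params_q, t_low, t_high)
    # An error due to the clipping step, usable as an indicator.
    clip_error = np.mean(np.abs(params_q - clipped))
    return clipped, clip_error
```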
Regarding claim 8, Liu, as modified by Ha and Mathew, teaches The method of claim 6.
Liu teaches wherein: the sequential layers include an output layer; and the indicator is a recognition accuracy of the output layer ([0033] It should be understood that the quantization precision refers to the magnitude of an error between the quantized data and the pre-quantized data, and the quantization precision may affect the accuracy of the operation result of the neural network. The higher the quantization precision is, the higher the accuracy of the operation result may be, but the amount of operation and the memory access overhead may also be larger.; [0137] The data to be quantized may be the sequential layers include an output layer re-quantized according to the adjusted quantization parameter corresponding to the data to be quantized to obtain the quantized data with higher or lower quantization precision, so that a balance between the quantization precision and the processing efficiency may be achieved in the layer to be quantized.; [0141] In the embodiment, the data bit width may be adjusted according to the error between the data to be quantized and the quantized data corresponding to the data to be quantized, and the adjusted quantization parameter may be obtained by computing according to the adjusted data bit width. By setting indicator is a recognition accuracy of the output layer different error thresholds, different adjusted quantization parameters may be obtained to achieve different quantization requirements such as improving quantization precision or improving the operation efficiency.).
Liu, Ha, and Mathew are combinable for the same rationale as set forth above with respect to claim 1.
Regarding claim 9, Liu, as modified by Ha and Mathew, teaches The method of claim 1.
Liu teaches wherein: the layer parameters include the weights of the reference layer ([0121] The correspondence between the data to be quantized and the quantization parameter may include correspondences between a plurality of pieces of data to be quantized and a plurality of quantization parameters corresponding thereto. For example, the correspondence A between the data to be quantized and the layer parameters include the weights of the reference layer quantization parameter includes: two pieces of data to be quantized including a neuron and a weight in a layer 1 to be quantized, three quantization parameters including a point position 1, a scaling factor 1, and an offset 1 corresponding to the neuron, and two quantization parameters including a point position 2 and an offset 2 corresponding to the weight.).
Liu, Ha, and Mathew are combinable for the same rationale as set forth above with respect to claim 1.
Regarding claim 15, Liu, as modified by Ha and Mathew, teaches The apparatus of claim 10.
Liu teaches the statistical information represents an indicator indicative of a level of optimization of the reference layer; and the determiner is configured to determine the quantization range for the layer parameters related to the quantization target layer to thereby maximize the indicator ([0036] In a possible implementation manner, the processors mentioned in the present disclosure may include a plurality of processing units, and each processing unit may independently execute various assigned tasks, such as convolution operation task, pooling task, or fully connected task, etc. The present disclosure does not limit the processing unit and the tasks executed by the processing unit.; [0132] The step S15 includes: statistical information represents an indicator indicative of a level of optimization of the reference layer according to the quantization error and an error threshold corresponding to each piece of data to be quantized, the determiner is configured to determine the quantization range for the layer parameters related to the quantization target layer to thereby maximize the indicator adjusting the data bit width of each piece of data to be quantized to obtain an adjusted bit width corresponding to each piece of data to be quantized.; [0133] The error threshold may be determined based on the empirical value, and the error threshold may be used to indicate an expected value of the quantization error. When the quantization error is greater than or less than the error threshold, the data bit width corresponding to the data to be quantized may be adjusted to obtain the adjusted bit width corresponding to the data to be quantized. The data bit width may be adjusted to a longer bit width or a shorter bit width to increase or decrease the quantization precision.).
Liu, Ha, and Mathew are combinable for the same rationale as set forth above with respect to claim 1.
Regarding claim 16, Liu, as modified by Ha and Mathew, teaches The apparatus of claim 10.
Liu teaches wherein: the quantizer is configured to: determine first and second clip thresholds based on the quantization range ([0132] The step S15 includes: according to the quantization error and an error threshold corresponding to each piece of data to be quantized, adjusting the data bit width of each piece of data to be quantized to obtain an adjusted bit width corresponding to each piece of data to be quantized.; [0134] The wherein: the quantizer is configured to: determine first and second clip thresholds based on the quantization range error threshold may be determined according to the maximum acceptable error. When the quantization error is greater than the error threshold, it means that the quantization precision cannot meet expectations, and the data bit width needs to be adjusted to a longer bit width. A smaller error threshold may be determined based on a higher quantization precision. When the quantization error is less than the error threshold, it means that the quantization precision is high, which may affect the operation efficiency of the neural network, in this case, the data bit width may be adjusted to a shorter bit width to appropriately decrease the quantization precision and improve the operation efficiency of the neural network.); and
clip at least one of the quantized layer parameters ([0133] The error threshold may be determined based on the empirical value, and the error threshold may be used to indicate an expected value of the quantization error. When the quantization error is greater than or less than the error threshold, the data bit width corresponding to the data to be quantized may be adjusted to obtain the adjusted bit width corresponding to the data to be quantized. The clip at least one of the quantized layer parameters data bit width may be adjusted to a longer bit width or a shorter bit width to increase or decrease the quantization precision.),
the at least one of the quantized layer parameters lying outside a range defined between the first and second clip thresholds; and an indicator is an error due to at least one of the quantizing step and the clipping step ([0135] The data bit width may be adjusted according to a stride of fixed bits, or the data bit width may be adjusted according to a variable adjustment stride based on the indicator is an error due to at least one of the quantizing step and the clipping step difference between the quantization error and the error threshold, which is not limited in the present disclosure.; [0136] The step S16 includes: the at least one of the quantized layer parameters lying outside a range defined between the first and second clip thresholds updating the data bit width corresponding to each piece of data to be quantized to the corresponding adjusted bit width, and computing a corresponding adjusted quantization parameter according to each piece of data to be quantized and the corresponding adjusted bit width to quantize each piece of data to be quantized according to the corresponding adjusted quantization parameter.).
Liu, Ha, and Mathew are combinable for the same rationale as set forth above with respect to claim 1.
Regarding claim 17, Liu, as modified by Ha and Mathew, teaches The apparatus of claim 15.
Liu teaches wherein: the sequential layers include an output layer; and the indicator is a recognition accuracy of the output layer ([0033] It should be understood that the quantization precision refers to the magnitude of an error between the quantized data and the pre-quantized data, and the quantization precision may affect the accuracy of the operation result of the neural network. The higher the quantization precision is, the higher the accuracy of the operation result may be, but the amount of operation and the memory access overhead may also be larger.; [0137] The data to be quantized may be the sequential layers include an output layer re-quantized according to the adjusted quantization parameter corresponding to the data to be quantized to obtain the quantized data with higher or lower quantization precision, so that a balance between the quantization precision and the processing efficiency may be achieved in the layer to be quantized.; [0141] In the embodiment, the data bit width may be adjusted according to the error between the data to be quantized and the quantized data corresponding to the data to be quantized, and the adjusted quantization parameter may be obtained by computing according to the adjusted data bit width. By setting indicator is a recognition accuracy of the output layer different error thresholds, different adjusted quantization parameters may be obtained to achieve different quantization requirements such as improving quantization precision or improving the operation efficiency.).
Liu, Ha, and Mathew are combinable for the same rationale as set forth above with respect to claim 1.
Regarding claim 18, Liu, as modified by Ha and Mathew, teaches The apparatus of claim 10.
Liu teaches wherein: the layer parameters include the weights of the reference layer ([0121] The correspondence between the data to be quantized and the quantization parameter may include correspondences between a plurality of pieces of data to be quantized and a plurality of quantization parameters corresponding thereto. For example, the correspondence A between the data to be quantized and the layer parameters include the weights of the reference layer quantization parameter includes: two pieces of data to be quantized including a neuron and a weight in a layer 1 to be quantized, three quantization parameters including a point position 1, a scaling factor 1, and an offset 1 corresponding to the neuron, and two quantization parameters including a point position 2 and an offset 2 corresponding to the weight.).
Liu, Ha, and Mathew are combinable for the same rationale as set forth above with respect to claim 1.
Claims 3, 5, 12, 14, 20, 22 are rejected under 35 U.S.C. 103 as being unpatentable over Liu, Ha, Mathew, and further in view of Carbon et al. (U.S. Pre-Grant Publication No. 20180373977, hereinafter 'Carbon').
Regarding claim 3, Liu, as modified by Ha and Mathew, teaches The method of claim 1.
Liu, as modified by Ha and Mathew, fails to teach wherein: the reference layer is an activation layer located subsequent to the quantization target layer, the activation layer having an activation function, and being configured to apply the activation function to the layer parameters related to the quantization target layer; the statistical information represents at least one saturation region included in an input-output characteristic of the activation function; and the determining step determines the quantization range for the layer parameters related to the quantization target layer such that at least part of a distribution range of the layer parameters related to the quantization target layer is excluded from the quantization range for the layer parameters related to the quantization target layer, the excluded part of the distribution range of the layer parameters related to the quantization target layer matching a majority part of at least one saturation region of the activation function.
Carbon teaches wherein: the reference layer is an activation layer located subsequent to the quantization target layer, the activation layer having an activation function ([0004] The description of this processing chain shows that the principal operations are accumulations of multiplications (weighted sums) between convolution filter coefficients or synaptic weights and input data for the layers. This is the case, for example, for convolution layers of the network input (coefficients) or for “fully-connected” layers (synaptic weights). In this example, a the reference layer is an activation layer located subsequent to the quantization target layer “fully-connected” layer is made up of neurons whose inputs are connected to all the outputs of the previous layer. These operations may be called MAC below for multiplication and accumulation.; [0005] FIG. 1 illustrates the modeling of a formal neuron, typically used in a deep neural network, implementing this series of MACs followed by a nonlinear function. The jth formal neuron of a the activation layer having an activation function neuron layer forms a sum 11 of the input values 12, x1 to xn, weighted with synaptic weights 13, w1j to wnj, and finally applies an activation function 14, which is the nonlinear function.), and
being configured to apply the activation function to the layer parameters related to the quantization target layer ([0038] By way of example, the shift(s) applied in order to perform said activation function is or are deduced from said position or from a previously saved most significant bit position, or from any parameter fixed by the user or contained in memory or previously calculated.);
the statistical information represents at least one saturation region included in an input-output characteristic of the activation function ([0036] By way of example, said method performs, moreover, a linear rectification on the datum present at the output of the saturation operation, said datum being the result of a saturation performed directly on said weighted sum, said linear rectification carrying out said activation function.; [0067] The statistical information represents at least one saturation region included in an input-output characteristic of the activation function value of the shift and/or of the saturation is controlled by a 5-bit word at the input, for the example of an accumulated 32-bit datum, indicating the shift to be performed between 0 and 31 bits, as well as a control word on 2 bits, allowing indication of whether a shift to the right, to the left, a saturation or a linear rectifier is involved. The shift, encoded on 5 bits, can be specified directly by the user via the “immediate” field of a control instruction or can come from the memory or from a backup register via the associated control instruction (value placed in memory or previously calculated), or can advantageously be deduced from the position of the MSB of the accumulated word, as contained on the MSB register 35, thus allowing saturation and/or direct normalization thereof to be performed in the next cycle.); and
the determining step determines the quantization range for the layer parameters related to the quantization target layer such that at least part of a distribution range of the layer parameters related to the quantization target layer is excluded from the quantization range for the layer parameters related to the quantization target layer ([0026] By way of example, said determining step determines the quantization range for the layer parameters related to the quantization target layer such that at least part of a distribution range of the layer parameters related to the quantization target layer is excluded from the quantization range for the layer parameters related to the quantization target layer position of the most significant bit is transmitted to the shift unit, the shift applied being deduced from said position, or from a previously saved most significant bit position or from any parameter fixed by the user or contained in memory or previously calculated.; [0027]
By way of example, said position of the most significant bit is transmitted to the saturation unit, the shift applied for the saturation operation being deduced from said position or from a previously saved most significant bit position, or from any parameter fixed by the user or contained in memory or previously calculated.),
the excluded part of the distribution range of the layer parameters related to the quantization target layer matching a majority part of at least one saturation region of the activation function ([0028] By way of example, said excluded part of the distribution range of the layer parameters related to the quantization target layer matching a majority part of at least one saturation region of the activation function activation unit is capable of approximating a radial basis function, from a sequence of shift operations performed by said shift unit as well as by means of saturations performed by said saturation unit. By way of example, the value of each shift is deduced from said position of the most significant bit of the shifted datum or from a previously saved bit position, or from any parameter fixed by the user or contained in memory or previously calculated).
Liu, Ha, Mathew, and Carbon are considered to be analogous to the claimed invention because they are in the same field of machine learning. In view of the teachings of Liu, Ha, and Mathew, it would have been obvious for a person of ordinary skill in the art to apply the teachings of Carbon to the combination of Liu, Ha, and Mathew before the effective filing date of the claimed invention in order to allow the effective implementation on silicon of deep layer neural networks, tolerant to a low precision of the manipulated data but not to the accumulation of errors, requiring numerous multiplications/accumulations and nonlinear functions, all of this with variable precision of the intermediate data (cf. Carbon, [0017] The aim of the invention is, in particular, to solve these problems. In other words, its aim is to allow the effective implementation on silicon of deep layer neural networks, tolerant to a low precision of the manipulated data but not to the accumulation of errors, requiring numerous multiplications/accumulations and nonlinear functions, all of this with variable precision of the intermediate data. This effectiveness results in particular in a small occupied silicon surface area, a low power consumption and in the use of standard logic cells (which are available in all conventional technologies).).
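By way of a purely illustrative aside on the claim 3 mapping, a Python sketch of excluding a subsequent activation's saturation region from the quantization range follows; the tanh activation and the ±3 saturation boundary are assumptions chosen for concreteness.

```python
import numpy as np

def range_excluding_saturation(target_params, sat_lo=-3.0, sat_hi=3.0):
    # Assumption: the subsequent activation (e.g., tanh) is effectively
    # saturated outside [sat_lo, sat_hi], so values there need not be
    # represented precisely by the quantization target layer.
    tgt_lo, tgt_hi = target_params.min(), target_params.max()
    # The excluded part of the target distribution matches (a majority
    # part of) the activation's saturation regions.
    return max(tgt_lo, sat_lo), min(tgt_hi, sat_hi)
```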
Regarding claim 5, Liu, as modified by Ha, Mathew, and Carbon, teaches The method of claim 3.
Carbon teaches wherein: the activation function has a non-linear function that has at least one non-saturation region in the input-output characteristic thereof ([0005] FIG. 1 illustrates the modeling of a formal neuron, typically used in a deep neural network, implementing this series of MACs followed by a nonlinear function. The jth formal neuron of a neuron layer forms a sum 11 of the input values 12, x1 to xn, weighted with synaptic weights 13, w1j to wnj, and finally applies an activation function has a non-linear function activation function 14, which is the nonlinear function.; [0022] In a particular embodiment, the has at least one non-saturation region in the input-output characteristic activation unit has a saturation unit capable of performing a saturation operation directly on said weighted sum or on a temporary result of the activation function before saturation, the saturated datum being delivered at the output of said activation unit.).
Liu, Ha, Mathew, and Carbon are combinable for the same rationale as set forth above with respect to claim 3.
Regarding claim 12, Liu, as modified by Ha and Mathew, teaches The apparatus of claim 10.
Liu, as modified by Ha and Mathew, fails to teach wherein: the reference layer is an activation layer located subsequent to the quantization target layer, the activation layer having an activation function, and being configured to apply the activation function to the layer parameters related to the quantization target layer; the statistical information represents at least one saturation region included in an input-output characteristic of the activation function; and the determiner is configured to determine the quantization range for the layer parameters related to the quantization target layer such that at least part of a distribution range of the layer parameters related to the quantization target layer is excluded from the quantization range for the layer parameters related to the quantization target layer, the excluded part of the distribution range of the layer parameters related to the quantization target layer matching a majority part of at least one saturation region of the activation function.
Carbon teaches wherein: the reference layer is an activation layer located subsequent to the quantization target layer, the activation layer having an activation function ([0004] The description of this processing chain shows that the principal operations are accumulations of multiplications (weighted sums) between convolution filter coefficients or synaptic weights and input data for the layers. This is the case, for example, for convolution layers of the network input (coefficients) or for “fully-connected” layers (synaptic weights). In this example, a the reference layer is an activation layer located subsequent to the quantization target layer “fully-connected” layer is made up of neurons whose inputs are connected to all the outputs of the previous layer. These operations may be called MAC below for multiplication and accumulation.; [0005] FIG. 1 illustrates the modeling of a formal neuron, typically used in a deep neural network, implementing this series of MACs followed by a nonlinear function. The jth formal neuron of a the activation layer having an activation function neuron layer forms a sum 11 of the input values 12, x1 to xn, weighted with synaptic weights 13, w1j to wnj, and finally applies an activation function 14, which is the nonlinear function.), and
being configured to apply the activation function to the layer parameters related to the quantization target layer ([0038] By way of example, the shift(s) applied in order to perform said activation function is or are deduced from said position or from a previously saved most significant bit position, or from any parameter fixed by the user or contained in memory or previously calculated.);
the statistical information represents at least one saturation region included in an input-output characteristic of the activation function ([0036] By way of example, said method performs, moreover, a linear rectification on the datum present at the output of the saturation operation, said datum being the result of a saturation performed directly on said weighted sum, said linear rectification carrying out said activation function.; [0067] The statistical information represents at least one saturation region included in an input-output characteristic of the activation function value of the shift and/or of the saturation is controlled by a 5-bit word at the input, for the example of an accumulated 32-bit datum, indicating the shift to be performed between 0 and 31 bits, as well as a control word on 2 bits, allowing indication of whether a shift to the right, to the left, a saturation or a linear rectifier is involved. The shift, encoded on 5 bits, can be specified directly by the user via the “immediate” field of a control instruction or can come from the memory or from a backup register via the associated control instruction (value placed in memory or previously calculated), or can advantageously be deduced from the position of the MSB of the accumulated word, as contained on the MSB register 35, thus allowing saturation and/or direct normalization thereof to be performed in the next cycle.); and
the determiner is configured to determine the quantization range for the layer parameters related to the quantization target layer such that at least part of a distribution range of the layer parameters related to the quantization target layer is excluded from the quantization range for the layer parameters related to the quantization target layer ([0026] By way of example, said the determiner is configured to determine the quantization range for the layer parameters related to the quantization target layer such that at least part of a distribution range of the layer parameters related to the quantization target layer is excluded from the quantization range for the layer parameters related to the quantization target layer position of the most significant bit is transmitted to the shift unit, the shift applied being deduced from said position, or from a previously saved most significant bit position or from any parameter fixed by the user or contained in memory or previously calculated.; [0027] By way of example, said position of the most significant bit is transmitted to the saturation unit, the shift applied for the saturation operation being deduced from said position or from a previously saved most significant bit position, or from any parameter fixed by the user or contained in memory or previously calculated.),
the excluded part of the distribution range of the layer parameters related to the quantization target layer matching a majority part of at least one saturation region of the activation function ([0028] By way of example, said excluded part of the distribution range of the layer parameters related to the quantization target layer matching a majority part of at least one saturation region of the activation function activation unit is capable of approximating a radial basis function, from a sequence of shift operations performed by said shift unit as well as by means of saturations performed by said saturation unit. By way of example, the value of each shift is deduced from said position of the most significant bit of the shifted datum or from a previously saved bit position, or from any parameter fixed by the user or contained in memory or previously calculated).
Liu, Ha, Mathew, and Carbon are combinable for the same rationale as set forth above with respect to claim 3.
Regarding claim 14, Liu, as modified by Ha, Mathew, and Carbon, teaches The apparatus of claim 12.
Carbon teaches wherein: the activation function has a non-linear function that has at least one non-saturation region in the input-output characteristic thereof ([0005] FIG. 1 illustrates the modeling of a formal neuron, typically used in a deep neural network, implementing this series of MACs followed by a nonlinear function. The jth formal neuron of a neuron layer forms a sum 11 of the input values 12, x1 to xn, weighted with synaptic weights 13, w1j to wnj, and finally applies an activation function has a non-linear function activation function 14, which is the nonlinear function.; [0022] In a particular embodiment, the has at least one non-saturation region in the input-output characteristic activation unit has a saturation unit capable of performing a saturation operation directly on said weighted sum or on a temporary result of the activation function before saturation, the saturated datum being delivered at the output of said activation unit.).
Liu, Ha, Mathew, and Carbon are combinable for the same rationale as set forth above with respect to claim 3.
Regarding claim 20 and analogous claim 22, Liu, as modified by Ha and Mathew, teaches The method of claim 1 and The apparatus of claim 10, respectively.
Mathew teaches exclusion of the at least part of the distribution range of the layer parameters related to the quantization target layer from the quantization range for the layer parameters related to the quantization target layer resulting in an elimination of an activation task based on the activation layer during neural-network inference ([0024] Inserting the PACT2 function serves two purposes. First, it exclusion of the at least part of the distribution range of the layer parameters related to the quantization target layer from the quantization range for the layer parameters related to the quantization target layer clips the activation feature map so that it is contained within a certain range. Second, the PACT2 clip function becomes part of the model graph (for example, the ONNX model; ONNX is an open format for machine learning (ML) models, allowing models to be interchanged between various ML frameworks and tools) and it is easy for the inference engine to understand what the clipping range is. Once these clipping values are known, the quantization scale factors can be derived easily.; [0028] PACT2u is a resulting in an elimination of an activation task based on the activation layer during neural-network inference replacement for ReLU activation function due to the unsigned nature of activations. There may be several places in a model where the activations need to be quantized, but there is no ReLU at that place. For example, this happens in the linear bottleneck of the MobileNetV2 models where the convolution+batch normalization has no ReLU following it. This also happens in the ResNet models before the element-wise addition of the residual block. PACT2s can be inserted in the model in those places. Essentially, a PACT2s may be inserted in a model wherever there is a need to quantize the feature map and the feature map is signed. Collectively, the signed and unsigned versions are referred to as “PACT2” herein.).
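As a non-limiting illustration of how clipping can eliminate a separate activation task at inference (cf. Mathew's unsigned PACT2 replacing ReLU), consider this hypothetical Python sketch:

```python
import numpy as np

def quantize_with_relu_folded(x, alpha, bits=8):
    # A single clip to [0, alpha] both bounds the quantization range and
    # realizes ReLU, so no separate activation op runs at inference.
    scale = alpha / (2 ** bits - 1)
    return np.clip(np.round(x / scale), 0, 2 ** bits - 1).astype(np.uint8)
```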
Liu, as modified by Ha and Mathew, fails to teach wherein: the reference layer is an activation layer located subsequent to the quantization target layer, the activation layer having an activation function, and being configured to apply the activation function to the layer parameters related to the target quantization target layer; the statistical information represents at least one saturation region included in an input-output characteristic of the activation function; and the determining step determines the quantization range for the layer parameters related to the quantization target layer such that at least part of a distribution range of the layer parameters related to the quantization target layer is excluded from the quantization range for the layer parameters related to the quantization target layer, the excluded part of the distribution range of the layer parameters related to the quantization target layer matching a majority part of at least one saturation region of the activation function.
Carbon teaches wherein: the reference layer is an activation layer located subsequent to the quantization target layer, the activation layer having an activation function ([0004] The description of this processing chain shows that the principal operations are accumulations of multiplications (weighted sums) between convolution filter coefficients or synaptic weights and input data for the layers. This is the case, for example, for convolution layers of the network input (coefficients) or for “fully-connected” layers (synaptic weights). In this example, a “fully-connected” layer [the reference layer is an activation layer located subsequent to the quantization target layer] is made up of neurons whose inputs are connected to all the outputs of the previous layer. These operations may be called MAC below for multiplication and accumulation.; [0005] FIG. 1 illustrates the modeling of a formal neuron, typically used in a deep neural network, implementing this series of MACs followed by a nonlinear function. The jth formal neuron of a neuron layer [the activation layer having an activation function] forms a sum 11 of the input values 12, x1 to xn, weighted with synaptic weights 13, w1j to wnj, and finally applies an activation function 14, which is the nonlinear function.), and
being configured to apply the activation function to the layer parameters related to the quantization target layer ([0038] By way of example, the shift(s) applied in order to perform said activation function is or are deduced from said position or from a previously saved most significant bit position, or from any parameter fixed by the user or contained in memory or previously calculated.);
the statistical information represents at least one saturation region included in an input-output characteristic of the activation function ([0036] By way of example, said method performs, moreover, a linear rectification on the datum present at the output of the saturation operation, said datum being the result of a saturation performed directly on said weighted sum, said linear rectification carrying out said activation function.; [0067] The value of the shift and/or of the saturation [the statistical information represents at least one saturation region included in an input-output characteristic of the activation function] is controlled by a 5-bit word at the input, for the example of an accumulated 32-bit datum, indicating the shift to be performed between 0 and 31 bits, as well as a control word on 2 bits, allowing indication of whether a shift to the right, to the left, a saturation or a linear rectifier is involved. The shift, encoded on 5 bits, can be specified directly by the user via the “immediate” field of a control instruction or can come from the memory or from a backup register via the associated control instruction (value placed in memory or previously calculated), or can advantageously be deduced from the position of the MSB of the accumulated word, as contained on the MSB register 35, thus allowing saturation and/or direct normalization thereof to be performed in the next cycle.); and
the determining step determines the quantization range for the layer parameters related to the quantization target layer such that at least part of a distribution range of the layer parameters related to the quantization target layer is excluded from the quantization range for the layer parameters related to the quantization target layer ([0026] By way of example, said position of the most significant bit [the determining step determines the quantization range for the layer parameters related to the quantization target layer such that at least part of a distribution range of the layer parameters related to the quantization target layer is excluded from the quantization range for the layer parameters related to the quantization target layer] is transmitted to the shift unit, the shift applied being deduced from said position, or from a previously saved most significant bit position or from any parameter fixed by the user or contained in memory or previously calculated.; [0027]
By way of example, said position of the most significant bit is transmitted to the saturation unit, the shift applied for the saturation operation being deduced from said position or from a previously saved most significant bit position, or from any parameter fixed by the user or contained in memory or previously calculated.),
the excluded part of the distribution range of the layer parameters related to the quantization target layer matching a majority part of at least one saturation region of the activation function ([0028] By way of example, said activation unit [the excluded part of the distribution range of the layer parameters related to the quantization target layer matching a majority part of at least one saturation region of the activation function] is capable of approximating a radial basis function, from a sequence of shift operations performed by said shift unit as well as by means of saturations performed by said saturation unit. By way of example, the value of each shift is deduced from said position of the most significant bit of the shifted datum or from a previously saved bit position, or from any parameter fixed by the user or contained in memory or previously calculated.).
Liu, Ha, Mathew, and Carbon are combinable for the same rationale as set forth above with respect to claim 3.
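By way of illustration only, the following is a minimal sketch (not taken from any applied reference; the helper range_from_next_activation and the hard-coded activation table are assumptions) of the claimed idea of determining a quantization range that excludes the part of a layer's distribution range falling in a saturation region of a subsequent activation layer:

```python
def range_from_next_activation(act_name, observed_min, observed_max):
    """Shrink a layer's quantization range using the saturation regions of
    the activation that follows it: values landing in a region where the
    activation's output is constant are excluded from the range, since
    they would be saturated downstream anyway."""
    if act_name == "relu":    # saturates for inputs below 0
        return max(0.0, observed_min), observed_max
    if act_name == "relu6":   # saturates below 0 and above 6
        return max(0.0, observed_min), min(6.0, observed_max)
    return observed_min, observed_max  # no known saturation region

# Outputs span [-4, 11], but a following ReLU6 saturates outside [0, 6],
# so only [0, 6] needs to be covered by the quantization range.
lo, hi = range_from_next_activation("relu6", -4.0, 11.0)
scale = (hi - lo) / 255.0  # 8-bit scale over the retained range
```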
Claims 4, 13 are rejected under 35 U.S.C. 103 as being unpatentable over Liu, Ha, and Mathew, in view of Carbon, and further in view of Liang et al. (U.S. Pre-Grant Publication No. 20230252757, hereinafter 'Liang').
Regarding claim 4, Liu, as modified by Ha, Mathew, and Carbon, teaches The method of claim 3.
Liu, as modified by Ha, Mathew, and Carbon, fails to teach wherein: the activation function has a linear function that has at least one non-saturation region in the input-output characteristic thereof.
Liang teaches wherein: the activation function has a linear function that has at least one non-saturation region in the input-output characteristic thereof ([0050] Specifically, current quantization techniques include linear mapping and non-linear mapping. In the present invention, the focus is on the shortcomings of the prior art in the linear mapping process [the activation function has a linear function] and the solutions provided in the present invention.; [0053] specifically, as in step S1, in consideration that the network model is quantified using the quantization threshold value calculated through the single saturated mapping method or the unsaturated mapping method in the prior art will reduce the accuracy of its image processing, the second quantization threshold value of the activations [has at least one non-saturation region in the input-output characteristic thereof] can be calculated using the unsaturated mapping method in this step, in order to serve as the data base for calculating the optimal quantization threshold value in the subsequent steps.).
Liu, Ha, Mathew, Carbon, and Liang are considered to be analogous to the claimed invention because they are in the same field of machine learning. In view of the teachings of Liu, Ha, Mathew, and Carbon, it would have been obvious for a person of ordinary skill in the art to apply the teachings of Liang to Liu before the effective filing date of the claimed invention in order to improve the accuracy of image processing and improve the accuracy of inference computations of the quantified deep neural network on low-bit-width hardware platforms (cf. Liang, [0006] The objective of the present invention is to provide an image processing method, to improve the accuracy of image processing; and another objective of the present invention provides an imaging processing device and apparatus, to improve the accuracy of image processing.; [0039] The present invention provides an image processing method, since in the present invention, the first quantization threshold value obtained by the saturated mapping method and the second quantization threshold value obtained by the unsaturated mapping method are weighted, it is equivalent to fusing two quantization threshold values. The obtained optimal quantization threshold value can be applied to most activations, to more effectively retain the effective information of the activations and use in subsequent image processing, thereby improving the accuracy of inference computations of the quantified deep neural network on low-bit-width hardware platforms.).
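By way of illustration only, the following is a minimal sketch (not taken from Liang; the percentile stand-in for the saturated mapping and the weight alpha are assumptions) of weighting a saturated and an unsaturated quantization threshold into a single fused threshold, as the quoted rationale describes:

```python
import numpy as np

def fused_threshold(activations, alpha=0.7, pct=99.9):
    """Fuse two quantization thresholds: the unsaturated mapping keeps the
    full range (max absolute value), while the saturated mapping clips
    outliers (a simple percentile stands in for a calibrated choice).
    A weighted average of the two yields the threshold actually used."""
    t_unsat = float(np.max(np.abs(activations)))
    t_sat = float(np.percentile(np.abs(activations), pct))
    return alpha * t_sat + (1.0 - alpha) * t_unsat

# Example: derive a signed 8-bit scale factor from the fused threshold
acts = np.random.randn(10_000).astype(np.float32)
t = fused_threshold(acts)
scale = t / 127.0
```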
Regarding claim 13, Liu, as modified by Ha, Mathew, and Carbon, teaches The apparatus of claim 12.
Liu, as modified by Ha, Mathew, and Carbon, fails to teach wherein: the activation function has a linear function that has at least one non-saturation region in the input-output characteristic thereof.
Liang teaches wherein: the activation function has a linear function that has at least one non-saturation region in the input-output characteristic thereof ([0050] Specifically, current quantization techniques include linear mapping and non-linear mapping. In the present invention, the focus is on the shortcomings of the prior art in the linear mapping process [the activation function has a linear function] and the solutions provided in the present invention.; [0053] specifically, as in step S1, in consideration that the network model is quantified using the quantization threshold value calculated through the single saturated mapping method or the unsaturated mapping method in the prior art will reduce the accuracy of its image processing, the second quantization threshold value of the activations [has at least one non-saturation region in the input-output characteristic thereof] can be calculated using the unsaturated mapping method in this step, in order to serve as the data base for calculating the optimal quantization threshold value in the subsequent steps.).
Liu, Ha, Mathew, Carbon, and Liang are combinable for the same rationale as set forth above with respect to claim 4.
Claims 21, 23 are rejected under 35 U.S.C. 103 as being unpatentable over Liu, Ha, and Mathew, and further in view of Cardinaux et al. (NPL: "Iteratively Training Look-Up Tables for Network Quantization", hereinafter 'Cardinaux').
Regarding claim 21 and analogous claim 23, Liu, as modified by Ha and Mathew, teaches The method of claim 1 and The apparatus of claim 10, respectively.
Liu, as modified by Ha and Mathew, fails to teach wherein the activation layer includes a lookup table, the activation layer performs, using the lookup table, an activation task and unequal-interval quantization on the layer parameters, each of which has the number of bits lowered by the quantizing, thus further reducing the number of bits of each of the layer parameters.
Cardinaux teaches wherein the activation layer includes a lookup table, the activation layer performs, using the lookup table, an activation task and unequal-interval quantization on the layer parameters, each of which has the number of bits lowered by the quantizing, thus further reducing the number of bits of each of the layer parameters ([I. INTRODUCTION, pg. 860] In this article we discuss a general framework for network reduction which we call Look-Up Table Quantization (LUT-Q). Primarily, LUT-Q is a non-uniform quantization method for DNNs, which uses learned dictionaries d ∈ R^K and lookup tables A ∈ {1,...,K}^(O×I) to represent the network weights W ∈ R^(O×I), i.e., we use W ∈ {X : [X]_oi = [d]_([A]_oi), d ∈ R^K, A ∈ {1,...,K}^(O×I)}. In this article, we show that LUT-Q is a very flexible tool which allows for an easy combination of non-uniform quantization with other reduction methods like pruning [each of which has the number of bits lowered by the quantizing, thus further reducing the number of bits of each of the layer parameters]. With LUT-Q, we can easily train networks with highly structured weight matrices W, by imposing constraints on the dictionary vector d or the assignment matrix A. For example, a dictionary vector d with K elements results in quantized weights which can be encoded with log2(K) + 32K bit. Alternatively, we can constrain the d to contain only the values {−1, 1} and obtain a Binary Connect Network [14], or to {−1, 0, 1}, resulting in a Ternary Weight Network [15]. This flexibility of our LUT-Q method allows us to use the same method to train networks for different hardware capabilities. Moreover, we show that LUT-Q benefits from optimized dictionary values, compared to other approaches which use predefined values (e.g. [14]–[17]).; [III. LOOK-UP TABLE QUANTIZATION NETWORKS, pg. 861] We consider training and inference of DNNs with LUT-Q layers [wherein the activation layer includes a lookup table], i.e., layers which compute Q = LUTQ(W) (1) and y = Φ(Qx + b) (2), where x ∈ R^I is the input vector, y ∈ R^O is the output vector, W ∈ R^(O×I) is the unquantized weight matrix, Q ∈ R^(O×I) is the quantized weight matrix, b ∈ R^O is the bias vector and Φ : R^O → R^O is the activation function of the layer. LUTQ : R^(O×I) → R^(O×I) is the look-up table quantization operation, which computes LUTQ(W) = lookup(d, A) (3), where lookup(d, A) is the table look-up operation that uses the elements of A to index into the dictionary d, i.e., [lookup(d, A)]_oi = [d]_([A]_oi) (4). At each forward pass, LUTQ(·) first computes an optimal dictionary d ∈ R^K and an assignment matrix A ∈ {1,...,K}^(O×I), which fit best to the current weight matrix W: (d, A) = argmin_(d,A) (1/2) ||W − lookup(d, A)||^2 (5). Then, the layer [the activation layer performs, using the lookup table, an activation task and unequal-interval quantization on the layer parameters] applies the lookup(d, A) to obtain the quantized representation Q of W and calculates the activation y.).
Liu, Ha, Mathew, and Cardinaux are considered to be analogous to the claimed invention because they are in the same field of machine learning. In view of the teachings of Liu, Ha, and Mathew, it would have been obvious for a person of ordinary skill in the art to apply the teachings of Cardinaux to Liu before the effective filing date of the claimed invention in order to implement non-uniform network quantization, training of multiplier less networks, network pruning, or simultaneous quantization and pruning without changing the solver (cf. Cardinaux, [Abstract, pg. 860] Operating deep neural networks (DNNs) on devices with limited resources requires the reduction of their memory as well as computational footprint. Popular reduction methods are network quantization or pruning, which either reduce the word length of the network parameters or remove weights from the network if they are not needed. In this article, we discuss a general framework for network reduction which we call Look-Up Table Quantization (LUT-Q). For each layer, we learn a value dictionary and an assignment matrix to represent the network weights. We propose a special solver which combines gradient descent and a one-step k-means update to learn both the value dictionaries and assignment matrices iteratively. This method is very flexible: by constraining the value dictionary, many different reduction problems such as non-uniform network quantization, training of multiplier less networks, network pruning, or simultaneous quantization and pruning can be implemented without changing the solver. This flexibility of the LUT-Q method allows us to use the same method to train networks for different hardware capabilities.).
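By way of illustration only, the following is a minimal sketch (not taken from Cardinaux; the initialization, iteration count, and function name lutq are assumptions) of the quoted LUT-Q scheme: a K-entry dictionary d and an assignment matrix A are fitted to the weights with k-means-style updates, and the layer then applies the table lookup before computing its activation:

```python
import numpy as np

def lutq(W, K=4, iters=10):
    """Fit a dictionary d (K values) and assignment matrix A to W with a
    few k-means steps, then return lookup(d, A), the quantized weights.
    The result is a non-uniform (unequal-interval) quantization of W."""
    d = np.quantile(W, np.linspace(0.1, 0.9, K))      # initial dictionary
    for _ in range(iters):
        A = np.abs(W[..., None] - d).argmin(axis=-1)  # assign each weight
        for k in range(K):                            # update dictionary
            if np.any(A == k):
                d[k] = W[A == k].mean()
    return d[A]  # lookup(d, A)

# Forward pass of a LUT-Q layer: y = phi(Q x + b)
W = np.random.randn(8, 16)
Q = lutq(W)
x, b = np.random.randn(16), np.zeros(8)
y = np.maximum(Q @ x + b, 0.0)  # ReLU as the activation
```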
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Ramachandran et al. (U.S. Pre-Grant Publication No. 20210406690) teaches systems, apparatuses, and methods for implementing one-sided per-kernel clipping and weight transformation for neural networks.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MAGGIE MAIDO whose telephone number is (703) 756-1953. The examiner can normally be reached M-Th: 6am - 4pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael Huntley can be reached on (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/MM/Examiner, Art Unit 2129
/MICHAEL J HUNTLEY/Supervisory Patent Examiner, Art Unit 2129