DETAILED ACTION
Claims 1, 4, 7, 9 and 14 have been amended.
No new claims have been added.
Claims 1-20 are pending.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Applicant’s arguments, see pgs. 7-11, filed 10/30/2025, with respect to claims 1-20 have been fully considered and are persuasive. The 101 rejection of 07/02/2025 has been withdrawn.
Applicant's arguments filed 10/30/2025 with respect to the rejections under 35 U.S.C. 103 have been fully considered but they are not persuasive.
In regard to the rejections under 35 U.S.C. 103 (see Applicant’s Remarks, pgs. 11-15), Applicant’s arguments with respect to claims 1-20 have been considered. Applicant argues that “…the applied references including Zivkovic and Nagel, whether taken individually or in combination, fail to teach the claimed subject matter that extracts a plurality of subsets of quantization points from a candidate set of quantization points and determine an optimal set of quantization points by calculating a quantization loss for each subset, thereby reducing the size of a neural network model and minimizing memory usage, as supported by at least paragraphs [0115] and [0135] of the specification. This adaptive process is critical to the claimed technical improvement in model size reduction and neural network operation performance ([0142] and [0153] of the specification).” The Examiner points out that the prior art Nagel is relied upon to teach the calculating of the quantization loss. As explained in paragraph 0026 of Nagel, the quantization loss is described as one of the five losses in Nagel's fixed-point quantized pipeline. Paragraph 0066 further explains the quantization loss by describing operations performed to reduce quantization errors.
Specifically:
(Nagel, paragraph 0026, “FIG. 1B illustrates that there are five types of loss in the fixed-point quantized pipeline, e.g., input quantization loss, weight quantization loss, runtime saturation loss, activation re-quantization loss, and possible clipping loss for certain non-linear operations. For these and other reasons, the quantization of large bit-width values into small bitwidth representations may introduce quantization noise on the weights and activations. That is, there may be a significant numerical difference between the parameters in a large bit-width model (e.g., float32, int32, etc.) and their small bit-width representations ( e.g., uint8 output value, etc.) [calculating, by a processor, a quantization loss, examiner interprets difference of the small-bit width to the large bit width as the loss being calculated]. This difference may have a significant negative impact on the efficiency, performance or functioning of the neural network on a small bit-width computing device.” And paragraph 0066, “The operations in the method 400 may be repeated layer by layer for all layers within the neural network (i.e., incrementing the index i with each repetition), performing cross-layer rescaling where necessary to equalize or normalize weights in any layers where necessary to remove outliers and reduce quantization errors [calculating]. The result of performing the operations in the method 400 may be a neural network that is suitable for implementation on a computing device configured with smaller bit-width capacity that was used to trained the neural network.”)
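For purposes of illustration only, the interpretation above (quantization loss as the numerical difference between the full-precision weights and their small bit-width representations) may be expressed as the following sketch; the code is hypothetical and is not drawn from any applied reference:

    import numpy as np

    def quantize_uniform(weights, bitwidth):
        # Hypothetical uniform quantizer: map float weights onto a 2**bitwidth-level grid.
        levels = 2 ** bitwidth - 1
        w_min, w_max = float(weights.min()), float(weights.max())
        scale = ((w_max - w_min) / levels) or 1.0
        return np.round((weights - w_min) / scale) * scale + w_min

    def quantization_loss(weights, bitwidth):
        # Loss interpreted as the numerical difference between the large bit-width
        # weights and their small bit-width representations (mean squared difference).
        return float(np.mean((weights - quantize_uniform(weights, bitwidth)) ** 2))

    weights = np.random.randn(64).astype(np.float32)  # hypothetical layer weights
    print(quantization_loss(weights, bitwidth=8))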
As for reducing the buffer memory size and improving efficiency, the newly applied prior art Li et al (An Efficient Multicast Router using Shared-Buffer with Packet Merging for Dataflow Architecture, “Li”) describes reducing the buffer size, which in turn allows the DPU to run more efficiently.
Specifically:
(Li, pg. 4, Col. 1, IV. MRSB, paragraph 1, “we first describe MRSB architecture that improves the efficient utilization of buffer resources in the router and then augment MRSB with packet merging that reduces buffer storage bandwidth waste. [reducing a buffer memory size used for storing weights,]” And pg. 8, Col. 1, Conclusion, paragraph 1, “For our experimental workloads, the performance of DPU using MRSB is 25.93% higher than DPU using a state-of-the-art router structure. [improving efficiency of the DPU that runs the neural network.]”)
Applicant also argues that the other references would fail to remedy the deficiencies of Zivkovic and Nagel to teach “extracting, by the processor, for the weight, a plurality of subsets of quantization points from the candidate set of quantization points based on the bitwidth; calculating, for each of the plurality of subsets of the weight, by the processor, a quantization loss based on the received weight and the subsets of quantization points; ... performing, by the processor, a neural network operation based on the received weight and the generated target subset that has the smallest quantization loss in the candidate set, thereby reducing a size of the neural network, reducing a buffer memory size used for storing weights, and improving efficiency of the DPU that runs the neural network," that are required of claim 1. The Examiner points out that the extracting of the plurality of subsets from the candidate set of quantization points is mapped to Zivkovic. Zivkovic teaches finding a subset of possible quantization points based on the bit width of the accumulator.
Specifically:
(Zivkovic, Col. 12, paragraph 7, “Example 10 includes the method of any one of examples 6 to 9, including or excluding optional features. In this example, the quantization point is selected via a search procedure that is limited to a subset of possible quantization points, wherein the subset of possible quantization points are constrained by the accumulator bit width. [extracting a plurality of subsets of quantization points from the candidate set of quantization points based on the bitwidth;]”)
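For illustration only, extracting subsets of quantization points from a candidate set, with the subset size constrained by a bitwidth, may be sketched as follows (hypothetical code, not taken from Zivkovic):

    from itertools import combinations

    def extract_subsets(candidate_points, bitwidth):
        # A bitwidth of b allows at most 2**b distinct quantization points per subset,
        # so enumerate every subset of that size from the candidate set.
        subset_size = 2 ** bitwidth
        return list(combinations(sorted(candidate_points), subset_size))

    candidates = [-1.0, -0.5, -0.25, 0.0, 0.25, 0.5, 1.0, 2.0]  # hypothetical candidate set
    subsets = extract_subsets(candidates, bitwidth=2)            # subsets of 4 points each
    print(len(subsets), subsets[0])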
The calculating of the quantization loss is taught by Nagel. Nagel discusses the types of quantization losses that can arise and then further details the calculation of quantization errors.
Specifically:
(Nagel, paragraph 0026, “FIG. 1B illustrates that there are five types of loss in the fixed-point quantized pipeline, e.g., input quantization loss, weight quantization loss, runtime saturation loss, activation re-quantization loss, and possible clipping loss for certain non-linear operations. For these and other reasons, the quantization of large bit-width values into small bitwidth representations may introduce quantization noise on the weights and activations. That is, there may be a significant numerical difference between the parameters in a large bit-width model (e.g., float32, int32, etc.) and their small bit-width representations ( e.g., uint8 output value, etc.) [calculating, by a processor, a quantization loss, examiner interprets difference of the small-bit width to the large bit width as the loss being calculated]. This difference may have a significant negative impact on the efficiency, performance or functioning of the neural network on a small bit-width computing device.” And paragraph 0066, “The operations in the method 400 may be repeated layer by layer for all layers within the neural network (i.e., incrementing the index i with each repetition), performing cross-layer rescaling where necessary to equalize or normalize weights in any layers where necessary to remove outliers and reduce quantization errors [calculating]. The result of performing the operations in the method 400 may be a neural network that is suitable for implementation on a computing device configured with smaller bit-width capacity that was used to trained the neural network.”)
The performing, by the processor, a neural network operation based on the received weight and the generated target subset that has the smallest quantization loss in the candidate set, thereby reducing a size of the neural network, is taught by Nagel. The Examiner points out that Nagel teaches minimizing the error metric due to quantization, which is interpreted as the smallest quantization loss in a neural network operation.
Specifically:
(Nagel, paragraph 0067, “FIG. 4B illustrates an additional operation neural network quantization [performing, by the processor, a neural network operation based on] method 410 that may be performed in some embodiments as part of performing cross layer rescaling in method 400 to improve quantization in accordance with some embodiments. In the method 400, the corresponding scaling factors are determined so that ranges of weight tensors and channel weights [received weight] within the i'th layer of the neural network may be equalized and outliers removed by scaling from the i'th layer to the adjacent (i.e., i+l) layer.” and paragraph 0068, “In block 412, the processor may determine the corresponding scaling factor used for scaling the i'th layer channel weights in block 402 so as to equalize the ranges within the weight tensor the i'th layer. In some embodiments, the corresponding scaling factor may be determined based on heuristics, equalization of dynamic ranges, equalization of range extrema (minima or maxima), differential learning using STE methods and a local or global loss, or by using a metric for the quantization error and a black box optimizer that minimizes the error metric due to quantization [the generated target subset that has the smallest quantization loss in the candidate set thereby reducing a size of the neural network.].”)
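For illustration only, generating the target subset as the one with the smallest quantization loss among the extracted subsets may be sketched as follows (hypothetical code, not taken from Nagel):

    import numpy as np
    from itertools import combinations

    weights = np.random.randn(32).astype(np.float32)        # hypothetical received weights
    candidates = [-1.0, -0.5, -0.25, 0.0, 0.25, 0.5, 1.0]    # hypothetical candidate set
    subsets = list(combinations(candidates, 4))              # bitwidth of 2 -> 4 points each

    def quantize_to_points(w, points):
        # Snap each weight to its nearest quantization point in the subset.
        pts = np.asarray(points)
        return pts[np.abs(w[:, None] - pts[None, :]).argmin(axis=1)]

    losses = [float(np.mean((weights - quantize_to_points(weights, s)) ** 2)) for s in subsets]
    target_subset = subsets[int(np.argmin(losses))]          # subset with the smallest loss
    print(target_subset, min(losses))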
Lastly, Li teaches reducing a buffer memory size used for storing weights, and improving efficiency of the DPU that runs the neural network. The Examiner points out that Li teaches reducing buffer storage to reduce bandwidth waste, which in turn improves the efficiency of the DPU.
Specifically:
(Li, pg. 4, Col. 1, IV. MRSB, paragraph 1, “we first describe MRSB architecture that improves the efficient utilization of buffer resources in the router and then augment MRSB with packet merging that reduces buffer storage bandwidth waste. [reducing a buffer memory size used for storing weights,]” And pg. 8, Col. 1, Conclusion, paragraph 1, “For our experimental workloads, the performance of DPU using MRSB is 25.93% higher than DPU using a state-of-the-art router structure. [improving efficiency of the DPU that runs the neural network.]” The Examiner notes that the target subset with the smallest quantization loss is mapped to Nagel.)
Therefore the 35 USC 103 rejection is maintained.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(d):
(d) REFERENCE IN DEPENDENT FORMS.—Subject to subsection (e), a claim in dependent form shall contain a reference to a claim previously set forth and then specify a further limitation of the subject matter claimed. A claim in dependent form shall be construed to incorporate by reference all the limitations of the claim to which it refers.
The following is a quotation of pre-AIA 35 U.S.C. 112, fourth paragraph:
Subject to the following paragraph [i.e., the fifth paragraph of pre-AIA 35 U.S.C. 112], a claim in dependent form shall contain a reference to a claim previously set forth and then specify a further limitation of the subject matter claimed. A claim in dependent form shall be construed to incorporate by reference all the limitations of the claim to which it refers.
Claim 13 is rejected under 35 U.S.C. 112(d) or pre-AIA 35 U.S.C. 112, 4th paragraph, as being of improper dependent form for failing to further limit the subject matter of the claim upon which it depends, or for failing to include all the limitations of the claim upon which it depends. Claim 13 does not further limit claim 9 because claim 9 already states that the target quantization point is shared between multiply-accumulate (MAC) operators. Applicant may cancel the claim(s), amend the claim(s) to place the claim(s) in proper dependent form, rewrite the claim(s) in independent form, or present a sufficient showing that the dependent claim(s) complies with the statutory requirements.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or non-obviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1, 4-5, 7-8, 17-18 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Zivkovic et al (US Patent No. 9916531, "Zivkovic") [March 13, 2018], in view of Nagel et al (US Published Patent Application No. 20200302299, "Nagel") [Sept. 24, 2020] and in further view of Li et al (An Efficient Multicast Router using Shared-Buffer with Packet Merging for Dataflow Architecture, “Li”)[Sept. 25, 2020].
In regard to claim 1, Zivkovic teaches A processor-implemented neural network operation method using a deep learning processor unit (DPU), the method, comprising: (Zivkovic, pg. 9, Col. 3, paragraph 6, lines 64-69, “However, the accumulator 128 or a plurality of accumulators are often implemented as internal registers of the CPU or GPU. The accumulator 128 or a plurality of accumulators may also be a component of an internal register of a CNN accelerator, a digital signal processor (DSP), or the CNN controller 118.”)
receiving, by a receiver connected to a processor, a weight of a neural network, a candidate set of quantization points, and a bitwidth that represents the received weight; (Zivkovic, Col. 1, paragraph 1, “The neural network may be divided into layers, and each layer may include a plurality of nodes. Each node may function similarly to a neuron. In examples, a particular input value may be applied to any number of the nodes of the network. Each input value of each node is given a weight, [receiving a weight of a neural network,] and each node generates an output value that is a function of the sum of its weighted input values.” And Col. 5, paragraph 4, “Moreover, the present techniques introduce a number of constraints that will limit the number of quantization points to consider [a candidate set of quantization points]. Often, searching for an optimal quantization point is slow. A conventional quantization procedures often checks a large number of possible quantization options.” And Col. 7, paragraph 2, “To use a narrow accumulator, as described above a quantization for the network parameters is selected. The network parameter selection may begin with a minimal meaningful value of 1 bit. Often, a higher value may be used, such as 2 or 3 bits, as 1 bit might hold too little information for quantization. In examples, the starting point for parameter quantization is a number between 1 bit and accumulator bit width. [a bitwidth that represents the received weight]”)
extracting, by the processor, for the weight a plurality of subsets of quantization points from the candidate set of quantization points based on the bitwidth; (Zivkovic, Col. 12, paragraph 7, “Example 10 includes the method of any one of examples 6 to 9, including or excluding optional features. In this example, the quantization point is selected via a search procedure that is limited to a subset of possible quantization points, wherein the subset of possible quantization points are constrained by the accumulator bit width. [extracting a plurality of subsets of quantization points from the candidate set of quantization points based on the bitwidth;]”)
However, Zivkovic does not explicitly teach calculating, for each of the plurality of subsets of the weight, by the processor, a quantization loss based on the received weight and the subset of quantization points;
generating, by the processor, a target subset of quantization points based on the calculated quantization losses; and
performing, by the processor, a neural network operation based on the received weight and the generated target subset that has the smallest quantization loss in the candidate set thereby reducing a size of the neural network, reducing a buffer memory size used for storing weights, and improving efficiency of the DPU that runs the neural network.
Nagel teaches calculating, for each of the plurality of subsets of the weight, by the processor, a quantization loss based on the received weight and the subset of quantization points; (Nagel, paragraph 0026, “FIG. 1B illustrates that there are five types of loss in the fixed-point quantized pipeline, e.g., input quantization loss, weight quantization loss, runtime saturation loss, activation re-quantization loss, and possible clipping loss for certain non-linear operations. For these and other reasons, the quantization of large bit-width values into small bitwidth representations may introduce quantization noise on the weights and activations. That is, there may be a significant numerical difference between the parameters in a large bit-width model (e.g., float32, int32, etc.) and their small bit-width representations ( e.g., uint8 output value, etc.) [calculating, by a processor, a quantization loss, examiner interprets difference of the small-bit width to the large bit width as the loss being calculated]. This difference may have a significant negative impact on the efficiency, performance or functioning of the neural network on a small bit-width computing device.” And paragraph 0066, “The operations in the method 400 may be repeated layer by layer for all layers within the neural network (i.e., incrementing the index i with each repetition), performing cross-layer rescaling where necessary to equalize or normalize weights in any layers where necessary to remove outliers and reduce quantization errors [calculating]. The result of performing the operations in the method 400 may be a neural network that is suitable for implementation on a computing device configured with smaller bit-width capacity that was used to trained the neural network.”)
generating, by the processor, a target subset of quantization points based on the calculated quantization losses; and (Nagel, paragraph 0026, “FIG. 1B illustrates that there are five types of loss in the fixed-point quantized pipeline, e.g., input quantization loss, weight quantization loss, runtime saturation loss, activation re-quantization loss, and possible clipping loss for certain non-linear operations.”)
performing, by the processor, a neural network operation based on the received weight and the generated target subset that has the smallest quantization loss in the candidate set thereby reducing a size of the neural network, (Nagel, paragraph 0067, “FIG. 4B illustrates an additional operation neural network quantization [performing, by the processor, a neural network operation based on] method 410 that may be performed in some embodiments as part of performing cross layer rescaling in method 400 to improve quantization in accordance with some embodiments. In the method 400, the corresponding scaling factors are determined so that ranges of weight tensors and channel weights [received weight] within the i'th layer of the neural network may be equalized and outliers removed by scaling from the i'th layer to the adjacent (i.e., i+l) layer.” and paragraph 0068, “In block 412, the processor may determine the corresponding scaling factor used for scaling the i'th layer channel weights in block 402 so as to equalize the ranges within the weight tensor the i'th layer. In some embodiments, the corresponding scaling factor may be determined based on heuristics, equalization of dynamic ranges, equalization of range extrema (minima or maxima), differential learning using STE methods and a local or global loss, or by using a metric for the quantization error and a black box optimizer that minimizes the error metric due to quantization [the generated target subset that has the smallest quantization loss in the candidate set thereby reducing a size of the neural network.].”)
Zivkovic and Nagel are related to the same field of endeavor (i.e. quantization). In view of the teachings of Nagel, it would have been obvious for a person with ordinary skill in the art to apply the teachings of Nagel to Zivkovic before the effective filing date of the claimed invention in order to improve performance and accuracy of models. (Nagel, paragraph 0036, “For all the reasons discussed above, a computing device configured to perform cross layer rescaling operations in accordance with various embodiments may enhance or improve the performance, accuracy, and precision of the quantized models, and in turn, reduce the computational complexity associated with neural networks.”)
However, Zivkovic and Nagel do not explicitly teach reducing a buffer memory size used for storing weights, and improving efficiency of the DPU that runs the neural network.
Li teaches reducing a buffer memory size used for storing weights, and improving efficiency of the DPU that runs the neural network. (Li, pg. 4, Col. 1, IV. MRSB, paragraph 1, “we first describe MRSB architecture that improves the efficient utilization of buffer resources in the router and then augment MRSB with packet merging that reduces buffer storage bandwidth waste. [reducing a buffer memory size used for storing weights,]” And pg. 8, Col. 1, Conclusion, paragraph 1, “For our experimental workloads, the performance of DPU using MRSB is 25.93% higher than DPU using a state-of-the-art router structure. [improving efficiency of the DPU that runs the neural network.]” The Examiner notes that the target subset with the smallest quantization loss is mapped to Nagel.)
Zivkovic, Nagel and Li are related to the same field of endeavor (i.e. neural networks). In view of the teachings of Li, it would have been obvious for a person with ordinary skill in the art to apply the teachings of Li to Zivkovic and Nagel before the effective filing date of the claimed invention in order to have a more effective buffer memory. (Li, abstract, “For our experimental workloads, experimental results show that MRSB is 221.48% higher effective buffer utilization and 32.98% less latency than a state-of-the-art router with 31.39% smaller area and 29.14% lower power.”)
In regard to claim 4 and analogous claim 17, Zivkovic, Nagel and Li teach the method of claim 1.
Zivkovic further teaches wherein the extracting of the plurality subset of quantization points comprises: determining a number of elements for each subset based on the bitwidth; and (Zivkovic, Col. 12, paragraph 7, “Example 10 includes the method of any one of examples 6 to 9, including or excluding optional features. In this example, the quantization point is selected via a search procedure that is limited to a subset of possible quantization points, wherein the subset of possible quantization points are constrained by the accumulator bit width.”)
extracting a subset corresponding to the number of elements from the candidate set of quantization points. (Zivkovic, Col. 12, paragraph 7, “Example 10 includes the method of any one of examples 6 to 9, including or excluding optional features. In this example, the quantization point is selected via a search procedure that is limited to a subset of possible quantization points, wherein the subset of possible quantization points are constrained by the accumulator bit width.”)
In regard to claim 5 and analogous claim 18, Zivkovic, Nagel and Li teach the method of claim 1.
Nagel further teaches wherein the calculating of the quantization loss comprises calculating the quantization loss based on the received weight of the neural network and a weight quantized by the quantization points included in the extracted subset of quantization points. (Nagel, paragraph 0026, “For these and other reasons, the quantization of large bit-width values into small bitwidth representations may introduce quantization noise on the weights and activations. That is, there may be a significant numerical difference between the parameters in a large bit-width model (e.g., float32, int32, etc.) and their small bit-width representations [the quantization loss based on the received weight of the neural network and a weight quantized by the quantization points included in the extracted subset of quantization points] ( e.g., uint8 output value, etc.). This difference may have a significant negative impact on the efficiency, performance or functioning of the neural network on a small bit-width computing device.”)
Zivkovic and Nagel are combinable for the same rationale as set forth above with respect to claim 1.
In regard to claim 7 and analogous claim 20, Zivkovic, Nagel and Li teach the method of claim 1.
Nagel further teaches wherein the generating of the target subset of quantization points comprises determining the target subset to be one of plurality of subsets, generated by the extracting, of quantization points, which minimizes the quantization loss. (Nagel, paragraph 0066, “The operations in the method 400 may be repeated layer by layer for all layers within the neural network (i.e., incrementing the index i with each repetition), performing cross-layer rescaling where necessary to equalize or normalize weights in any layers where necessary to remove outliers [determining the target subset to be one of the plurality of subsets] and reduce quantization errors [generated by the extracting, of quantization points, which minimizes the quantization loss.]. The result of performing the operations in the method 400 may be a neural network that is suitable for implementation on a computing device configured with smaller bit-width capacity that was used to trained the neural network.”))
Zivkovic and Nagel are combinable for the same rationale as set forth above with respect to claim 1.
In regard to claim 8 Zivkovic, Nagel and Li teach the method of claim 1.
Zivkovic further teaches A non-transitory computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to perform the neural network operation method of claim 1. (Zivkovic, Col. 11, paragraph 2, “The medium 600 may be a computer-readable medium, including a non-transitory medium that stores code that can be accessed by a processor 602 over a computer bus 604.”)
Claims 2-3 and 15-16 are rejected under 35 U.S.C. 103 as being unpatentable over Zivkovic, in view of Nagel and Li, in further view of Cai et al (A Deep Look into Logarithmic Quantization of Model Parameters in Neural Networks, "Cai").
In regard to claim 2 and analogous claim 15, Zivkovic, Nagel and Li teach the method of claim 1.
However, Zivkovic, Nagel and Li do not explicitly teach generating the candidate set of quantization points based on log-scale quantization.
Cai teaches generating the candidate set of quantization points based on log-scale quantization. (Cai, pg. 2, “logarithmic quantization algorithm takes logarithms of the original weights and saves only exponents. Besides, to constrain the exponents within a smaller bitwidth, the decimal parts of the exponents are removed. As a trade-off, quantized weights incur relatively large quantization noises.”)
Zivkovic, Nagel, Li and Cai are related to the same field of endeavor (i.e. quantization). In view of the teachings of Cai, it would have been obvious for a person with ordinary skill in the art to apply the teachings of Cai to Zivkovic, Nagel and Li before the effective filing date of the claimed invention in order to achieve minimum loss of accuracy. (Cai, Abstract, “As the result, our method achieves the minimum accuracy loss on GoogLeNet after direct quantization compared to quantized counterparts.”)
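For illustration only, log-scale quantization in the sense described by Cai (taking logarithms of the original weights and keeping only exponents within a small bitwidth) may be sketched as follows (hypothetical code, not taken from Cai):

    import numpy as np

    def log_quantize(weights, bitwidth):
        # Keep only the rounded, clipped integer exponent of each weight's magnitude
        # and reconstruct the weight as a signed power of two.
        sign = np.sign(weights)
        magnitude = np.maximum(np.abs(weights), 1e-12)        # avoid log of zero
        exponent = np.clip(np.round(np.log2(magnitude)),
                           -(2 ** (bitwidth - 1)), 2 ** (bitwidth - 1) - 1)
        return sign * (2.0 ** exponent)

    w = np.array([0.3, -0.07, 1.9], dtype=np.float32)          # hypothetical weights
    print(log_quantize(w, bitwidth=4))                         # quantized points are powers of two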
In regard to claim 3 and analogous claim 16, Zivkovic, Nagel, Li and Cai teach the method of claim 2.
Zivkovic further teaches generating the candidate set of quantization points based on a sum of the first quantization point and the second quantization point. (Zivkovic, Col. 6, “For an exemplary network with 8-bit data, 8-bit parameters, and a large N=5x5x32 kernel, approximately 26 bits are needed for the accumulator in the worst case scenario (8+8+log2(5x5x32)). Thus, in a worst case scenario, the accumulator bit width is greater than or equal to the sum of the parameter bit width, data bit width, and the binary logarithm of the kernel. [the candidate set of quantization points based on a sum of the first quantization point and the second quantization point.]”)
Cai further teaches obtaining a first quantization point based on the log-scale quantization; (Cai, pg. 3, Col. 1, 3.1, “Weights of trained neural networks with large complexity are usually highly concentrated around zero. This non-uniformity motivates Miyashita et al. to replace original parameters using logarithmic data representation since logarithmic quantized weights have similar trend. We restate Miyashita et al.’s algorithm as the following and a graphical view of logarithmic quantization processes is given in figure 3. Their algorithm (details in [20]) will be used in section 4 for performance comparison.”)
obtaining a second quantization point based on the log-scale quantization; and (Cai, pg. 3, Col. 1, 3.1, “Weights of trained neural networks with large complexity are usually highly concentrated around zero. This non-uniformity motivates Miyashita et al. to replace original parameters using logarithmic data representation since logarithmic quantized weights have similar trend. We restate Miyashita et al.’s algorithm as the following and a graphical view of logarithmic quantization processes is given in figure 3. Their algorithm (details in [20]) will be used in section 4 for performance comparison.”)
Zivkovic, Nagel, Li and Cai are combinable for the same rationale as set forth above with respect to claim 2.
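For illustration only, generating a candidate set from the sum of a first and a second log-scale quantization point (each a signed power of two) may be sketched as follows (hypothetical code, not taken from Zivkovic or Cai):

    def power_of_two_points(bitwidth):
        # Hypothetical log-scale quantization points: signed powers of two plus zero.
        exponents = range(-(2 ** (bitwidth - 1)), 0)
        points = [s * 2.0 ** e for e in exponents for s in (-1.0, 1.0)]
        return sorted(set(points) | {0.0})

    first = power_of_two_points(3)    # first quantization points
    second = power_of_two_points(3)   # second quantization points
    candidate_set = sorted({p1 + p2 for p1 in first for p2 in second})
    print(len(candidate_set), candidate_set[:5])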
Claims 6 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Zivkovic, in view of Nagel and Li, and in further view of Choi et al (Towards the Limit of Network Quantization, "Choi").
In regard to claim 6 and analogous claim 19, Zivkovic, Nagel and Li teach the method of claim 5.
However, Zivkovic, Nagel and Li do not explicitly teach wherein the calculating of the quantization loss based on the received weight of the neural network and the weight quantized by the quantization points included in the extracted subset of quantization points comprises calculating an L2 loss or an L4 loss for a difference between the received weight of the neural network and the quantized weight as the quantization loss.
Choi teaches wherein the calculating of the quantization loss based on the received weight of the neural network and the weight quantized by the quantization points included in the extracted subset of quantization points comprises calculating an L2 loss or an L4 loss for a difference between the received weight of the neural network and the quantized weight as the quantization loss. (Choi, pg. 3, 2.2, “Provided network parameters {w_i}_{i=1}^{N} to quantize, k-means clustering partitions them into k disjoint sets (clusters), denoted by C1, C2, . . . , Ck, while minimizing the mean square quantization error (MSQE) [calculating an L2 loss]…First, although k-means clustering minimizes the MSQE, it does not imply that k-means clustering minimizes the performance loss due to quantization as well in neural networks. K-means clustering treats quantization errors from all network parameters with equal importance. [a difference between the received weight of the neural network and the quantized weight as the quantization loss.]”)
Zivkovic, Nagel, Li and Choi are related to the same field of endeavor (i.e. quantization). In view of the teachings of Choi, it would have been obvious for a person with ordinary skill in the art to apply the teachings of Choi to Zivkovic, Nagel and Li before the effective filing date of the claimed invention in order to minimize performance loss. (Choi, Abstract, “In this paper, we design network quantization schemes that minimize the performance loss due to quantization given a compression ratio constraint.”)
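For illustration only, the L2 (mean squared) and L4 losses for the difference between the received weight and the quantized weight may be computed as in the following sketch (hypothetical code, not taken from Choi):

    import numpy as np

    def l2_loss(w, w_q):
        # Mean squared (L2) quantization error.
        return float(np.mean((w - w_q) ** 2))

    def l4_loss(w, w_q):
        # Fourth-power (L4) error; penalizes large outlier errors more heavily.
        return float(np.mean((w - w_q) ** 4))

    w = np.random.randn(128).astype(np.float32)   # hypothetical received weights
    w_q = np.round(w * 4) / 4                     # hypothetical weights quantized to a 0.25 grid
    print(l2_loss(w, w_q), l4_loss(w, w_q))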
Claims 9-11 and 12-13 are rejected under 35 U.S.C. 103 as being unpatentable over Zivkovic, in view of Nagel and Li, in further view of Lin et al (A Survey on Energy-Efficient Strategies in Static Wireless Sensor Networks, "Lin") and Xie et al (Overflow Aware Quantization: Accelerating Neural Network Inference by Low-bit Multiply-Accumulate Operations, "Xie").
In regard to claim 9, Zivkovic teaches a memory, configured to store a weight of a neural network and a target subset of quantization points extracted from a candidate set of quantization points to quantize the weight of the neural network; (Zivkovic, Col. 1, paragraph 1, “Each input value of each node is given a weight, and each node generates an output value that is a function of the sum of its weighted input values [a weight of a neural network]. The weight that is assigned to a particular input value is determined by a data transfer function, which may be constant or may vary over time.” And Col. 12, paragraph 12, “memory that is to store instructions and that is communicatively coupled to the accumulator; and a processor communicatively coupled to the accumulator and the memory, wherein when the processor is to execute the instructions, the processor is to: determine a parameter quantization and a data quantization using the predicted bit-width;”)
a shifter included in the processor, configured to perform a multiplication operation based on the target quantization point; and (Zivkovic, Col. 12, paragraph 1, “Optionally, a large dataset is used to determine the maximum accumulator value. Optionally, the predicted bit with is a fixed point representation where bits are shifted after each multiplication to ensure fixed number of fractional bits.”)
an accumulator included in the processor, configured to accumulate an output of the shifter. (Zivkovic, Col. 12, paragraph 1, “Optionally, a large dataset is used to determine the maximum accumulator value.”)
However, Zivkovic does not explicitly teach a decoder, configured to select a target quantization point from the target subset of quantization points based on the weight of the neural network;
wherein the target quantization point of the target subset is shared across multiply- accumulate (MAC) operators of the neural network,
thereby reducing a size of the neural network, reducing a buffer memory size used for storing weights, enabling the apparatus to operation with a reduced size of the neural network compared to a non-optimized quantization scheme.
Nagel teaches enabling the apparatus to operation with a reduced size of the neural network compared to a non-optimized quantization scheme. (Nagel, paragraph 0023, “Neural network quantization techniques may be used to reduce size, memory access, and computation requirements of neural network inference by using small bit-width values (e.g., INTS values) in the weights and activations of a neural network model.“ and paragraph 0067, “FIG. 4B illustrates an additional operation neural network quantization method 410 that may be performed in some embodiments as part of performing cross layer rescaling in method 400 to improve quantization in accordance with some embodiments. In the method 400, the corresponding scaling factors are determined so that ranges of weight tensors and channel weights [received weight] within the i'th layer of the neural network may be equalized and outliers removed by scaling from the i'th layer to the adjacent (i.e., i+l) layer.” and paragraph 0068, “In block 412, the processor may determine the corresponding scaling factor used for scaling the i'th layer channel weights in block 402 so as to equalize the ranges within the weight tensor the i'th layer. In some embodiments, the corresponding scaling factor may be determined based on heuristics, equalization of dynamic ranges, equalization of range extrema (minima or maxima), differential learning using STE methods and a local or global loss, or by using a metric for the quantization error and a black box optimizer that minimizes the error metric due to quantization [a reduced size of the neural network compared to a non-optimized quantization scheme.].”)
Zivkovic and Nagel are related to the same field of endeavor (i.e. quantization). In view of the teachings of Nagel, it would have been obvious for a person with ordinary skill in the art to apply the teachings of Nagel to Zivkovic before the effective filing date of the claimed invention in order to improve performance and accuracy of models. (Nagel, paragraph 0036, “For all the reasons discussed above, a computing device configured to perform cross layer rescaling operations in accordance with various embodiments may enhance or improve the performance, accuracy, and precision of the quantized models, and in turn, reduce the computational complexity associated with neural networks.”)
However, Zivkovic and Nagel do not explicitly teach thereby reducing a size of the neural network, reducing a buffer memory size used for storing weights,
a decoder, configured to select a target quantization point from the target subset of quantization points based on the weight of the neural network;
wherein the target quantization point of the target subset is shared across multiply- accumulate (MAC) operators of the neural network,
Li teaches thereby reducing a size of the neural network, reducing a buffer memory size used for storing weights, (Li, pg. 4, Col. 1, IV. MRSB, paragraph 1, “we first describe MRSB architecture that improves the efficient utilization of buffer resources in the router and then augment MRSB with packet merging that reduces buffer storage bandwidth waste. [reducing a buffer memory size used for storing weights,]” And pg. 8, Col. 1, Conclusion, paragraph 1, “For our experimental workloads, the performance of DPU using MRSB is 25.93% higher than DPU using a state-of-the-art router structure. “)
Zivkovic, Nagel and Li are related to the same field of endeavor (i.e. neural networks). In view of the teachings of Li, it would have been obvious for a person with ordinary skill in the art to apply the teachings of Li to Zivkovic and Nagel before the effective filing date of the claimed invention in order to have a more effective buffer memory. (Li, abstract, “For our experimental workloads, experimental results show that MRSB is 221.48% higher effective buffer utilization and 32.98% less latency than a state-of-the-art router with 31.39% smaller area and 29.14% lower power.”)
However, Zivkovic, Nagel and Li do not explicitly teach a decoder, configured to select a target quantization point from the target subset of quantization points based on the weight of the neural network;
wherein the target quantization point of the target subset is shared across multiply- accumulate (MAC) operators of the neural network,
Lin teaches a decoder, configured to select a target quantization point from the target subset of quantization points based on the weight of the neural network; (Lin, 3:7, 2.3.1, “The physical layer provides the support for the process of data sampling and quantizing, bit stream transmitting, receiving, and the relevant signals decoding [33] [a decoder].” And 3:17, paragraph 2, “In some other scenarios, a dedicated Relay Charging Point can be deployed in the network topology, and therefore the mobile relay is able to be replenished periodically at the Relay Charging Point during data forwarding, such as the Recharge Weighed Target Points Patching algorithm [to select a target quantization point from the target subset of quantization points based on the weight of the neural network] (RW-TPP) [109, 110]. For all of the preceding scenarios, the relay moved around periodically to relieve the energy consumption burden on the nodes in the Hot Spot Area.”)
Zivkovic, Nagel, Li and Lin are related to the same field of endeavor (i.e. machine learning optimization). In view of the teachings of Lin, it would have been obvious for a person with ordinary skill in the art to apply the teachings of Lin to Zivkovic, Nagel and Li before the effective filing date of the claimed invention in order to control the error rate. (Lin, pg. 3:7, 2.3.1, “It is mainly responsible for encapsulating data into frames, controlling the error rate during the process of data frame transmission, and MAC.”)
However, Zivkovic, Nagel, Li and Lin do not explicitly teach wherein the target quantization point of the target subset is shared across multiply- accumulate (MAC) operators of the neural network,
Xie teaches wherein the target quantization point of the target subset is shared across multiply- accumulate (MAC) operators of the neural network, (Xie, pg. 868, Col. 2, paragraph 1, “The second group aims at improving the efficiency of arithmetic computation, i.e., the multiply-accumulate (MAC) operation [multiply- accumulate (MAC) operators], as it dominates most computations during the DNN model inference. One widely used method is to approximate the original floating-point calculation using fixed-point operation to achieve computation acceleration. This type of method is well-known as quantization [Jacob et al., 2018]. Representative visualization of quantization is shown in Figure 1(a), where 8-bit fixed-point integers are used to approximate floating-point values and 32- bit fixed-point variables are used to hold MAC results. Moreover, in addition to speeding up the MAC operations, quantization also achieves better parallel computing based on the capability of modern CPUs.”)
Zivkovic, Nagel, Li, Lin and Xie are related to the same field of endeavor (i.e. machine learning optimization). In view of the teachings of Xie, it would have been obvious for a person with ordinary skill in the art to apply the teachings of Xie to Zivkovic, Nagel, Li and Lin before the effective filing date of the claimed invention in order to minimize the quantization loss. (Xie, abstract, “With the proposed method, we are able to fully utilize the computing power to minimize the quantization loss and obtain optimized inference performance.”)
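For illustration only, sharing a single table of target quantization points across multiply-accumulate (MAC) operators may be sketched as follows (hypothetical code, not taken from Xie):

    import numpy as np

    # One shared table of target quantization points; each MAC operator decodes its
    # small bit-width weight indices against the same table (hypothetical organization).
    shared_points = np.array([-0.5, -0.25, 0.25, 0.5], dtype=np.float32)

    def mac(inputs, weight_indices):
        # Multiply-accumulate using weights decoded from the shared table.
        return float(np.dot(inputs, shared_points[weight_indices]))

    x = np.random.randn(8).astype(np.float32)
    idx_a = np.random.randint(0, len(shared_points), size=8)   # weight indices for MAC operator A
    idx_b = np.random.randint(0, len(shared_points), size=8)   # weight indices for MAC operator B
    print(mac(x, idx_a), mac(x, idx_b))                        # both reuse the same shared table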
In regard to claim 10, Zivkovic, Nagel, Li, Lin and Xie teach the apparatus of claim 9.
Nagel further teaches wherein the target subset is generated based on the weight of the neural network, and a quantization loss for a subset of quantization points extracted from the candidate set. (Nagel, paragraph 0066, “The operations in the method 400 may be repeated layer by layer for all layers within the neural network (i.e., incrementing the index i with each repetition), performing cross-layer rescaling where necessary to equalize or normalize weights in any layers where necessary to remove outliers [the target subset is generated based on the weight of the neural network,] and reduce quantization errors [a quantization loss for a subset of quantization points extracted from the candidate set.]. The result of performing the operations in the method 400 may be a neural network that is suitable for implementation on a computing device configured with smaller bit-width capacity that was used to trained the neural network.”))
Zivkovic and Nagel are combinable for the same rationale as set forth above with respect to claim 1.
In regard to claim 11, Zivkovic, Nagel, Li, Lin and Xie teach the apparatus of claim 9.
Zivkovic further teaches a first shifter, configured to perform a first multiplication operation for input data based on a first quantization point included in the target quantization point; and (Zivkovic, Col. 14, paragraph 5, “Example 30 includes the apparatus of example 29, including or excluding optional features. In this example, the predicted bit width is the sum of a parameter bit width, and data bit width, and a binary logarithm of a kernel. Optionally, the predicted bit with is a fixed point representation where bits are shifted after each multiplication [a first multiplication operation for input data] to ensure fixed number of fractional bits. [a first quantization point included in the target quantization point]”)
a second shifter, configured to perform a second multiplication operation for the input data based on a second quantization point included in the target quantization point. (Zivkovic, Col. 14, paragraph 5, “Example 30 includes the apparatus of example 29, including or excluding optional features. In this example, the predicted bit width is the sum of a parameter bit width, and data bit width, and a binary logarithm of a kernel. Optionally, the predicted bit with is a fixed point representation where bits are shifted after each multiplication [a second multiplication operation for the input data] to ensure fixed number of fractional bits.”)
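For illustration only, when the quantization points are powers of two, the first and second multiplication operations can be performed by shifters rather than multipliers, as in the following sketch (hypothetical code, not taken from Zivkovic):

    def shift_multiply(x, exponent):
        # Multiply an integer input by a power-of-two quantization point (2**exponent)
        # using a bit shift instead of a hardware multiplier.
        return x << exponent if exponent >= 0 else x >> -exponent

    accumulator = 0
    inputs = [12, -7, 3, 25]          # hypothetical integer input data
    exponents = [1, 0, 2, -1]         # quantization points 2, 1, 4, 0.5 expressed as exponents
    for value, exp in zip(inputs, exponents):
        accumulator += shift_multiply(value, exp)   # shifter outputs fed to the accumulator
    print(accumulator)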
In regard to claim 12, Zivkovic, Nagel, Li, Lin and Xie teach the apparatus of claim 9.
Lin further teaches wherein the decoder comprises a multiplexer, configured to multiplex the target quantization point using the weight as a selector. (Lin, 3:7, 2.3.1, “The routing table building, updating, and maintaining for effective data transmission happen at the network layer. The transport layer achieves the end-to-end transmission of the dataflow, the Quality of Service (QoS), and multi-path multiplexing. [a multiplexer, configured to multiplex the target quantization point using the weight as a selector.] At the same time, the transport layer can also promote energy equality by means of reasonably distributing the traffic load.”)
Zivkovic, Nagel and Lin are combinable for the same rationale as set forth above with respect to claim 9.
In regard to claim 13, Zivkovic, Nagel, Li, Lin and Xie teach the apparatus of claim 9.
Xie further teaches wherein the target quantization point is shared between multiply-accumulate (MAC) operators. (Xie, pg. 868, Col. 2, paragraph 1, “The second group aims at improving the efficiency of arithmetic computation, i.e., the multiply-accumulate (MAC) operation [multiply- accumulate (MAC) operators], as it dominates most computations during the DNN model inference. One widely used method is to approximate the original floating-point calculation using fixed-point operation to achieve computation acceleration. This type of method is well-known as quantization [Jacob et al., 2018]. Representative visualization of quantization is shown in Figure 1(a), where 8-bit fixed-point integers are used to approximate floating-point values and 32- bit fixed-point variables are used to hold MAC results. Moreover, in addition to speeding up the MAC operations, quantization also achieves better parallel computing based on the capability of modern CPUs.”)
Zivkovic, Nagel, Li, Lin and Xie are combinable for the same rationale as set forth above with respect to claim 9.
Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Zivkovic, in view of Nagel and Li and in further view of Xie.
In regard to claim 14, Zivkovic teaches a receiver, configured to receive a weight of a neural network, a candidate set of quantization points, and a bitwidth that represents the weight; and (Zivkovic, Col. 1, paragraph 1, “The neural network may be divided into layers, and each layer may include a plurality of nodes. Each node may function similarly to a neuron. In examples, a particular input value may be applied to any number of the nodes of the network. Each input value of each node is given a weight, [receiving a weight of a neural network,] and each node generates an output value that is a function of the sum of its weighted input values.” And Col. 5, paragraph 4, “Moreover, the present techniques introduce a number of constraints that will limit the number of quantization points to consider [a candidate set of quantization points]. Often, searching for an optimal quantization point is slow. A conventional quantization procedures often checks a large number of possible quantization options.” And Col. 7, paragraph 2, “To use a narrow accumulator, as described above a quantization for the network parameters is selected. The network parameter selection may begin with a minimal meaningful value of 1 bit. Often, a higher value may be used, such as 2 or 3 bits, as 1 bit might hold too little information for quantization. In examples, the starting point for parameter quantization is a number between 1 bit and accumulator bit width. [a bitwidth that represents the received weight]”)
one or more processors, configured to extract a subset of quantization points from the candidate set of quantization points based on the bitwidth, (Zivkovic, Col. 12, paragraph 7, “Example 10 includes the method of any one of examples 6 to 9, including or excluding optional features. In this example, the quantization point is selected via a search procedure that is limited to a subset of possible quantization points, wherein the subset of possible quantization points are constrained by the accumulator bit width. [extracting a plurality of subsets of quantization points from the candidate set of quantization points based on the bitwidth;]”)
However, Zivkovic does not explicitly teach calculate a quantization loss based on the weight of the neural network and the subset of quantization points, and
generate a target subset of quantization points based on the calculated quantization loss,
wherein the one or more processors are further configured to accelerate inference speed by sharing the target subset between multiply-accumulate (MAC) operators, thereby reducing a size of the neural network, reducing a buffer memory size used for storing weights, enabling the neural network to operate with a reduced model size.
Nagel teaches calculate a quantization loss based on the weight of the neural network and the subset of quantization points, and (Nagel, paragraph 0026, “FIG. 1B illustrates that there are five types of loss in the fixed-point quantized pipeline, e.g., input quantization loss, weight quantization loss, runtime saturation loss, activation re-quantization loss, and possible clipping loss for certain non-linear operations. For these and other reasons, the quantization of large bit-width values into small bitwidth representations may introduce quantization noise on the weights and activations. That is, there may be a significant numerical difference between the parameters in a large bit-width model (e.g., float32, int32, etc.) and their small bit-width representations ( e.g., uint8 output value, etc.) [calculating, by a processor, a quantization loss, examiner interprets difference of the small-bit width to the large bit width as the loss being calculated]. This difference may have a significant negative impact on the efficiency, performance or functioning of the neural network on a small bit-width computing device.” And paragraph 0066, “The operations in the method 400 may be repeated layer by layer for all layers within the neural network (i.e., incrementing the index i with each repetition), performing cross-layer rescaling where necessary to equalize or normalize weights in any layers where necessary to remove outliers and reduce quantization errors [calculating]. The result of performing the operations in the method 400 may be a neural network that is suitable for implementation on a computing device configured with smaller bit-width capacity that was used to trained the neural network.”)
generate a target subset of quantization points based on the calculated quantization loss, (Nagel, paragraph 0026, “FIG. 1B illustrates that there are five types of loss in the fixed-point quantized pipeline, e.g., input quantization loss, weight quantization loss, runtime saturation loss, activation re-quantization loss, and possible clipping loss for certain non-linear operations.”)
Zivkovic and Nagel are related to the same field of endeavor (i.e. quantization). In view of the teachings of Nagel, it would have been obvious for a person with ordinary skill in the art to apply the teachings of Nagel to Zivkovic before the effective filing date of the claimed invention in order to improve performance and accuracy of models. (Nagel, paragraph 0036, “For all the reasons discussed above, a computing device configured to perform cross layer rescaling operations in accordance with various embodiments may enhance or improve the performance, accuracy, and precision of the quantized models, and in turn, reduce the computational complexity associated with neural networks.”)
However, Zivkovic and Nagel do not explicitly teach wherein the one or more processors are further configured to accelerate inference speed by sharing the target subset between multiply-accumulate (MAC) operators, thereby reducing a size of the neural network, reducing a buffer memory size used for storing weights, enabling the neural network to operate with a reduced mode size.
Li teaches thereby reducing a size of the neural network, reducing a buffer memory size used for storing weights, enabling the neural network to operate with a reduced model size. (Li, pg. 4, Col. 1, IV. MRSB, paragraph 1, “we first describe MRSB architecture that improves the efficient utilization of buffer resources in the router and then augment MRSB with packet merging that reduces buffer storage bandwidth waste. [reducing a buffer memory size used for storing weights,]” And pg. 8, Col. 1, Conclusion, paragraph 1, “For our experimental workloads, the performance of DPU using MRSB is 25.93% higher than DPU using a state-of-the-art router structure. [improving efficiency of the DPU that runs the neural network.]” The Examiner notes that the target subset with the smallest quantization loss is mapped to Nagel.)
Zivkovic, Nagel and Li are related to the same field of endeavor (i.e. neural networks). In view of the teachings of Li, it would have been obvious for a person with ordinary skill in the art to apply the teachings of Li to Zivkovic and Nagel before the effective filing date of the claimed invention in order to have a more effective buffer memory. (Li, abstract, “For our experimental workloads, experimental results show that MRSB is 221.48% higher effective buffer utilization and 32.98% less latency than a state-of-the-art router with 31.39% smaller area and 29.14% lower power.”)
However, Zivkovic, Nagel and Li do not explicitly teach wherein the one or more processors are further configured to accelerate inference speed by sharing the target subset between multiply-accumulate (MAC) operators,
Xie teaches wherein the one or more processors are further configured to accelerate inference speed by sharing the target subset between multiply-accumulate (MAC) operators, (Xie, pg. 868, Col. 2, paragraph 1, “The second group aims at improving the efficiency of arithmetic computation, i.e., the multiply-accumulate (MAC) operation [multiply- accumulate (MAC) operators], as it dominates most computations during the DNN model inference. One widely used method is to approximate the original floating-point calculation using fixed-point operation to achieve computation acceleration. This type of method is well-known as quantization [Jacob et al., 2018]. Representative visualization of quantization is shown in Figure 1(a), where 8-bit fixed-point integers are used to approximate floating-point values and 32- bit fixed-point variables are used to hold MAC results. Moreover, in addition to speeding up the MAC operations, quantization also achieves better parallel computing based on the capability of modern CPUs.”)
Zivkovic, Nagel, Li and Xie are related to the same field of endeavor (i.e. machine learning optimization). In view of the teachings of Xie, it would have been obvious for a person with ordinary skill in the art to apply the teachings of Xie to Zivkovic, Nagel and Li before the effective filing date of the claimed invention in order to minimize the quantization loss. (Xie, abstract, “With the proposed method, we are able to fully utilize the computing power to minimize the quantization loss and obtain optimized inference performance.”)
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SKYLAR K VANWORMER whose telephone number is (703)756-1571. The examiner can normally be reached M-F 6:00am to 3:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Usmaan Saeed can be reached on (571) 272-4046. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/S.K.V./ Examiner, Art Unit 2146
/USMAAN SAEED/Supervisory Patent Examiner, Art Unit 2146