DETAILED ACTION
This action is responsive to the claims filed on 03/09/2026. Claims 1-20 are pending for examination.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statements (IDS) submitted on 03/09/2026 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.
Response to Arguments
The rejection of claims 1-20 under 35 U.S.C. 101 is withdrawn in view of Applicant’s amendments and remarks. In particular, the amended claims further recite specific features directed to mixed-precision deep learning using computational graph nodes, rollback and recalculation at a detected node, and subsequent return to lower-bit operations. The prior rejection under 35 U.S.C. 101 is therefore not maintained.
Applicant’s arguments with respect to the 35 U.S.C. 103 rejection of claims 1, 7, 14, and 20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or non-obviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1-4, 8-11, and 14-17 are rejected under 35 U.S.C. 103 as being unpatentable over Liu et al. (US 2021/0286688 A1), hereafter referred to as Liu, in view of Rouhani et al. (US 2020/0193274 A1), hereafter referred to as Rouhani, and in further view of Zhu et al. (US 2020/0302283 A1), hereafter referred to as Zhu.
Claim 1: Liu teaches the following limitations:
A non-transitory computer-readable recording medium having stored therein a program that causes a computer to execute a process comprising: (Liu, paragraph 245, “The integrated unit/module may be stored in a computer-readable memory, for example, non-transitory computer-readable memory such as DRAM, SRAM, RRAM, etc., when implemented in the form of a software program module and is sold or used as a separate product.”)
detecting a sign of a failure in learning in operations that are performed with the lower number of bits; (Liu, paragraph 103, “In the present technical scheme, the data bit width n is adjusted according to the quantization error diffbit. Furthermore, the quantization error diffbit is compared with a threshold to obtain a comparison result. The threshold includes a first threshold and a second threshold, and the first threshold is greater than the second threshold. The comparison result may include three situations. If the quantization error diffbit is greater than or equal to the first threshold (situation one), the data bit width can be increased. If the quantization error diffbit is less than or equal to the second threshold (situation two), the data bit width can be reduced. If the quantization error diffbit is between the first threshold and the second threshold (situation three), the data bit width remains unchanged. In practical applications, the first threshold and the second threshold may be empirical values or variable hyperparameters.”, Liu teaches determining quantization error and comparing that error with thresholds during neural-network training/fine-tuning. When the quantization error is greater than or equal to the first threshold, Liu increases the data bit width. Under BRI, excessive quantization error in lower-bit operations is a sign of learning failure or numerical unreliability in lower-bit learning operations. )
and performing a recalculation by an operation with the certain number of bits, (Liu, paragraph 103, “In the present technical scheme, the data bit width n is adjusted according to the quantization error diffbit. Furthermore, the quantization error diffbit is compared with a threshold to obtain a comparison result. The threshold includes a first threshold and a second threshold, and the first threshold is greater than the second threshold. The comparison result may include three situations. If the quantization error diffbit is greater than or equal to the first threshold (situation one), the data bit width can be increased. If the quantization error diffbit is less than or equal to the second threshold (situation two), the data bit width can be reduced. If the quantization error diffbit is between the first threshold and the second threshold (situation three), the data bit width remains unchanged. In practical applications, the first threshold and the second threshold may be empirical values or variable hyperparameters.”, Liu teaches that when the quantization error is greater than or equal to a first threshold, the data bit width can be increased. Thus, upon detection of an abnormal lower-bit result, Liu teaches recalculating or continuing calculation using an increased bit width, corresponding to the claimed operation with the certain number of bits.)
determining whether returning from operations with the certain number of bits to operations with the lower number of bits is allowed, (Liu, paragraph 103, “If the quantization error diffbit is less than or equal to the second threshold (situation two), the data bit width can be reduced.”, Liu teaches comparing quantization error with a second threshold and reducing data bit width when the quantization error is less than or equal to that threshold. This is a determination that returning from higher-bit operations to lower-bit operations is allowed.)
whenever the operation with the certain number of bits is performed at respective nodes subsequent to the first node; (Liu, paragraph 112, “According to the comparison result of the quantization error diffbit and the threshold, the data bit width n used in the quantization of the corresponding layer in the previous iteration or the preset data bit width n of the current layer is adjusted, and the adjusted data bit width is applied to the quantization of the weight of the corresponding layer in the current iteration.”, Liu teaches determining quantization error for the current layer and applying an adjusted data bit width to the corresponding layer in the current iteration. In the Rouhani/Zhu graph-node environment, this repeated layer/node evaluation corresponds to determining whether return is allowed whenever the certain-bit operation is performed at subsequent nodes.)
Rouhani, in the same field of deep learning with varying bit precision, teaches the following which Liu fails to teach:
using deep learning including pre-learning with a floating-point number represented by a certain number of bits (Rouhani, paragraph 85, “At process block 610, parameters, such as weights and biases, of the neural network can be initialized. As one example, the weights and biases can be initialized to random normal-precision floating-point values. As another example, the weights and biases can be initialized to normal-precision floating-point values that were calculated from an earlier training set.”, Rouhani teaches neural-network parameters initialized to normal-precision floating-point values, including values calculated from an earlier training set. Under BRI, the earlier training set corresponds to pre-learning, and the normal-precision floating-point representation corresponds to the claimed floating-point number represented by the certain number of bits.)
and main learning after the pre-learning with a floating-point number represented by a lower number of bits that is smaller than the certain number of bits, (Rouhani, paragraph 36, “In one example of the disclosed technology, a neural network accelerator is configured to accelerate a given layer of a multi-layer neural network using mixed precision data formats. For example, the mixed precision data formats can include a normal-precision floating-point format and a quantized-precision floating-point format. An input tensor for the given layer can be converted from a normal-precision floating-point format to a quantized-precision floating-point format. A tensor operation can be performed using the converted input tensor. A result of the tensor operation can be converted from the block floating-point format to the normal-precision floating-point format. The converted result can be used to generate an output tensor of the layer of the neural network, where the output tensor is in normal-precision floating-point format. In this manner, the neural network accelerator can potentially be made smaller and more efficient than a comparable accelerator that uses only a normal-precision floating-point format. A smaller and more efficient accelerator may have increased computational performance and/or increased energy efficiency. Additionally, the neural network accelerator can potentially have increased accuracy compared to an accelerator that uses only a quantized-precision floating-point format. By increasing the accuracy of the accelerator, a convergence time for training may be decreased and the accelerator may be more accurate when classifying inputs to the neural network.”, Rouhani teaches mixed-precision neural-network training in which normal-precision floating-point input tensors are converted to quantized-precision floating-point format, with mantissa bit widths reduced. 
Thus, Rouhani teaches main learning using a floating-point representation having a lower number of bits than the normal-precision floating-point representation.)
and in performing the main learning on a computational graph including a plurality of nodes each corresponding to a respective operation, (Rouhani, paragraph 40, “The subgraph accelerator 186 can be programmed to execute a subgraph or an individual node of a neural network. For example, the subgraph accelerator 186 can be programmed to execute a subgraph included a layer of a NN. The subgraph accelerator 186 can access a local memory used for storing weights, biases, input values, output values, and so forth. The subgraph accelerator 186 can have many inputs, where each input can be weighted by a different weight value. For example, the subgraph accelerator 186 can produce a dot product of an input tensor and the programmed input weights for the subgraph accelerator 186. In some examples, the dot product can be adjusted by a bias value before it is used as an input to an activation function. The output of the subgraph accelerator 186 can be stored in the local memory, where the output value can be accessed and sent to a different NN processor core and/or to the neural network module 130 or the memory 125, for example.”, Rouhani teaches that a neural-network accelerator may execute a subgraph or individual node of a neural network, and that each node produces an output by applying weights to inputs from preceding nodes. Thus, Rouhani teaches a computational graph having plural nodes, each corresponding to an operation in the neural network.)
and performing the operation with the certain number of bits at respective nodes subsequent to the first node; (Rouhani, paragraph 109, “At process block 960, the converted result in the normal-precision floating-point format can be used to generate an output tensor of the layer of the neural network, where the output tensor is in normal-precision floating-point format. The values transferred between the layers of the neural network can be passed in the normal-precision floating-point format, which may increase an accuracy of the neural network allowing for faster convergence during training and for more accurate inferences. By updating the output tensor of the layers of the neural network, the neural network can potentially classify input data (such as image data, audio data, or other sensory data) into categories.”, Rouhani teaches that values transferred between layers may be passed in normal-precision floating-point format, and that operations may be performed using converted normal-precision floating-point results. In the proposed combination, after rollback/recomputation begins at the first node, subsequent graph nodes are performed using the higher/normal-precision representation.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Liu’s adaptive bit-width control technique with the mixed-precision neural-network training architecture of Rouhani. Liu teaches determining a quantization error during neural-network training, comparing the quantization error to first and second thresholds, increasing data bit width when the quantization error is greater than or equal to the first threshold, and reducing data bit width when the quantization error is less than or equal to the second threshold. Rouhani teaches training/evaluating a neural-network graph or subgraphs using normal-precision floating-point values and quantized floating-point values, including converting normal-precision floating-point values to quantized-precision floating-point values and performing tensor operations in the quantized-precision floating-point format. A person of ordinary skill in the art would have been motivated to incorporate Liu’s threshold-based adaptive bit-width adjustment into Rouhani’s mixed-precision neural-network training system to improve numerical reliability while preserving the known speed, memory, and computational-efficiency benefits of lower-precision floating-point operation. Both Liu and Rouhani address the same general problem of reducing computational cost in neural-network processing while maintaining sufficient numerical precision. The combination would have amounted to applying Liu’s known adaptive precision-control technique to Rouhani’s known mixed-precision neural-network training environment, with the predictable result that lower-bit operations are used when acceptable and higher-bit operations are used when the lower-bit representation produces excessive error.
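For illustration only, the three-situation comparison quoted from Liu, paragraph 103, can be sketched as below. The function name `adjust_bit_width`, the 8-bit step size, and the numeric thresholds used in any call are assumptions made for this example; the cited paragraph states only that the thresholds may be empirical values or variable hyperparameters and does not fix a step size.

```python
def adjust_bit_width(diff_bit, n, first_threshold, second_threshold, step=8):
    """Return the data bit width after comparing the quantization error
    diff_bit against two thresholds.

    The step size is a hypothetical choice made for this sketch.
    """
    if first_threshold <= second_threshold:
        raise ValueError("first threshold must be greater than second threshold")
    if diff_bit >= first_threshold:
        # Situation one: error too large, so the bit width is increased.
        return n + step
    if diff_bit <= second_threshold:
        # Situation two: error small enough, so the bit width is reduced
        # (never below one step in this sketch).
        return max(step, n - step)
    # Situation three: error between the thresholds, bit width unchanged.
    return n
```

Under these assumptions, an error at or above the first threshold moves an 8-bit width to 16 bits, and an error at or below the second threshold moves a 16-bit width back to 8 bits.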
Zhu, in the same field of computational graphing, teaches the following which Liu and Rouhani fail to teach:
upon detection of the sign at a first node of the plurality of nodes, rolling back to the first node and performing a recalculation by an operation with the certain number of bits, (Zhu, paragraph 71, “The routine 700 begins at operation 702, where the ANN training module 106 defines an ANN that includes a number of sets of nodes. In some configurations, the sets of nodes are sets of layers as depicted in FIG. 4, such that activation values generated by one layer are supplied as inputs to another layer.”; Zhu, paragraph 72, “If the difference in accuracy exceeds a defined threshold, bit width for the set of nodes may be increased.”; Zhu, paragraph 73, “quantization error may be computed or estimated by repeating only a portion of the computation in high precision. The repeated portion of the computation may be a computation over a subset of layers, steps, epochs, and/or inputs.”, Zhu teaches an ANN including sets of nodes/layers, detecting that a difference in accuracy or quantization error exceeds a defined threshold, increasing the bit width for the affected set of nodes, and repeating only a portion of the computation in high precision. Under BRI, the affected set of nodes/layers at which the threshold condition is detected corresponds to the claimed first node of the plurality of nodes, and repeating that portion of the computation in high precision corresponds to rolling back to that first node/portion and performing a recalculation by an operation with the certain number of bits.)
upon detection of a second node that allows the returning to operations with the lower number of bits, performing operations with the lower number of bits at respective nodes subsequent to the second node. (Zhu, paragraph 72, “If the difference in accuracy falls below another defined threshold, bit width for the set of nodes may be decreased in order to save computing resources.”; Zhu, paragraph 76, “At operation 706, ANN training module 106 sets a second bit width for activation values for a second set of the plurality of sets of nodes.”; Zhu, paragraph 79, “The routine 700 then proceeds from operation 710 to operation 712, where it ends.” Zhu teaches detecting that the accuracy difference falls below another defined threshold and, in response, decreasing the bit width for a set of nodes to save computing resources. Zhu further teaches setting a second bit width for activation values for a second set of nodes and training the ANN using activation functions that produce values having the assigned bit widths. Under BRI, the set of nodes at which the lower-error threshold condition is satisfied corresponds to the claimed second node that allows returning to lower-bit operations, and the decreased bit width applied to the set of nodes teaches performing subsequent operations using the lower number of bits.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to further modify the combined Liu/Rouhani system with Zhu’s threshold-based mixed-precision node/layer training technique. Liu teaches adaptive bit-width control based on quantization-error thresholds, and Rouhani teaches neural-network graph execution using normal-precision and quantized-precision floating-point formats. Zhu further teaches an ANN including sets of nodes/layers, detecting that a quantization-error or accuracy difference exceeds a threshold, increasing bit width for the affected set of nodes, repeating only a portion of the computation in high precision, and decreasing bit width when the accuracy difference falls below another threshold. A person of ordinary skill in the art would have been motivated to incorporate Zhu’s node/layer-level high-precision recomputation and bit-width decrease technique into the Liu/Rouhani mixed-precision graph-training system so that, when Liu’s threshold condition indicates lower-bit unreliability, the affected node/layer portion can be recomputed using higher precision, and when the lower-error threshold condition is satisfied, subsequent operations can return to lower-bit precision to save computing resources. The combination would predictably improve numerical reliability while preserving the speed, memory, and power benefits of lower-bit mixed-precision training.
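For illustration only, the control flow attributed to the proposed combination (lower-bit operation, detection at a first node, recomputation at that node in higher precision, and return to lower-bit operation after a second node) can be sketched as below. This is not code from Liu, Rouhani, or Zhu. The names `quantize`, `rel_error`, and `run_graph`, the rounding-based stand-in for a lower-bit format, the linear chain of nodes, and the threshold values are all assumptions made for this example.

```python
def quantize(x, frac_bits):
    """Round x to a grid of 2**-frac_bits (a stand-in for a lower-bit format)."""
    scale = 2 ** frac_bits
    return round(x * scale) / scale


def rel_error(exact, approx):
    """Relative error of approx with respect to exact."""
    return abs(exact - approx) / max(abs(exact), 1e-12)


def run_graph(ops, x, frac_bits=4, first_threshold=0.05, second_threshold=0.01):
    """Run a chain of node operations; return (output, mode used at each node)."""
    modes = []
    low = True
    for op in ops:
        node_input = x                       # kept so this node can be redone
        exact = op(node_input)
        approx = quantize(exact, frac_bits)
        err = rel_error(exact, approx)
        if low:
            if err >= first_threshold:
                # Sign detected at this node: go back to its saved input
                # and recompute without the lower-bit rounding.
                x = op(node_input)
                low = False
                modes.append("high")
            else:
                x = approx
                modes.append("low")
        else:
            x = exact
            modes.append("high")
            if err <= second_threshold:
                # Return is allowed: later nodes use the lower-bit path again.
                low = True
    return x, modes
```

With the node operations multiply by 3, divide by 64, multiply by 64, and add 1 applied to an input of 1.0, the second node triggers the recomputation, the third node runs in higher precision and permits the return, and the fourth node runs in the lower-bit path again.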
Claim 2:
The non-transitory computer-readable recording medium according to claim 1, wherein the switching to operations with the lower number of bits comprises switching to deep learning integer (DL-INT) operations or quantized integer (QINT) operations. (Liu, paragraph 172, “The quantization parameter is used by an artificial intelligence processor to quantize data involved in the process of neural network operation and convert high-precision data into low-precision fixed-point data, which may reduce storage space of the data involved in the process of neural network operation. For example, conversion of float32 to fix8 may reduce a model parameter by four times.”, In Liu, the quantization method determines a data bit width that the AI processor uses to “convert high-precision data into low-precision fixed-point data,” with an example of converting float32 to fix8. Float32 arithmetic is the high-precision mode, and fix8 (8-bit fixed-point) is a low-precision, quantized integer representation used by the neural network.)
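For illustration only, a float-to-8-bit fixed-point conversion of the kind referenced in Liu, paragraph 172, can be sketched as below. Liu's own quantization formulas are not reproduced here; the round-and-clamp rule, the function names `to_fix8` and `from_fix8`, and the `point_position` argument are assumptions describing a generic fixed-point scheme.

```python
def to_fix8(values, point_position):
    """Quantize floats to signed 8-bit integer codes with step 2**point_position."""
    step = 2.0 ** point_position
    # Round to the nearest code, then clamp to the signed 8-bit range.
    return [max(-128, min(127, round(v / step))) for v in values]


def from_fix8(codes, point_position):
    """Map 8-bit integer codes back to floats using the same step."""
    step = 2.0 ** point_position
    return [c * step for c in codes]
```

Storing each value in 8 bits rather than 32 bits is consistent with the fourfold reduction stated in the quoted passage.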
Claim 3:
The non-transitory computer-readable recording medium according to claim 1, wherein the detecting the sign comprises detecting the sign when a difference between Q-values of input tensors is out of an allowable range. (Liu, paragraph 119, “To sum up, an extension strategy of the data bit width and the quantization parameter is determined based on the similarity between data. If the similarity exists between data, the data bit width and the quantization parameter can be continuously used. If no similarity exists between data, the data bit width or the quantization parameter needs to be adjusted. The similarity between data is usually measured by KL divergence or by a following formula (17).
[formula (17), image media_image1.png, not reproduced]
”, In Liu, each tensor (e.g., weight or activation) has associated quantization parameters (point location, scaling factor, etc.), which are the Q-values derived from its data. Liu then defines a similarity between data that is measured by KL divergence or by formula (17), which involves differences of statistics such as max and mean between data A and data B. When this similarity test indicates that “no similarity exists between data,” Liu states that “the data bit width or the quantization parameter needs to be adjusted.” In claim terms, the similarity metric can be viewed as a difference between the Q-related values of successive input tensors (or the same tensor at different times); if that difference exceeds an implicit allowable range (“no similarity”), an adjustment is triggered. That adjustment condition is thus a thresholded difference between Q-values of input tensors, which serves as a “sign” that the current quantization configuration is no longer acceptable.)
Claim 4:
The non-transitory computer-readable recording medium according to claim 1, wherein the detecting the sign comprises detecting the sign when a range of variation of Q-values between output tensors is greater than or equal to a certain threshold. (Liu, paragraph 125, “[formula image media_image2.png not reproduced]”; paragraph 140, “[formula (21), image media_image3.png, not reproduced]
[140] In the formula (21), δ refers to a hyperparameter; diffbit refers to a quantization error; and diffupdate2 refers to a variation trend value of data bit width. The variable diffupdate2 measures the variation trend of the data bit width n used in quantization. A greater diffupdate2 indicates that a fixed-point bit width needs to be updated and an update frequency with a shorter interval is needed. [141] The variation trend value of the point position parameter shown in FIG. 7 may still be obtained according to the formula (18), and in the formula (18) is obtained according to the formula (19). diffupdate1 measures the variation trend of the point position parameter s, in which the variation of the point position parameter s is reflected in the variation of the maximum value Zmax of the current data to be quantized. A greater diffupdate1 indicates a larger variation range of numerical values and requires the update frequency with a shorter interval, which means a smaller target iteration interval.”, Here, the position parameter s and the data bit width n are both quantization parameters, i.e., Q-values, that are tracked over time for a given tensor. The document defines variation trend values diffupdate1 and diffupdate2 as quantitative measures of how much these Q-values are changing across weight-update iterations, stating that “a greater diffupdate1 indicates a larger variation range of numerical values” and that large diffupdate1/diffupdate2 imply the bit width “is more likely to be required to be updated” and that a shorter update interval (smaller target iteration interval) is needed. In effect, the method monitors the range of variation of Q-values (s and n) between successive outputs (after each weight update), and when this variation trend exceeds the design’s implicit threshold (diffupdate values become large enough that the target iteration interval is reduced and an update becomes necessary), this is treated as the trigger to adjust quantization. 
That behavior corresponds to detecting a sign when the range of variation of Q-values between output tensors is greater than or equal to a threshold.)
Claims 8-11 and 14-17 recite limitations substantially similar to claims 1-4, therefore a similar analysis applies.
Claim 8 recites additional elements for consideration:
An information processing method executed by a computer, the method comprising… by a processor on the computer (Liu, paragraph 12, “The present disclosure provides a neural network quantization parameter determination device including a memory and a processor, in which the memory stores a computer program that can run on the processor, and steps of the above method are implemented when the processor executes the computer program.”)
Claim 14 recites additional elements for consideration:
An information processing apparatus comprising a processor configured to execute a process comprising (Liu, paragraph 12, “The present disclosure provides a neural network quantization parameter determination device including a memory and a processor, in which the memory stores a computer program that can run on the processor, and steps of the above method are implemented when the processor executes the computer program.”)
Claims 5, 12 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Liu in view of Rouhani, and in further view of Zhu and Kim et al. (Kim, D., Ahn, J., & Yoo, S. (2017, March). A novel zero weight/activation-aware hardware architecture of convolutional neural network. In Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017 (pp. 1462-1467). IEEE.), hereafter referred to as Kim.
Claim 5: Kim, in the same field of neural network hardware implementation, teaches the following limitation which Liu, Rouhani, and Zhu fail to teach:
The non-transitory computer-readable recording medium according to claim 1, wherein the detecting the sign comprises: determining whether sampled values to be used for calculating a Q-value are all zeros, and based on past Q-values, detecting the sign when the sampled values are all zeros. (Kim, page 1463, col. 1, section A, paragraph 2, “This is because the activation is shared by the two convolutions, and thus, computation associated with zero weights can be skipped only if all the kernel weights are zero at the same cycle.”, Kim’s architecture explicitly checks whether a group of weights is all zero before skipping computation. Those weights/activations are “sampled values” for a convolution tile; checking “all the kernel weights are zero” corresponds to determining whether all sampled values are zero. Kim also uses non-zero bit vectors to record zero vs non-zero status for tiles (state/history), analogous to storing past Q-values / statistics. When an all-zero pattern is detected, the hardware treats this as a special condition (sign), which is parallel to claiming a sign when all sampled values used to compute a Q-value are zero.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have incorporated the teachings of Liu, Rouhani, and Zhu with the teachings of Kim, in order to improve Liu’s detection of numerically or functionally “dead” regions in the network by using well-known all-zero weight/activation checks. Kim proposes a hardware architecture that explicitly tests whether all kernel weights or activations in a tile are zero and uses metadata (bit-vectors) to mark and exploit these all-zero patterns so that computation for those tiles can be skipped. A person of ordinary skill would have recognized that Kim’s “all-zero tile” detection is a natural concrete instantiation of Liu’s more abstract “sampled values used to calculate a Q-value” being all zeros: the tile weights/activations are the sampled values, and “all zero” is an abnormal pattern that should influence the Q-value-based decision logic. Incorporating Kim’s all-zero detection into Liu’s framework would thus allow Liu’s system to treat an all-zero sample set—compared against stored state/history of Q-values or bit-vectors—as a sign condition, improving both efficiency (skip compute) and robustness (flag dead channels) in exactly the way claimed. (Kim, page 1463, col. 1, section A, paragraph 1, “Thus, multiplications and accumulations associated with zero activations can be skipped in those two convolutions in a synchronous manner”)
Claims 12 and 18 recite limitations substantially similar to claim 5, therefore a similar analysis applies.
Claims 6, 13 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Liu in view of Rouhani and Zhu, and in further view of Yamaguchi et al. (Yamaguchi, H., Ito, M., Yoda, K., & Ike, A. (2021, February). Training deep neural networks in 8-bit fixed point with dynamic shared exponent management. In 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE) (pp. 1536-1541). IEEE.), hereafter referred to as Yamaguchi.
Claim 6: Yamaguchi, in the same field of neural network hardware implementation, teaches the following limitation which Liu, Rouhani, and Zhu fail to teach:
The non-transitory computer-readable recording medium according to claim 1, wherein the detecting the sign comprises detecting the sign when number of elements undergoing overflows or underflows is greater than or equal to a certain threshold relative to number of elements to be sampled. (Yamaguchi, page 1539, col. 2, last paragraph, “As a result, the ratio of overflow data has exceeded the 1%,, . Values are unexpectedly overflow, as shown by the red bar in Fig. 10, and the precision of the data is degraded with shared exponent y, = —22. To prevent such overflow, we add an “offset” to the shared exponent.”, Yamaguchi directly uses a ratio of overflow data—i.e., number of overflows divided by total elements—as a monitored statistic. When this ratio exceeds r_max (a predefined threshold), they treat it as an abnormal condition and adjust quantization parameters (shared exponent). The “ratio of overflow data” is being interpreted as the number of elements undergoing overflow relative to the number of elements sampled, and r_max is the “certain threshold.” Exceeding r_max is thus a sign of failure based on overflow/underflow frequency.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have incorporated the teachings of Liu, Rouhani, and Zhu with the teachings of Yamaguchi, in order to make Liu’s detection of numeric failure more concrete by using an explicit overflow ratio threshold over tensor elements. Yamaguchi’s INT8-with-DSE method quantizes tensors with a shared exponent and directly measures the ratio of tensor elements that overflow, comparing this ratio against a predetermined maximum (e.g., 1%); when the ratio exceeds that threshold, they treat this as a problem and adjust the shared exponent. A person of ordinary skill would have found it straightforward and advantageous to adopt Yamaguchi’s overflow ratio as the concrete statistic in Liu’s framework, interpreting “ratio of overflow data greater than r_max” as the event “number of elements undergoing overflow… is greater than or equal to a certain threshold relative to the number of elements to be sampled,” and then using Liu’s existing mechanism to treat such an event as a sign of failure in learning that triggers rollback or exponent/Q-value adjustment. This combination uses an already-known overflow-rate test to implement Liu’s threshold-based sign detection in mixed-precision training. (Yamaguchi, page 1539, col. 2, last paragraph, “As a result, the ratio of overflow data has exceeded the 1%,, . Values are unexpectedly overflow, as shown by the red bar in Fig. 10, and the precision of the data is degraded with shared exponent y, = —22. To prevent such overflow, we add an “offset” to the shared exponent.”)
Claims 13 and 19 recite limitations substantially similar to those of claim 6; therefore, a similar analysis applies.
Claims 7 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Liu in view of Rouhani and Zhu, and in further view of Narihira et al. (Narihira, T., Alonsogarcia, J., Cardinaux, F., Hayakawa, A., Ishii, M., Iwaki, K., ... & Yoshiyama, K. (2021). Neural Network Libraries: A Deep Learning Framework Designed from Engineers' Perspectives. arXiv preprint arXiv:2102.06725.), hereafter referred to as Narihira, and Nvidia (Mixed precision training. Mixed Precision Training - OpenSeq2Seq 0.2 documentation. (2020, October 29). https://web.archive.org/web/20201029204617/https://nvidia.github.io/OpenSeq2Seq/html/mixed-precision.html), hereafter referred to as Nvidia.
Claim 7: Narihira, in the same field of neural network mixed precision training, teaches the following limitation which Liu, Rouhani, and Zhu fail to teach:
The information processing apparatus according to claim 14, wherein the determining whether the returning is allowed comprises recording, in a status counter, (Narihira, page 6, last paragraph, “Use dynamic loss scaling to prevent overflow/underflow scaling_factor = 2 counter = 0 interval = 2000”, Narihira teaches dynamic loss scaling using a counter and an interval. The counter is a status counter because it records training status over iterations and is used to decide when loss scale may be increased, i.e., when returning toward more aggressive lower-bit operation is allowed.)
each of a number of operations with a certain number of bits, (Narihira, page 6, last paragraph, “In NNL, mixed precision training can be used by setting type_config as half in extension context setting. When using mixed precision training with NVIDIA Volta, storage (weights, activations, gradients) is performed in FP-16. Forward and back-propagation employ TensorCore, where batch normalization is in FP-32. Update is also performed in FP-32, although the weights are managed in both FP-16 and 32.”, Narihira teaches mixed precision training in which updates are performed in FP32 and weights are managed in both FP16 and FP32. Each counted stable training iteration includes the FP32 update path, corresponding to operations with the certain number of bits.)
a number of operations with a lower number of bits, (Narihira, page 6, last paragraph, “When using mixed precision training with NVIDIA Volta, storage (weights, activations, gradients) is performed in FP-16.”, Narihira teaches that in mixed precision training, storage of weights, activations, and gradients is performed in FP16, and forward/back-propagation uses Tensor Cores. Thus, each training iteration includes lower-bit operations, corresponding to operations with the lower number of bits.)
and a number of retries of the operations with the lower number of bits (Narihira, page 7, Listing 6, “if solver.check_inf_or_nan_grad(): loss_scale /= scaling_factor counter = 0”, Narihira teaches checking for inf/NaN gradients and resetting the counter when such an abnormality is detected. Because the update occurs only in the non-abnormal branch, the abnormal branch corresponds to an unsuccessful lower-bit mixed-precision attempt that must be backed off/retried.)
and acquiring respective values from the status counter (Narihira, page 7, Listing 6, “if counter > interval:”, Narihira teaches evaluating the value of the counter by determining whether “counter > interval,” and then increasing the loss scale when that counter condition is satisfied. Because the counter value represents the number of successful repeated mixed-precision iterations since the prior reset, and because each such iteration includes both FP16 lower-bit operations and FP32 certain-bit operations, acquiring the counter value corresponds to acquiring the respective operation-status values from the status counter for determining whether return toward lower-bit operation is allowed.)
to determine whether the returning to operations with the lower number of bits is allowed by determining whether the operation with the certain number of bits is repeated for a certain number of times (Narihira, page 7, Listing 6, “if counter > interval: loss_scale *= scaling_factor counter = 0 counter += 1”; Narihira, page 6, “Update is also performed in FP-32, although the weights are managed in both FP-16 and 32.” Narihira teaches that the counter is incremented after successful training iterations and that the loss scale is increased when the counter exceeds the interval. Narihira also teaches that each mixed-precision training iteration includes FP32 update operations, which correspond to operations with the certain number of bits. Therefore, determining whether “counter > interval” determines whether the FP32/certain-bit update operation has been repeated for a certain number of times. Increasing the loss scale after the counter exceeds the interval permits a return toward more aggressive lower-bit mixed-precision operation because the system increases the dynamic range used for FP16 training after sufficient stable repeated operations.)
It would have been obvious to one of ordinary skill in the art before the effective filing date to further modify the Liu/Rouhani/Zhu system with the dynamic loss-scaling counter logic taught by Narihira. Liu teaches increasing and reducing bit width based on detected error thresholds, Rouhani teaches a mixed-precision floating-point graph-training environment, and Zhu teaches graph rollback/recomputation using checkpointing and local recomputation. Narihira teaches mixed-precision training in which storage of weights, activations, and gradients is performed in FP16, update operations are performed in FP32, and weights are managed in both FP16 and FP32, and further teaches an automatic/dynamic loss-scaling process that initializes a counter, sets an interval, checks for inf or NaN gradients, resets the counter when such an abnormality occurs, and increases the loss scale when the counter exceeds the interval. A person of ordinary skill in the art would have been motivated to incorporate Narihira’s counter-based dynamic loss-scaling logic into the Liu/Rouhani/Zhu mixed-precision graph-training system to provide a concrete and known mechanism for determining when return to lower-bit mixed-precision operation is numerically safe. The combination would have predictably allowed the system to record operational status, count stable repeated operations, reset or retry upon abnormality, and permit return toward lower-bit operation only after the relevant counter condition is satisfied.
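For illustration only, the counter-based control flow described in the quoted portions of Narihira's Listing 6 can be sketched as follows. This is a simplified simulation under stated assumptions, not Narihira's code: the function name `dynamic_loss_scaling` is hypothetical, the list of boolean flags stands in for the result of `solver.check_inf_or_nan_grad()` at each iteration, and the `updates` tally is added only to make the skipped-update branch visible.

```python
def dynamic_loss_scaling(inf_or_nan_flags, loss_scale=8.0,
                         scaling_factor=2.0, interval=2000):
    """Simulate counter-based dynamic loss scaling over a run of iterations.

    Each flag is True when that iteration's gradients contain inf/NaN
    (a hypothetical stand-in for solver.check_inf_or_nan_grad()).
    Returns the final loss scale, the counter, and the number of updates.
    """
    counter = 0
    updates = 0
    for has_inf_or_nan in inf_or_nan_flags:
        if has_inf_or_nan:
            # Abnormal branch: back off the loss scale, reset the counter,
            # and skip the weight update for this iteration.
            loss_scale /= scaling_factor
            counter = 0
            continue
        # Normal branch: the (FP32) update would be applied here.
        updates += 1
        if counter > interval:
            # Enough consecutive stable iterations: raise the loss scale.
            loss_scale *= scaling_factor
            counter = 0
        counter += 1
    return loss_scale, counter, updates
```

In this sketch, the counter is reset on every abnormal iteration, so the loss scale is raised only after more than `interval` consecutive iterations without an inf/NaN gradient.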
Nvidia, in the same field of neural network mixed precision training, teaches the following limitation which Liu, Rouhani, Zhu, and Narihira fail to teach:
or by determining whether a retry rate in the repeated operation with the certain number of bits is below a certain threshold. (Nvidia, page 4, paragraph 1, “Backoff scaling begins with a large loss scale and checks for overflow in the parameter gradients at the end of each iteration. Whenever there is an overflow, the loss scale decreases by a constant factor (default is 2) and the optimizer will skip the update. Furthermore, if there has been no overflow for a period of time, the loss scale increases by a constant factor (defaults are 2000 iterations and 2, respectively). These two rules together ensure both that the loss scale is as large as possible and also that it can adjust to shifting dynamic range during training.”, Nvidia teaches Backoff scaling in which overflow causes scale decrease and skipped update, while no overflow for a period causes the scale to increase. A no-overflow period corresponds to a retry/abnormality rate of zero, which is below a threshold. Thus, Nvidia teaches determining that return to lower-bit operation is allowed when the retry rate during the repeated operation period is below a threshold.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have incorporated the teachings of Liu/Rouhani/Zhu/Narihira with the teachings of Nvidia’s Backoff scaling in order to implement Liu’s “abnormality occurrence rate” condition for returning from high-precision back to low-precision operations using a windowed overflow-rate test over iterations. Liu provides a control loop that temporarily uses operations with a higher number of bits when a sign of failure is detected and allows a return to lower-bit operations when it determines that such a return is safe, based on abnormality rates. Nvidia’s Backoff algorithm does almost exactly that in practice: it repeatedly performs mixed-precision training steps, checks for gradient overflow at each iteration, and decreases the loss scale and skips the update when overflow occurs, effectively falling back to safer numeric conditions. It then defines a period of iterations with no overflow as the stability criterion upon which it increases the loss scale again, effectively returning to more aggressive low-precision usage. A person of ordinary skill would have immediately recognized that this “no overflow for K iterations” pattern is a specific implementation of Liu’s general requirement that after repeating training for a first number of times, if the abnormality occurrence rate is not greater than a second threshold, returning to low precision is allowed, and would have been motivated to reuse Nvidia’s well-tested Backoff behavior as a method inside Liu’s precision-switching framework. (Nvidia, Automatic Loss Scaling, “Backoff scaling begins with a large loss scale and checks for overflow in the parameter gradients at the end of each iteration. Whenever there is an overflow, the loss scale decreases by a constant factor (default is 2) and the optimizer will skip the update. Furthermore, if there has been no overflow for a period of time, the loss scale increases by a constant factor (defaults are 2000 iterations and 2, respectively). These two rules together ensure both that the loss scale is as large as possible and also that it can adjust to shifting dynamic range during training.”)
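For illustration only, the windowed retry-rate condition discussed above can be sketched as follows. This is a minimal sketch under stated assumptions and is not code from Nvidia or Liu: the function name and the parameters `min_repeats` and `max_retry_rate` are hypothetical, and each flag is assumed to record whether a given iteration overflowed and therefore had to be retried or skipped. The "no overflow for a period of time" rule of Backoff scaling corresponds to the special case in which the retry rate over the window is zero.

```python
def return_to_lower_bits_allowed(retry_flags, min_repeats, max_retry_rate):
    """Decide whether returning to lower-bit operation is allowed.

    retry_flags: per-iteration booleans, True when the iteration overflowed
        and was retried or skipped (hypothetical bookkeeping).
    min_repeats: number of most recent iterations that must be observed.
    max_retry_rate: the retry rate over that window must be strictly below
        this threshold.
    """
    if min_repeats <= 0 or len(retry_flags) < min_repeats:
        # Not enough repeated operations observed yet.
        return False
    window = retry_flags[-min_repeats:]
    retry_rate = sum(1 for flag in window if flag) / len(window)
    return retry_rate < max_retry_rate
```

Under these assumptions, a window of 2000 iterations with no overflow yields a rate of zero and permits the return, whereas a window in which 30 of 2000 iterations overflowed does not satisfy a 1% threshold.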
Claim 20 recites limitations substantially similar to those of claim 7; therefore, a similar analysis applies.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
US 20200210840 A1 - Adjusting precision and topology parameters for neural network training based on a performance metric
Hu, H., Peng, R., Tai, Y. W., & Tang, C. K. (2016). Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250.
Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, D., ... & Wu, H. (2017). Mixed precision training. arXiv preprint arXiv:1710.03740.
Zhao, R., Vogel, B., & Ahmed, T. (2019). Adaptive loss scaling for mixed precision training. arXiv preprint arXiv:1910.12385.
US 20190340499 A1
Taras, I., & Stuart, D. M. (2018). Quantization error as a metric for dynamic precision scaling in neural net training. arXiv preprint arXiv:1801.08621.
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to HYUNGJUN B YI whose telephone number is (703)756-4799. The examiner can normally be reached M-F 9-5.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Usmaan Saeed, can be reached at (571) 272-4046. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/H.B.Y./Examiner, Art Unit 2146
/USMAAN SAEED/Supervisory Patent Examiner, Art Unit 2146