Prosecution Insights
Last updated: April 19, 2026
Application No. 17/852,484

Hardware-Aware Mixed-Precision Quantization

Non-Final OA · §101 · §103 · §112
Filed: Jun 29, 2022
Examiner: DETERDING, GWYNEVERE AMELIA
Art Unit: 2125
Tech Center: 2100 — Computer Architecture & Software
Assignee: MediaTek Inc.
OA Round: 1 (Non-Final)
Grant Probability: 100% (Favorable)
OA Rounds: 1-2
To Grant: 3y 3m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 100% (2 granted / 2 resolved; +45.0% vs TC avg) — above average
Interview Lift: +100.0%, a strong lift (allowance with vs. without interview, among resolved cases)
Typical Timeline: 3y 3m avg prosecution (14 currently pending)
Career History: 16 total applications across all art units

Statute-Specific Performance

§101: 21.3% (-18.7% vs TC avg)
§103: 32.0% (-8.0% vs TC avg)
§102: 8.0% (-32.0% vs TC avg)
§112: 20.0% (-20.0% vs TC avg)
Tech Center averages are estimates. Based on career data from 2 resolved cases.

Office Action

§101 · §103 · §112
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. Claims 1-20 are presented for examination.

Claim Objections

Claims 3, 7-8, 14, and 17-18 are objected to because of the following informalities:

Claim 3: “one of a” should read “one of”
Claims 7 and 17: "goup AQS" should read "group AQS"
Claims 8 and 18: “the maximum AQS value” should read “a maximum AQS value”
Claim 14: “The system of claim 12, the given” should read “The system of claim 12, wherein the given”

Specification

The disclosure is objected to because of the following informalities:

Paragraph 29: "Activation quantizers 316 and 317" should read "Activation quantizers 317 and 327"; additionally, the description reads that weight quantizer 319 is applied to Activation_i(t) 235, whereas the corresponding Fig. 2B depicts the weight quantizer 319 applied to weight(t) 218
Paragraph 31: "Group formation is depending" should read "Group formation is dependent"
Paragraph 38: "apply to each" should read "are applied to each"
Paragraph 44: "is chosen in is chosen in this example" should read "is chosen in this example"
Paragraph 49: "steps 642-644 are repeated for each quantization group" should read "steps 642-645 are repeated for each quantization group" based on corresponding Fig. 6

Appropriate correction is required.

Drawings

The drawings are objected to as failing to comply with 37 CFR 1.84(p)(5) because (a) the following reference characters are not mentioned in the description: 327, 645; (b) reference character “316” has been used to designate both an activation quantizer in paragraph 29 and Output(t) in Fig. 2B, and reference character “644” has been used to designate both calculating a group quality for the quantization group in paragraph 49 and “Average group AQS” in Fig. 6; and (c) Fig. 5A contains text that is too small and fuzzy to be read clearly.

Corrected drawing sheets in compliance with 37 CFR 1.121(d), or amendment to the specification to add the reference characters to the description in compliance with 37 CFR 1.121(b), are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows:

Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The analysis of the claims will follow the 2019 Revised Patent Subject Matter Eligibility Guidance (“2019 PEG”).

Claim 1

Step 1: The claim is directed to a method, and is therefore directed to the statutory category of processes.
Step 2A Prong 1: The claim recites:

calculating an activation quantization sensitivity (AQS) value for each of a plurality of convolution layers in a neural network, wherein the AQS value indicates sensitivity of convolution output to quantized convolution input;
This limitation recites a mathematical calculation.

forming a plurality of quantization groups by grouping one or more convolution layers into a quantization group to be executed by a corresponding set of target hardware;
This limitation encompasses mentally sorting convolutional layers into a plurality of quantization groups.

calculating a group AQS value for each quantization group based on AQS values of the convolution layers in the quantization group;
This limitation recites a mathematical calculation.

selecting bit-widths supported by the target hardware platform for corresponding quantization groups to optimize, under a given constraint, a sensitivity metric that is calculated based on each quantization group's group AQS value;
This limitation encompasses mentally selecting a bit-width for each quantization group that is supported by the target hardware platform and optimizes a sensitivity metric calculated based on each quantization group’s group AQS value.

Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim further recites that the method for determining bit-widths is “for mixed-precision neural network computing on a target hardware platform.” However, this limitation amounts to merely indicating a field of use in which to apply a judicial exception (MPEP 2106.05(h)).

Step 2B: The claim does not contain significantly more than the judicial exception. The “for mixed-precision neural network computing on a target hardware platform” limitation amounts to merely indicating a field of use in which to apply a judicial exception, as stated above (MPEP 2106.05(h)). As an ordered whole, the claim is directed to the abstract idea of performing mathematical calculations to determine optimal bit-widths for a neural network model. Nothing in the claim provides significantly more than this. As such, the claim is not patent eligible.
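For readers mapping the recited steps onto an implementation, a minimal sketch of the claim 1 pipeline is shown below. This is illustrative only: the function names, the relative-error AQS proxy, and the sum-based group AQS are assumptions, not taken from the application as filed.

```python
import numpy as np

def layer_aqs(output_full: np.ndarray, output_quant_input: np.ndarray) -> float:
    """AQS proxy: relative change in a convolution layer's output when
    its input activation is quantized (hypothetical metric)."""
    return float(np.linalg.norm(output_full - output_quant_input)
                 / np.linalg.norm(output_full))

def group_aqs(member_aqs: list[float]) -> float:
    """Group AQS derived from member-layer AQS values; a plain sum is
    used here (claim 8's variant takes the maximum instead)."""
    return sum(member_aqs)

# Selecting hardware-supported bit-widths per group, to optimize a
# sensitivity metric under a constraint, is then a small discrete
# optimization -- see the ILP sketch in the §103 discussion below.
```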
Claim 2

Step 1: A process, as above.

Step 2A Prong 1: The claim recites:

performing Integer Linear Programming (ILP) to obtain a mixed-precision quantization configuration for the neural network that minimizes a total AQS under the given constraint, wherein the total AQS is a sum of all group AQS values in the neural network;
This limitation recites the mathematical concept of performing Integer Linear Programming.

Step 2A Prong 2: This judicial exception is not integrated into a practical application. See analysis of claim 1.

Step 2B: The claim does not contain significantly more than the judicial exception. See analysis of claim 1.

Claim 3

Step 1: A process, as above.

Step 2A Prong 1: The claim recites:

wherein the given constraint is one of a model size, latency, and total binary operations of the neural network;
This limitation merely further limits the constraint used in performing Integer Linear Programming, which is a mathematical concept.

Step 2A Prong 2: This judicial exception is not integrated into a practical application. See analysis of claim 1.

Step 2B: The claim does not contain significantly more than the judicial exception. See analysis of claim 1.

Claim 4

Step 1: A process, as above.

Step 2A Prong 1: The claim recites:

wherein the given constraint is total binary operations (BOPS), which limits a sum of per-group BOPS, and wherein per-group BOPS is calculated as the number of MAC operations multiplied by the bit-width of input activation and the bit-width of weights;
This limitation merely further limits the constraint used in selecting bit-widths for quantization groups. Selecting bit-widths to optimize a sensitivity metric under a given constraint is still mentally performable using this constraint.

Step 2A Prong 2: This judicial exception is not integrated into a practical application. See analysis of claim 1.

Step 2B: The claim does not contain significantly more than the judicial exception. See analysis of claim 1.

Claim 5

Step 1: A process, as above.

Step 2A Prong 1: The claim recites:

selecting an input activation bit-width and a weight bit-width for each quantization group to optimize the sensitivity metric under the given constraint;
This limitation encompasses mentally selecting an input activation bit-width and a weight bit-width for each quantization group that optimizes a sensitivity metric under a given constraint.

Step 2A Prong 2: This judicial exception is not integrated into a practical application. See analysis of claim 1.

Step 2B: The claim does not contain significantly more than the judicial exception. See analysis of claim 1.

Claim 6

Step 1: A process, as above.

Step 2A Prong 1: The claim recites:

identifying one or more quantization groups to lower corresponding bit-widths under a user-defined accuracy criterion;
This limitation encompasses mentally identifying quantization groups to lower corresponding bit-widths based on an accuracy criterion defined by a user.

Step 2A Prong 2: This judicial exception is not integrated into a practical application. See analysis of claim 1.

Step 2B: The claim does not contain significantly more than the judicial exception. See analysis of claim 1.

Claim 7

Step 1: A process, as above.

Step 2A Prong 1: The claim recites:

wherein an identified quantization group is one having a highest group quality log(MAC + 1) / (goup AQS) among all quantization groups, wherein MAC represents a number of multiply-and-add operations;
This limitation merely further limits the identifying limitation of claim 6. The identifying limitation is still mentally performable, as one can mentally identify a quantization group having a highest group quality.

Step 2A Prong 2: This judicial exception is not integrated into a practical application. See analysis of claim 1.

Step 2B: The claim does not contain significantly more than the judicial exception. See analysis of claim 1.

Claim 8

Step 1: A process, as above.

Step 2A Prong 1: The claim recites:

wherein the group AQS value is the maximum AQS value of all convolution layers in the quantization group;
This limitation merely further limits the calculating a group AQS value limitation of claim 1, which still constitutes a mathematical calculation. Additionally, finding a maximum value is mentally performable.

Step 2A Prong 2: This judicial exception is not integrated into a practical application. See analysis of claim 1.

Step 2B: The claim does not contain significantly more than the judicial exception. See analysis of claim 1.

Claim 9

Step 1: A process, as above.
Step 2A Prong 1: The claim recites:

calculating a first output of a convolution layer using input activation and weights at step t;
This limitation recites a mathematical calculation.

calculating a second output of the convolution layer using the input activation and the weights at step (t-1);
This limitation recites a mathematical calculation.

measuring a difference between the first output and the second output to compute the AQS value of the convolution layer;
This limitation recites the mathematical concept of measuring a difference between two values.

Step 2A Prong 2: This judicial exception is not integrated into a practical application. See analysis of claim 1.

Step 2B: The claim does not contain significantly more than the judicial exception. See analysis of claim 1.

Claim 10

Step 1: A process, as above.

Step 2A Prong 1: The claim recites:

calculating a first output of a convolution layer using input activation and weights at a first data precision;
This limitation recites a mathematical calculation.

calculating a second output of the convolution layer using the input activation and the weights at a second data precision, the second data precision using a larger bit-width than the first data precision;
This limitation recites a mathematical calculation.

measuring a difference between the first output and the second output to compute the AQS value of the convolution layer;
This limitation recites the mathematical concept of measuring a difference between two values.

Step 2A Prong 2: This judicial exception is not integrated into a practical application. See analysis of claim 1.

Step 2B: The claim does not contain significantly more than the judicial exception. See analysis of claim 1.

Claim 11

Step 1: A process, as above.

Step 2A Prong 1: The claim recites:

wherein grouping the one or more convolution layers into the quantization group further comprises: grouping operation (OP) layers that share a same input activation;
This limitation merely further limits the grouping limitation of claim 1. The grouping limitation is still mentally performable, as one can mentally group layers that share a same input activation.

Step 2A Prong 2: This judicial exception is not integrated into a practical application. See analysis of claim 1.

Step 2B: The claim does not contain significantly more than the judicial exception. See analysis of claim 1.

Claims 12-20

Step 1: The claims recite a system, and are therefore directed to the statutory category of machines.

Step 2A Prong 1: Claims 12-20 recite the same judicial exception as claims 1-2 and 4-10, respectively.

Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claims further recite “memory to store a neural network” and “processing circuitry coupled to the memory and operative to: [perform the method].” However, these limitations are mere instructions to apply the judicial exception on a generic computer (MPEP 2106.05(f)).

Step 2B: The claims do not contain significantly more than the judicial exception. The memory and processing circuitry limitations amount to mere instructions to apply the judicial exception on a generic computer (MPEP 2106.05(f)), as stated above. As an ordered whole, the claims are directed to the abstract idea of performing mathematical calculations to determine optimal bit-widths for a neural network model. Nothing in the claims provides significantly more than this. As such, the claims are not patent eligible.
Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b):

(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 4, 9, 14, and 19 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or, for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.

Claims 4 and 14 recite the elements "the number of MAC operations,” “the bit-width of input activation” and “the bit-width of weights.” There is insufficient antecedent basis for these elements. The examiner recommends “a number of MAC operations,” “a bit-width of input activation,” and “a bit-width of weights.”

Claims 9 and 19 recite the element “the weights at step (t-1)”, which has insufficient antecedent basis. The examiner recommends “weights at step (t-1).”

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103, which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:

1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

This application currently names joint inventors. In considering patentability of the claims, the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1-5, 9-10, 12-15 and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Zhang (CN113962365A) in view of Fan et al. (HFPQ: deep neural network compression by hardware-friendly pruning-quantization) (“Fan”), and further in view of Yao et al. (HAWQ-V3: Dyadic Neural Network Quantization) (“Yao”).

Regarding claim 1, Zhang discloses “A method for determining bit-widths for mixed-precision neural network computing on a target hardware platform, comprising: calculating an activation quantization sensitivity (AQS) value for each of a plurality of… layers in a neural network, wherein the AQS value indicates sensitivity of… output to quantized… input (Zhang, [n0011]: “determining the quantization sensitivity parameter corresponding to each network layer in the first i blocks under the quantization configuration, wherein the quantization sensitivity parameter corresponding to a network layer is used to characterize the difference between the output result before quantization and the output result after quantization of the network layer”; the examiner notes that “quantization of the network layer” corresponds to quantized input, see [n0114]: “perform quantization processing on the input image, weight parameters and output image of each network layer”);

forming a plurality of quantization groups by grouping one or more… layers into a quantization group (Zhang, [n0007]: “The neural network includes multiple blocks, each block including at least one network layer. The method includes: acquiring a target quantization configuration set for a group of blocks in the neural network, the group of blocks including at least one block; filtering a candidate quantization configuration set for another group of blocks in the neural network based on the acquired target quantization configuration set for the group of blocks to obtain a target quantization configuration set for the other group of blocks”; the examiner notes that the groups of blocks correspond to “a plurality of quantization groups”) to be executed by a corresponding set of target hardware (Zhang, [n0004]: “One feasible approach to deploying neural networks in resource-constrained scenarios is to quantize the full-precision parameters in the neural network into lower precision, so as to use less bit width to store the parameters, thereby reducing memory requirements. Hardware devices can support quantization methods with various bit widths”);

calculating a group AQS value for… [a] quantization group based on AQS values of the… layers in the quantization group…” (Zhang, [n0012]: “summing the quantization sensitivity parameters of each network layer in the first i blocks under the quantization configuration to obtain the estimated value of the quantization sensitivity parameter of the first i blocks under the quantization configuration”; the examiner notes that the “first i blocks” correspond to a quantization group and “the quantization sensitivity parameter of the first i blocks” corresponds to a group AQS value).

Zhang does not appear to explicitly disclose calculating a group AQS value for each quantization group, or that the layers are convolution layers. However, Fan discloses “calculating a group… [sensitivity] value for each… group [of convolution layers]” (Fan, Fig. 3: Sensitivity analysis of VGG16. The 13 convolutional layers of VGG16 are divided into 5 large blocks. Each block contains two or three convolutional layers. The sensitivity of different blocks or convolutional layers is indicated by the color shade. The darker the color, the larger is the sensitivity. The five blocks in this figure are sorted by sensitivity to Conv2B > Conv3B > Conv1B > Conv5B > Conv4B. For block 3, the sensitivity of the convolutional layer is in the order of Conv5 > Conv6 > Conv7).
Fan and the instant application both relate to neural network quantization and are analogous. It would have been obvious to one of ordinary skill in the art, prior to the effective filing date of the claimed invention, to have modified the neural network layers of Zhang to be convolution layers, and to have modified the calculating a group AQS value step of Zhang to be performed for each quantization group, as disclosed by Fan, and one would have been motivated to do so for the purpose of minimizing accuracy loss of a deep neural network after compression, and to make the neural network better implemented in hardware (see Fan, section 3, paragraphs 1-2).

Neither Zhang nor Fan appear to explicitly disclose “selecting bit-widths supported by the target hardware platform for corresponding quantization groups to optimize, under a given constraint, a sensitivity metric that is calculated based on each quantization group’s group AQS value.” However, Yao discloses “selecting bit-widths supported by the target hardware platform for corresponding… [layers] to optimize, under a given constraint, a sensitivity metric that is calculated based on each… [layer’s] AQS value” (Yao, 3.4: “We can use an Integer Linear Programming (ILP) problem to formalize the problem definition of finding the bit-precision setting that has optimal trade-off as described next. Assume that we have B choices for quantizing each layer (i.e., 2 for INT4 or INT8). For a model with L layers, the search space of the ILP will be B^L. The goal of solving the ILP problem is to find the best bit configuration among these B^L possibilities that results in optimal trade-offs between model perturbation Ω, and user-specified constraints such as model size, BOPS, and latency. Each of these bit-precision settings could result in a different model perturbation. To make the problem tractable, we assume that the perturbations for each layer are independent of each other (i.e., Ω = Σ_{i=1}^{L} Ω_i^{b_i}, where Ω_i^{b_i} is the i-th layer’s perturbation with b_i bits). This allows us to precompute the sensitivity of each layer separately, and it only requires B·L computations. For the sensitivity metric, we use the Hessian based perturbation proposed in (Dong et al., 2020, Eq. 2.11). The ILP problem tries to find the right bit precision that minimizes this sensitivity, as follows: [see equations 8-11]”; the examiner notes that Σ_{i=1}^{L} Ω_i^{b_i} corresponds to “a sensitivity metric that is calculated based on each layer’s AQS value” as it is a summation of the quantization sensitivity (AQS) of each layer). Yao and the instant application both relate to hardware-aware mixed-precision quantization and are analogous.
It would have been obvious to one of ordinary skill in the art, prior to the effective filing date of the claimed invention, to have modified the combination of Zhang and Fan to include the step of “selecting bit-widths supported by the target hardware platform for corresponding quantization groups to optimize, under a given constraint, a sensitivity metric that is calculated based on each quantization group’s group AQS value” as disclosed by Yao, using the quantization groups and group AQS values disclosed by Zhang/Fan instead of individual layers and individual AQS values, and one would have been motivated to do so for the purpose of balancing the trade-off between model perturbation and other constraints, such as memory footprint and latency (see Yao, Abstract).

Regarding claim 2, the rejection of claim 1 is incorporated. Zhang as modified by Fan discloses “group AQS values” but does not appear to explicitly disclose the further limitations of the claim. However, Yao further discloses “performing Integer Linear Programming (ILP) to obtain a mixed-precision quantization configuration for the neural network that minimizes a total AQS under the given constraint, wherein the total AQS is a sum of all… [layer] AQS values in the neural network” (Yao, 3.4: “We can use an Integer Linear Programming (ILP) problem to formalize the problem definition of finding the bit-precision setting that has optimal trade-off as described next. Assume that we have B choices for quantizing each layer (i.e., 2 for INT4 or INT8). For a model with L layers, the search space of the ILP will be B^L. The goal of solving the ILP problem is to find the best bit configuration among these B^L possibilities that results in optimal trade-offs between model perturbation Ω, and user-specified constraints such as model size, BOPS, and latency. Each of these bit-precision settings could result in a different model perturbation. To make the problem tractable, we assume that the perturbations for each layer are independent of each other (i.e., Ω = Σ_{i=1}^{L} Ω_i^{b_i}, where Ω_i^{b_i} is the i-th layer’s perturbation with b_i bits). This allows us to precompute the sensitivity of each layer separately, and it only requires B·L computations. For the sensitivity metric, we use the Hessian based perturbation proposed in (Dong et al., 2020, Eq. 2.11). The ILP problem tries to find the right bit precision that minimizes this sensitivity, as follows: [see equations 8-11]”; the examiner notes that Σ_{i=1}^{L} Ω_i^{b_i} corresponds to “a total AQS” as it is a summation of the quantization sensitivity (AQS) of each layer).

Yao and the instant application both relate to hardware-aware mixed-precision quantization and are analogous. It would have been obvious to one of ordinary skill in the art, prior to the effective filing date of the claimed invention, to have modified the combination of Zhang and Fan to include the step of “performing Integer Linear Programming (ILP) to obtain a mixed-precision quantization configuration for the neural network that minimizes a total AQS under the given constraint, wherein the total AQS is a sum of all group AQS values in the neural network” as disclosed by Yao, using the group AQS values disclosed by Zhang/Fan instead of individual AQS values, and one would have been motivated to do so for the purpose of balancing the trade-off between model perturbation and other constraints, such as memory footprint and latency (see Yao, Abstract).
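As a concrete illustration of the quoted ILP formulation, adapted from per-layer to per-group decisions as the claims recite, the following sketch uses the open-source PuLP solver. All data values (group AQS numbers, MAC counts, the BOPS budget) are hypothetical placeholders, not drawn from the application or the cited references.

```python
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary

groups = ["g0", "g1", "g2"]   # quantization groups
bitwidths = [4, 8]            # bit-widths supported by the target hardware
macs = {"g0": 1_000_000, "g1": 2_000_000, "g2": 4_000_000}

# Hypothetical group AQS per candidate bit-width (fewer bits -> higher sensitivity).
aqs = {("g0", 4): 0.90, ("g0", 8): 0.20,
       ("g1", 4): 0.50, ("g1", 8): 0.10,
       ("g2", 4): 0.30, ("g2", 8): 0.05}

# Per-group BOPS = MACs * weight bits * activation bits (weights and
# activations share one bit-width in this simplified sketch).
bops = {(g, b): macs[g] * b * b for g in groups for b in bitwidths}
budget = 3e8  # total-BOPS constraint; forces at least one group down to 4 bits

prob = LpProblem("mixed_precision_bitwidths", LpMinimize)
x = {(g, b): LpVariable(f"x_{g}_{b}", cat=LpBinary) for g in groups for b in bitwidths}

# Objective: minimize total AQS, the sum of the selected group AQS values.
prob += lpSum(aqs[g, b] * x[g, b] for g in groups for b in bitwidths)
for g in groups:  # each group is assigned exactly one bit-width
    prob += lpSum(x[g, b] for b in bitwidths) == 1
prob += lpSum(bops[g, b] * x[g, b] for g in groups for b in bitwidths) <= budget

prob.solve()
config = {g: b for (g, b) in x if x[g, b].value() == 1}
print(config)  # with these numbers: {'g0': 8, 'g1': 8, 'g2': 4}
```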
Regarding claim 3, the rejection of claim 2 is incorporated. Zhang as modified by Fan and Yao further discloses “wherein the given constraint is one of a model size, latency, and total binary operations of the neural network” (Yao, 3.4: “…and user-specified constraints such as model size, BOPS, and latency” and Yao, 3.4: “Note that it is not necessary to set all these constraints at the same time. Typically, which constraint to use depends on the end-user application”). Yao and the instant application both relate to hardware-aware mixed-precision quantization and are analogous. It would have been obvious to one of ordinary skill in the art, prior to the effective filing date of the claimed invention, to have modified the combination of Zhang and Fan to include selecting bit-widths to optimize a sensitivity metric under a given constraint, the given constraint being one of a model size, latency, and total binary operations of the neural network as disclosed by Yao, and one would have been motivated to do so for the purpose of balancing the trade-off between model perturbation and other constraints, such as memory footprint and latency (see Yao, Abstract).

Regarding claim 4, the rejection of claim 1 is incorporated. Zhang as modified by Fan discloses quantization groups, but does not appear to disclose the further limitations of the claim. However, Yao further discloses “wherein the given constraint is total binary operations (BOPS), which limits a sum of per-…[layer] BOPS, and wherein per-…[layer] BOPS is calculated as the number of MAC operations multiplied by the bit-width of input activation and the bit-width of weights” (Yao, 3.4: equation 10; and Yao, 3.4: “…G_i^{b_i} is the corresponding BOPS required for computing that layer. The latter measures the total Bit Operations for calculating a layer (van Baalen et al., 2020): G_i^{b_i} = b_{w_i} · b_{a_i} · MAC_i, where MAC_i is the total Multiply-Accumulate operations for computing the i-th layer, and b_{w_i}, b_{a_i} are the bit precisions used for weight and activation”; the examiner notes that equation 10 depicts limiting a sum of G_i^{b_i}, which corresponds to “per-layer BOPS” as it represents the BOPS required for computing the i-th layer).

Yao and the instant application both relate to hardware-aware mixed-precision quantization and are analogous. It would have been obvious to one of ordinary skill in the art, prior to the effective filing date of the claimed invention, to have modified the combination of Zhang and Fan to include selecting bit-widths to optimize a sensitivity metric under a given constraint, the given constraint being BOPS, which limits a sum of per-group BOPS, and wherein per-group BOPS is calculated as the number of MAC operations multiplied by the bit-width of input activation and the bit-width of weights, as disclosed by Yao, using the quantization groups disclosed by Zhang/Fan, and one would have been motivated to do so for the purpose of minimizing the model perturbation while observing an application specific constraint on total bit operations (see Yao, Introduction).
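To make the quoted BOPS accounting concrete, here is a short worked example of G_i^{b_i} = b_{w_i} · b_{a_i} · MAC_i with hypothetical numbers (the helper name is ours, not Yao's):

```python
def group_bops(macs: int, weight_bits: int, activation_bits: int) -> int:
    """Bit operations for one group/layer: MAC count multiplied by the
    weight bit-width and the input-activation bit-width."""
    return macs * weight_bits * activation_bits

# A group with 1,000,000 MACs at W4A8 costs 32,000,000 bit operations,
# half of the 64,000,000 it would cost at W8A8.
assert group_bops(1_000_000, 4, 8) == 32_000_000
assert group_bops(1_000_000, 8, 8) == 64_000_000
```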
Regarding claim 5, the rejection of claim 1 is incorporated. Zhang as modified by Fan discloses quantization groups, but does not appear to explicitly disclose the further limitations of the claim. However, Yao further discloses “selecting an input activation bit-width and a weight bit-width for each…[layer] to optimize the sensitivity metric under the given constraint” (Yao, Section 4, Table 1: see Method- HAWQ-V3, Precision- W4/8A4/8, “Also, ‘WxAy’ means weight with x-bit and activation with y-bit, and 4/8 means mixed precision with 4 and 8 bits”; the examiner notes that Section 3.4 discloses selecting bit-widths to optimize the sensitivity metric under the given constraint- see rejection of claim 1). Yao and the instant application both relate to hardware-aware mixed-precision quantization and are analogous. It would have been obvious to one of ordinary skill in the art, prior to the effective filing date of the claimed invention, to have modified the combination of Zhang and Fan to include selecting an input activation bit-width and a weight bit-width for each quantization group to optimize a sensitivity metric under a given constraint, as disclosed by Yao, using the quantization groups disclosed by Zhang/Fan, and one would have been motivated to do so for the purpose of balancing the trade-off between model perturbation and other constraints, such as memory footprint and latency (see Yao, Abstract).

Regarding claim 9, the rejection of claim 1 is incorporated. Zhang as modified by Fan and Yao further discloses “wherein calculating the AQS value further comprises: calculating a first output of a convolution layer using input activation and weights at step t (Zhang, [n0102]: “obtaining the second output result of the network layer when only the network layer is quantized by the quantization configuration while other network layers are not quantized”; note that the examiner is interpreting “step t” to be a step at which the network layer is quantized); calculating a second output of the convolution layer using the input activation and the weights at step (t-1) (Zhang, [n0102]: “obtaining the first output result of the network layer when all network layers are not quantized”; the examiner notes that this output is obtained at step (t-1) because it occurs before quantization of the network layer); and measuring a difference between the first output and the second output to compute the AQS value of the convolution layer” (Zhang, [n0102]: “the quantization sensitivity parameter corresponding to a network layer is used to characterize the difference between the output result of the network layer before quantization and the output result after quantization”).
Regarding claim 10, the rejection of claim 1 is incorporated. Zhang as modified by Fan and Yao further discloses “wherein calculating the AQS value further comprises: calculating a first output of a convolution layer using input activation and weights at a first data precision (Zhang, [n0102]: “obtaining the second output result of the network layer when only the network layer is quantized by the quantization configuration while other network layers are not quantized”); calculating a second output of the convolution layer using the input activation and the weights at a second data precision, the second data precision using a larger bit-width than the first data precision (Zhang, [n0102]: “obtaining the first output result of the network layer when all network layers are not quantized”; the examiner notes that non-quantized layers will have a larger bit-width than quantized layers, see [n0004]: “quantize the full-precision parameters in the neural network into lower precision”); and measuring a difference between the first output and the second output to compute the AQS value of the convolution layer” (Zhang, [n0102]: “the quantization sensitivity parameter corresponding to a network layer is used to characterize the difference between the output result of the network layer before quantization and the output result after quantization”).
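A small sketch of the claim 10 style of AQS measurement (comparing the same convolution at two data precisions) follows. The uniform quantizer and the relative-error metric are assumptions for illustration, not the application's disclosed method.

```python
import numpy as np
from scipy.signal import convolve2d

def fake_quantize(x: np.ndarray, bits: int) -> np.ndarray:
    """Uniform symmetric quantize-dequantize to the given bit-width."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def aqs_two_precisions(activation: np.ndarray, kernel: np.ndarray,
                       low_bits: int = 4, high_bits: int = 8) -> float:
    """Run the same convolution with the input activation quantized at a
    lower and a higher precision, and measure the output difference."""
    out_low = convolve2d(fake_quantize(activation, low_bits), kernel, mode="same")
    out_high = convolve2d(fake_quantize(activation, high_bits), kernel, mode="same")
    return float(np.linalg.norm(out_low - out_high) / np.linalg.norm(out_high))

# Random tensors stand in for a real layer's activation and weights.
rng = np.random.default_rng(0)
print(aqs_two_precisions(rng.standard_normal((32, 32)), rng.standard_normal((3, 3))))
```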
Claim 12 is a system claim corresponding to method claim 1. Zhang discloses “A system operative to determine bit-widths for mixed-precision neural network computing on a target hardware platform, comprising: memory to store a neural network; and processing circuitry coupled to the memory and operative to…” (Zhang, [n0027]: “According to a third aspect of the present disclosure, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement a neural network quantization method according to any embodiment of the present disclosure”). The further limitations of the claim correspond to those of claim 1, and the remainder of the rejection follows the same rationale as the rejection of claim 1 above.

Regarding claim 13, the rejection of claim 12 is incorporated. Zhang as modified by Fan discloses quantization groups and “group AQS values,” but does not appear to explicitly disclose the further limitations of the claim. However, Yao further discloses “performing Integer Linear Programming (ILP) to obtain the bit-width for each… [layer] that optimizes a total AQS under the given constraint, wherein the total AQS is a sum of all… [layer] AQS values in the neural network” (Yao, 3.4: “We can use an Integer Linear Programming (ILP) problem to formalize the problem definition of finding the bit-precision setting that has optimal trade-off as described next. Assume that we have B choices for quantizing each layer (i.e., 2 for INT4 or INT8). For a model with L layers, the search space of the ILP will be B^L. The goal of solving the ILP problem is to find the best bit configuration among these B^L possibilities that results in optimal trade-offs between model perturbation Ω, and user-specified constraints such as model size, BOPS, and latency. Each of these bit-precision settings could result in a different model perturbation. To make the problem tractable, we assume that the perturbations for each layer are independent of each other (i.e., Ω = Σ_{i=1}^{L} Ω_i^{b_i}, where Ω_i^{b_i} is the i-th layer’s perturbation with b_i bits). This allows us to precompute the sensitivity of each layer separately, and it only requires B·L computations. For the sensitivity metric, we use the Hessian based perturbation proposed in (Dong et al., 2020, Eq. 2.11). The ILP problem tries to find the right bit precision that minimizes this sensitivity, as follows: [see equations 8-11]”; the examiner notes that Σ_{i=1}^{L} Ω_i^{b_i} corresponds to “a total AQS” as it is a summation of the quantization sensitivity (AQS) of each layer).

Yao and the instant application both relate to hardware-aware mixed-precision quantization and are analogous. It would have been obvious to one of ordinary skill in the art, prior to the effective filing date of the claimed invention, to have modified the combination of Zhang and Fan to include the step of “performing Integer Linear Programming (ILP) to obtain the bit-width for each quantization group that optimizes a total AQS under the given constraint, wherein the total AQS is a sum of all group AQS values in the neural network” as disclosed by Yao, using the quantization groups and group AQS values disclosed by Zhang/Fan instead of individual layers and AQS values, and one would have been motivated to do so for the purpose of balancing the trade-off between model perturbation and other constraints, such as memory footprint and latency (see Yao, Abstract).

Regarding claim 14, the rejection of claim 12 is incorporated. Claim 14 is a system claim corresponding to method claim 4 and is rejected for the same reasons as given in the rejection of claim 4 above.

Regarding claim 15, the rejection of claim 12 is incorporated. Claim 15 is a system claim corresponding to method claim 5 and is rejected for the same reasons as given in the rejection of claim 5 above.

Regarding claim 19, the rejection of claim 12 is incorporated. Claim 19 is a system claim corresponding to method claim 9 and is rejected for the same reasons as given in the rejection of claim 9 above.

Regarding claim 20, the rejection of claim 12 is incorporated. Claim 20 is a system claim corresponding to method claim 10 and is rejected for the same reasons as given in the rejection of claim 10 above.

Claims 6 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Zhang in view of Fan and Yao, and further in view of Qadeer et al. (US20210174172) (“Qadeer”).

Regarding claim 6, the rejection of claim 1 is incorporated. Zhang as modified by Fan and Yao further discloses “identifying one or more quantization groups to lower corresponding bit-widths under a[n]… accuracy criterion” (Zhang, [n0007]: “when the group of blocks is quantized using the quantization configurations in the target quantization configuration set for the group of blocks, the boundary values of the target performance parameters of the neural network do not exceed a preset parameter range corresponding to the target performance parameters” and Zhang, [n0016]: “Optionally, the target performance parameter includes at least one of the following: the output accuracy of the quantized neural network…”; the examiner notes that quantizing the group of blocks reads on lowering corresponding bit-widths, see [n0004]: “quantize the full-precision parameters in the neural network into lower precision”).
Zhang as modified by Fan and Yao does not appear to explicitly disclose that the accuracy criterion is “user-defined.” However, Qadeer discloses a “user-defined accuracy criterion” (Qadeer, [0025]: “Generally, the system can access or receive as input from a user a loss-of-accuracy threshold in association with a floating-point network. The loss-of-accuracy threshold can define an accuracy (as a value of the accuracy measure), a proportion of an original accuracy of the floating-point network that is desired for the quantized network, or an accuracy interval around the original accuracy of the floating-point network”). Qadeer and the instant application both relate to quantization of neural networks and are analogous. It would have been obvious to one of ordinary skill in the art, prior to the effective filing date of the claimed invention, to have modified the accuracy criterion disclosed by the combination of Zhang, Fan, and Yao, to be user-defined as disclosed by Qadeer, and one would have been motivated to do so for the purpose of generating a quantized network that satisfies a user’s desired accuracy (see Qadeer, [0027]).

Regarding claim 16, the rejection of claim 12 is incorporated. Claim 16 is a system claim corresponding to method claim 6, and is rejected for the same reasons as given in the rejection of claim 6 above.

Claims 8 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Zhang in view of Fan and Yao, and further in view of Li et al. (US20230196197) (“Li”).

Regarding claim 8, the rejection of claim 1 is incorporated. Zhang as modified by Fan and Yao discloses the group AQS value, and AQS values of convolution layers in the quantization group, but does not appear to explicitly disclose “wherein the group AQS value is the maximum AQS value of all convolution layers in the quantization group.” However, Li discloses “wherein the group… [sequence number] value is the maximum… [sequence number] value of all… layers in the…group” (Li, [0012]: “In some embodiments, the determining the group with the least cost value from the groups with the group available resources greater than or equal to the layer required resources of the target layer as the group in which the target layer is located includes: determining the maximum sequence number of each group; the maximum sequence number of any of the groups being the maximum value of the sequence numbers of all to-be-calibrated layers in the group, and the sequence number of any of the to-be-calibrated layers being the order of the to-be-calibrated layer in the model according to the preset processing sequence”). Li and the instant application both relate to quantization of neural networks and are analogous. It would have been obvious to one of ordinary skill in the art, prior to the effective filing date of the claimed invention, to have modified the combination of Zhang, Fan, and Yao to have the group AQS value be the maximum AQS value of all convolution layers in the quantization group, in the manner disclosed by Li, and one would have been motivated to do so for the purpose of reducing the number of calibration operations (the examiner notes that “calibration” corresponds to determining the quantization factor, see [0004]) to increase calculation speed during the calibration of a model (see Li, Abstract).

Regarding claim 18, the rejection of claim 12 is incorporated. Claim 18 is a system claim corresponding to method claim 8, and is rejected for the same reasons as given in the rejection of claim 8 above.
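For completeness, claim 8's max-based group AQS, and the group-quality expression quoted in the §101 analysis of claim 7, reduce to a couple of lines. This is a sketch; the function and variable names are ours.

```python
import math

def group_aqs_max(member_aqs: list[float]) -> float:
    """Claim 8 variant: the group AQS is the maximum AQS value among
    the convolution layers in the quantization group."""
    return max(member_aqs)

def group_quality(macs: int, group_aqs: float) -> float:
    """Quality metric log(MAC + 1) / (group AQS) per claim 7: groups
    with many MACs and low sensitivity score highest, making them the
    best candidates for lowering bit-widths."""
    return math.log(macs + 1) / group_aqs
```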
Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Zhang in view of Fan and Yao, and further in view of Wolf et al. (US20200279393) (“Wolf”).

Regarding claim 11, the rejection of claim 1 is incorporated. Zhang as modified by Fan and Yao does not appear to disclose the further limitations of the claim. However, Wolf discloses “grouping operation (OP) layers that share a same input activation” (Wolf, [0046-0047]: “…when the CNNs in the object detectors exhibit layers with groups of layers with identical characteristics (i.e., exhibit the same filters sizes, strides and activation functions and are ordered the same), then the CNNs of the objects detectors are configured to have common layers. The term ‘group of layers with identical characteristics’ herein above and below relate to groups of layers, where the layers in each group exhibits the same filters sizes, strides and activation functions and the layers in the groups are ordered the same”). Wolf and the instant application both relate to neural networks and are analogous. It would have been obvious to one of ordinary skill in the art, prior to the effective filing date of the claimed invention, to have modified the grouping the one or more convolution layers into the quantization group disclosed by the combination of Zhang, Fan, and Yao, to include grouping operation layers that share a same input activation as disclosed by Wolf, and one would have been motivated to do so for the purpose of grouping similar layers to reduce the number of computations needed to be performed (see Wolf, Abstract).

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to GWYNEVERE A DETERDING, whose telephone number is (571) 272-7657. The examiner can normally be reached Mon-Fri, 7:30am-5pm.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kamran Afshar, can be reached at (571) 272-7796. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/G.A.D./
Examiner, Art Unit 2125

/KAMRAN AFSHAR/
Supervisory Patent Examiner, Art Unit 2125

Prosecution Timeline

Jun 29, 2022: Application Filed
Jan 30, 2026: Non-Final Rejection — §101, §103, §112 (current)


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 100%
With Interview: 99% (+100.0%)
Median Time to Grant: 3y 3m
PTA Risk: Low
Based on 2 resolved cases by this examiner. Grant probability derived from career allow rate.
