Last updated: May 04, 2026
Application No. 17/090,542
SYSTEMS AND METHODS FOR AUTOMATIC MIXED-PRECISION QUANTIZATION SEARCH

Non-Final OA §103
Filed
Nov 05, 2020
Priority
Oct 14, 2020 — provisional 63/091,690
Examiner
NGUYEN, TRI T
Art Unit
2128
Tech Center
2100 — Computer Architecture & Software
Assignee
Samsung Electronics Co., Ltd.
OA Round
5 (Non-Final)
Interview Optional

— +13.2% interview lift. Interview already conducted in this application's prosecution history. This examiner has a 68% grant rate with +13.2% interview lift. Since an interview has already been tried, recommend written response with narrowed claims based on precedent claim evolution patterns.
Based on 183 resolved cases, 2023–2026
Examiner Intelligence

NGUYEN, TRI T View full profile →
Grants 68% — above average
Career Allowance Rate
125 granted / 183 resolved
+13.3% vs TC avg
Moderate +13% lift
Without
With
+13.2%
Interview Lift
resolved cases with interview
Typical timeline
3y 10m
Avg Prosecution
31 currently pending
Career history
214
Total Applications
across all art units
Statute-Specific Performance

§101
15.6%
-24.4% vs TC avg
§103
57.8%
+17.8% vs TC avg
§102
7.1%
-32.9% vs TC avg
§112
14.1%
-25.9% vs TC avg
Black line = Tech Center average estimate • Based on career data from 183 resolved cases
Office Action

§103
DETAILED ACTION
Notice of Pre-AIA  or AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 12/22/2025 has been entered.

Response to Amendment
The amendment filed 11/11/2025 has been entered. Claims 1-20 remain pending in the application. 

Response to Arguments
Applicant’s arguments, filed 11/11/2025, with respect to the rejections of the claims under 103 have been addressed on the Advisory Action mailed 12/22/2025. Also, because of the claim amendments, a new ground(s) of rejection is made in view of Wu et al. (NPL: Mixed Precision Quantization of ConvNets via Differentiable Neural Architecture Search) in view of Nachum et al. (US Pub. 2019/0147339) and further in view of Saito et al. (US Pub. 2007/0245294).

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Wu et al. (NPL: Mixed Precision Quantization of ConvNets via Differentiable Neural Architecture Search) in view of Nachum et al. (US Pub. 2019/0147339) and further in view of Saito et al. (US Pub. 2007/0245294).
As per claim 1, Wu teaches a machine learning method using a machine learning model residing on an electronic device, the method comprising [page 1, abstract, “Recent work in network quantization has substantially reduced the time and space complexity of neural network inference, enabling their deployment on embedded and mobile devices with limited computational and memory resources”; Examiner's Note: Neural networks are interpreted as a type of machine learning that executes on the recited electronic devices with limited computational and memory resources. Also, see Fig. 1 of Wu and Fig. 9 of the present application, e.g., paragraph 0028, “FIG. 9 illustrates an example architecture searching model in accordance with various embodiments of this disclosure"]: 
receiving an inference request by the electronic device [Fig. 1, page 3, section 4.2, “The operator takes the data tensor at vi as its input”; Examiner's Note: Wu recites a data tensor input that examiner interprets as an inference request because input to the machine learning model will generate an inference]; 
determining, using the machine learning model, an inference result for the inference request using a selected inference path in the machine learning model [Fig. 1, "Executed edges" and "Not executed edges"; page 3, section 4.2, “The operator takes the data tensor at vi as its input and computes its output as eijk (vi; wijk)”; Examiner's Note: Wu teaches "Executed edges" that is considered to correspond to using a selected inference path. Also, see Fig. 1 of Wu and Fig. 9 of the present application, e.g., paragraph 0028, “FIG. 9 illustrates an example architecture searching model in accordance with various embodiments of this disclosure"], wherein:
the selected inference path is selected based on a highest probability that a result is accurate for each layer of the machine learning model [Figs. 1 and 2, “Edge Probability Pϴ1,2”; page 2, “the probability of execution is parameterized by some architecture parameters ϴ”; page 4, section 4.3, “To address this issue, we use Gumbel Softmax proposed by Jang et al. (2016); Maddison et al. (2016) to control the edge selection”; page 5, section 4.4, “we train the architecture parameter ϴ, to increase the probability to sample those edges with better performance, and to suppress those with worse performance”; Examiner's Note: As taught in Equation 4 on page 4 of Wu, edge probability Pϴij is calculated with the Soft max function performed on ϴ to "control edge selection" in order to increase the probability to sample edges with better performance, and to suppress those with worse performance based on a highest probability for each layer. See also Fig. 1 of Wu and Fig. 9 of the present application, e.g., paragraph 0028, "FIG. 9 illustrates an example architecture searching model in accordance with various embodiments of this disclosure"; paragraph [0098] of the present applicant recites, "The processor, may select the path having the highest probability" and recites using the Gumbel Soft max function previously disclosed by Wu]; and 
a size of the machine learning model is reduced corresponding to constraints imposed by the electronic device [page 1, abstract, “Recent work in network quantization has substantially reduced the time and space complexity of neural network inference, enabling their deployment on embedded and mobile devices with limited computational and memory resources”; page 6, section 6.1, “We start by focusing on reducing model size, since smaller models require less storage and communication cost, which is important for mobile and embedded devices”; page 5, section 4.4, “Therefore, we train the architecture parameter ϴ, to increase the probability to sample those edges with better performance, and to suppress those with worse performance”; Examiner's Note: Wu is considered to suppress its edges with worse performance to effectively prune its model to enable it to run on an electronic device. By pruning the machine learning model, its size is reduced corresponding to the constraints imposed by the electronic device so it can run on the electronic device. See also Fig. 1 of Wu and Fig. 9 of the present application, e.g., paragraph 0028, "FIG. 9 illustrates an example architecture searching model in accordance with various embodiments of this disclosure"; Paragraph [0095] “In some embodiments, paths or edges that are not selected can be pruned from the model to further decrease the size and increase the speed of the model”]; and 
executing an action in response to the inference result [page 1, section 1.1, “ConvNets have become the de-facto method in a wide range of computer vision tasks”; Examiner's Note: Thus, one action executed in response to the inference result after performing a computer vision application, such as pattern recognition for identification of objects, is displaying the results of the pattern recognition. See also Fig. 1 of Wu and Fig. 9 of the present application, e.g., paragraph 0028, "FIG. 9 illustrates an example architecture searching model in accordance with various embodiments of this disclosure"];
wherein the machine learning model is trained by:
splitting parameters of the machine learning model into groups, wherein each group is associated with a specified layer of the machine learning model [page 3, section 3, “Normally 32-bit (full-precision) floating point numbers are used to represent weights and activations of neural nets. Quantization projects full-precision weights and activations to fixed-point numbers with lower bit-width, such as 8, 4, and 1 bit. We follow DoReFa-Net (Zhou et al. (2016)) to quantize weights and PACT (Choi et al. (2018)) to quantize activations”; page 6, section 6.1, “We start by focusing on reducing model size, since smaller models require less storage and communication cost, which is important for mobile and embedded devices”; Examiner's Note: As shown in Figure 1 of Wu, each layer is associated with a separate group of para meters e.g., v, e, and P; page 2, Fig. 1 discloses “Each layer of the super net contains several parallel edges representing convolution operators with quantized weights and activations with different precisions”; See aIso Fig. 1 of Wu and Fig. 9 of the present application, e.g., paragraph 0028, " FIG. 9 illustrates an example architecture searching model in accordance with various embodiments of this disclosure"];
for each group, searching for a respective quantization bit providing a highest measured probability [page 3, section 3, “Normally 32-bit (full-precision) floating point numbers are used to represent weights and activations of neural nets. Quantization projects full-precision weights and activations to fixed-point numbers with lower bit-width, such as 8, 4, and 1 bit. We follow DoReFa-Net (Zhou et al. (2016)) to quantize weights and PACT (Choi et al. (2018)) to quantize activations”; page 5, section 5, We use the DNAS framework to solve the mixed precision quantization problem – deciding the optimal layer-wise precision assignment; Fig. 2, One layer of a super net for mixed precision quantization of a ConvNet. Nodes in the super net represent feature maps, edges represent convolution operators with different bit-widths; Examiner's Note: Figure 2 of Wu shows the use of "Edge Probability" in mixed precision quantization. Wu teaches using edge probability to replace 32-bit floating point numbers with fixed-point numbers (integers). See also Fig. 1 of Wu and Fig. 9 of the present application, e.g., paragraph 0028, " FIG. 9 illustrates an example architecture searching model in accordance with various embodiments of this disclosure"]; and
including the constraints within a loss function for backpropagation through the machine learning model [page 2, “we need to back propagate gradients through discrete random variables that control the stochastic edge execution”; page 4, section 4.3, “the gradient of the loss function with respect to ϴ can be computed as 

    PNG
    media_image1.png
    52
    546
    media_image1.png
    Greyscale
            (8)”; 
the equation (8) above discloses the loss function L (ma, wa); 
wherein, page 5, section 5, “we define the loss function as 
L(a, wa) = CrossEntropy(a) x C(Cost(a))”, 
and, page 6, section 5, recite “

    PNG
    media_image2.png
    46
    430
    media_image2.png
    Greyscale

where #FLOP(.) denotes the number of floating point operations (speed constraint, for example, less operations result to increasing the speed of the model)
and, 
“To compress the model size, we define the cost as

    PNG
    media_image3.png
    48
    358
    media_image3.png
    Greyscale
”
	Where, #PARAM(.) denotes the number of parameters of a convolution operator and weight-bit(.) denotes the bit-width of the weight.”
Based on the citing above, it can be seen that the loss function including the constraints such as size and/or speed constraints, etc.,
See also Fig. 1 of Wu and Fig. 9 of the present application, e.g., paragraph 0028, " FIG. 9 illustrates an example architecture searching model in accordance with various embodiments of this disclosure"]. 
Paragraph 0095 of the specification of the current Application recites “because of the fully back propagating nature of the model 900, one or more constraints can be added into the loss function such as a size constraint, an accuracy constraint, and/or an inference speed constraint … As a particular example, a loss function with an added size constraint and inference speed constraint might be expressed as follows:

    PNG
    media_image4.png
    30
    256
    media_image4.png
    Greyscale

FLOPs is the measurement of how many calculations are needed for an inference.
Wu does not teach
at least one of the constraints included in the loss function is prioritized over at least one other of the constraints included in the loss function, and wherein the at least one constraint is prioritized based on available resources for the electronic device via an initial setting of one or more values of the constraints in relation to each other within the loss function. 
Nachum teaches
at least one of the constraints included in the loss function is prioritized over at least one other of the constraints included in the loss function, and wherein the at least one constraint is prioritized based on available resources for the electronic device [paragraphs 0011-0013, “adaptively adjusting weights of one or more terms of the shrinking engine loss function that penalize active neurons of the neural network during training comprises: determining that a constraint is not met at a particular training iteration; paragraph 0018, “the set of one or more constraints include one or more of: a constraint on a maximum number of active neurons in the neural network; a constraint on a maximum inference latency of the neural network; a constraint on maximum power consumption of the neural network; a constraint on a maximum memory footprint of the neural network”; paragraph 0072, “in response to determining that a particular constraint of the set of constraints is not satisfied by the neural network during the training iteration, the engine may increase the values of the particular A-factors of the Joss function that penalize active neurons of the neural network. For example, in response to determining that a constraint on the maximum number of active neurons in the neural network is not met (e.g., because the number of active neurons in the neural network exceeds the maximum number specified by the constraint), the engine may uniformly increase (e.g., multiply by a fixed scaling factor) the values of the particular A-factors. As another example, in response to determining that a constraint on the maximum inference latency of the neural network is not met (e.g., because the inference latency of the neural network exceeds the maximum inference latency specified by the constraint), the engine may increase the values of the particular A-factors such that the increase in the value of the A-factor of a term is based on the number of operations induced by the neuron corresponding to the term. Specifically, the increase in the values of the A-factors of terms that correspond to neurons which induce a larger number of operations may be greater than the increase in the A-factors of terms that correspond to neurons which induce a smaller number of operations. Since the number of operations induced by the neurons of the neural network is directly related to the inference latency of the neural network, adaptively increasing the particular A-factors in this manner may contribute to the neural network satisfying the maximum inference latency constraint in future training iterations.”; It can be seen that the constraint on the latency is not met because the system needs to perform a large number of operations induce by a large number of neurons (for example, the number of active neurons exceeds the maximum number specified by the constraint). So, when either a constraint on the maximum number of active neurons in the neural network or a constraint on the maximum inference latency of the neural network is not met, the number of active neurons is adjusted (adjusting the weight factor that penalize active neurons/setting the weight to zero), and therefore, the constraint on the maximum number of active neurons is prioritized over the constraint on the maximum inference latency, because, limiting the number of active neurons in the neural network would reduce the number of operations induce by the neurons which could result in increasing the speed of training the network];
Wu in abstract, introduction and pages 2-4, teaches the method for finding an optimal assignment of precisions to reduce the model size, Wu also teaches a loss function comprises the constraints such as size constraint or speed constraint, etc., Wu, however, is silent of “at least one of the constraints included in the loss function is prioritized over at least one other of the constraints included in the loss function”. Nachum is added to fill in this missing element.
Nachum in Fig. 3, paragraphs 0011-0013, 0018 and 0072 teaches the loss function comprising the constraints, wherein, "the constraint is on a maximum number of active neurons in the neural network, the constraint is on a maximum inference latency of the neural network, a constraint on maximum power consumption of the neural network, etc.,"
Nachum further teaches when “a constraint on the maximum number of active neurons in the neural network is not met (e.g., because the number of active neurons in the neural network exceeds the maximum number specified by the constraint)”, or when “a constraint on the maximum inference latency of the neural network is not met”, the number of active neurons, which is one of the constraints, is adjusted, because the latency depends on the number of neurons, reducing the number of active neurons would help solving both the maximum number of active neurons constraint and the maximum inference latency constraint, thus, the number of active neurons constraint is prioritized over other constraints (inference latency constraint for example)
The combination of Wu and Nachum teaches the loss function comprises multiple constraints, wherein, one of the constraints included in the loss function (the number of active neurons constraint) is prioritized over other constraints (adjusting the number of active neurons when a certain constraint is not met) based on available resources for the electronic device such as model size or storage space.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have modified the quantization method of Wu to include at least one of the constraints included in the loss function is prioritized over at least one other of the constraints via a modification of one or more values of the constraints in relation to each other within the loss function of Nachum. Doing so would help adjusting the parameters associated with the loss function so that a set of one or more constraints are satisfied (Nachum, 0073).
Wu and Nachum do not teach
at least one of the constraints is prioritized over at least one other of the constraints, wherein the at least one constraint is prioritized via an initial setting of one or more values of the constraints in relation to each other.
Saito teaches
at least one of the constraints is prioritized over at least one other of the constraints, wherein the at least one constraint is prioritized via an initial setting of one or more values of the constraints in relation to each other [paragraph 0007, “in the technical field of embedded computers, it is essential to build a system in a short term while satisfying constraints such as circuit area size, timing, performance, and power consumption”; paragraph 0065, “accepts an input by the user concerning priorities of the respective constraints such as the power consumption, the circuit area size, and the processing time. Specifically, in the case where there are plural constraints, setting priorities to the respective constraints … if there are two constraints i.e. the processing time and the power consumption, the priority of the processing time can be set higher than the priority of the power consumption. If there are constraints that the processing time is 0.5 second or shorter, and the power consumption is 2W or less, the judger 17 judges that the target system satisfies the constraint, despite that the specification value calculator 16 calculates that the processing time required for the entirety of the target system is 0.4 second, and the power consumption is 3 W. In other words, since the priority of the processing time is higher than the priority of the power consumption, the judger 17 judges that the target system satisfies the constraint, as far as the calculated processing time satisfies the constraint concerning the processing time, although the calculated power consumption does not satisfy the constraint concerning the power consumption”; It can be seen that the processing time constraint is prioritized over the power consumption constraint, and the initial setting of the constraints are .4 second and 2W respectively].
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have modified the quantization method of Wu to include at least one of the constraints is prioritized over at least one other of the constraints, wherein the at least one constraint is prioritized via an initial setting of one or more values of the constraints of Saito. Doing so would help determining whether the constraints are satisfied based on the priority and the setting values of the constraints (Saito, 0065).

As per claim 2, Wu, Nachum and Saito teach the method of Claim 1.
Wu further teaches
training the machine learning model reduces the size of the machine learning model [page 1, abstract, “Recent work in network quantization has substantially reduced the time and space complexity of neural network inference, enabling their deployment on embedded and mobile devices with limited computational and memory resources”; page 2, “Quantizing weights can reduce the model size of the network and therefore reduce storage space and over-the-air communication cost."; page 6, 6.1, “We start by focusing on reducing model size, since smaller models require less storage and communication cost, which is important for mobile and embedded devices”; page 2, section 2, “The super net is trained and edges with the highest coefficients are kept to form the child network”; page 6, 6.1, “1) All of our most accurate models ... still achieves 11.6 -12.5X model size reduction. 2) Our most efficient models can achieve 16.6 - 20.3X model size compression ...”; Examiner's Note: Wu teaches reducing machine learning model size by training. See also Fig. 1 of Wu and Fig. 9 of the present application, e.g., paragraph 0028, “FIG. 9 illustrates an example architecture searching model in accordance with various embodiments of this disclosure"];
the parameters of the machine learning model include floating point values [page 3, section 3, “Normally 32-bit (full-precision) floating point numbers are used to represent weights and activations of neural nets”]; and
the quantization bit of each group is used to replace the floating point values of the parameters of each group with integer values [page 3, section 3, “Normally 32-bit (full-precision) floating point numbers are used to represent weights and activations of neural nets. Quantization projects full-precision weights and activations to fixed-point numbers with lower bit-width, such as 8, 4, and 1 bit”; page 5, section 5, “We use the DNAS framework to solve the mixed precision quantization problem – deciding the optimal layer-wise precision assignment”; Fig. 2, “One layer of a super net for mixed precision quantization of a ConvNet. Nodes in the super net represent feature maps, edges represent convolution operators with different bit-widths”].

As per claim 3, Wu, Nachum and Saito teach the method of Claim 2.
Wu further teaches
each respective quantization bit comprises a bit value [page 3, section 3, “Normally 32-bit (full-precision) floating point numbers are used to represent weights and activations of neural nets. Quantization projects full-precision weights and activations to fixed-point numbers with lower bit-width, such as 8, 4, and 1 bit”; page 6, Fig. 2, “One layer of a super net for mixed precision quantization of a ConvNet. Nodes in the super net represent feature maps, edges represent convolution operators with different bit-widths”; Examiner's Note: Wu teaches quantization of different bit values. See also Fig. 1 of Wu and Fig. 9 of the present application, e.g., paragraph 0028, “FIG. 9 illustrates an example architecture searching model in accordance with various embodiments of this disclosure"]; 
searching for the respective quantization bit comprises performing mixed bit quantization [page 9, section 7, “In this work we focus on the problem of mixed precision quantization of a ConvNet to determine its layer-wise bit-widths”; page 1, section 1, “For a ConvNet with N layers and M candidate precisions in each layer, we want to find an optimal assignment of precisions to minimize the cost in terms of model size, memory footprint or computation, while keeping the accuracy”; page 1, “The idea is illustrated in Fig. 1. The problem of neura I architecture search (NAS) aims to find the optimal neural net architecture in a given search space”; Examiner's Note: Wu teaches searching to find "optimal" mixed bit quantization. See also Fig. 1 of Wu and Fig. 9 of the present application, e.g., paragraph 0028, “FIG. 9 illustrates an example architecture searching model in accordance with various embodiments of this disclosure"]; and 
performing the mixed bit quantization comprises [page 9, section 7, “In this work we focus on the problem of mixed precision quantization of a ConvNet to determine its layer-wise bit-widths”; page 7, section 6.2, “Similar to the CIFARlO experiments, we conduct mixed precision search at the block level”; Examiner's Note: Wu teaches mixed bit quantization]:
replacing a portion of the floating point values of the parameters for at least one of the groups with integer values corresponding to a first bit value [page 3, section 3, “Normally 32-bit (full-precision) floating point numbers are used to represent weights and activations of neural nets. Quantization projects full-precision weights and activations to fixed-point numbers with lower bit-width, such as 8, 4, and 1 bit … Previous methods use the same precision for all or most of the layers. We expand the design space by choosing different precision assignment from M candidate precisions at N different layers”; Examiner's Note: Some floating point numbers are replaced with fixed-point (integer) numbers in Wu]; and 
replacing another portion of the floating point values of the parameters for the at least one of the groups with integer values corresponding to a second bit value [page 3, section 3, “Normally 32-bit (full-precision) floating point numbers are used to represent weights and activations of neural nets. Quantization projects full-precision weights and activations to fixed-point numbers with lower bit-width, such as 8, 4, and 1 bit … Previous methods use the same precision for all or most of the layers. We expand the design space by choosing different precision assignment from M candidate precisions at N different layers”; Examiner's Note: Some floating point numbers are replaced with fixed-point (integer) numbers in Wu].  

As per claim 4, Wu, Nachum and Saito teach the method of Claim 3.
Wu further teaches
 wherein performing the mixed bit quantization further comprises [page 9, section 7, “In this work we focus on the problem of mixed precision quantization of a ConvNet to determine its layer-wise bit-widths”; page 7, section 6.2, “Similar to the CIFARlO experiments, we conduct mixed precision search at the block level”; Examiner's Note: Wu teaches mixed bit quantization]: 
determining the first bit value and the second bit value based on the searching for the respective quantization bits [page 3, section 3, “Normally 32-bit (full-precision) floating point numbers are used to represent weights and activations of neural nets. Quantization projects full-precision weights and activations to fixed-point numbers with lower bit-width, such as 8, 4, and 1 bit”; page 6, Fig. 2, “One layer of a super net for mixed precision quantization of a ConvNet. Nodes in the super net represent feature maps, edges represent convolution operators with different bit-widths”; page 9, section 7, “In this work we focus on the problem of mixed precision quantization of a ConvNet to determine its layer-wise bit-widths”; page 7, section 6.2, “Similar to the CIFARlO experiments, we conduct mixed precision search at the block level”; Examiner's Note: The Examiner interprets the mixed precision search at the block level of Wu to correspond to determining the first bit value and the second bit value based on the searching for the respective quantization bits. See also Fig. 1 of Wu and Fig. 9 of the present application, e.g., paragraph 0028, “FIG. 9 illustrates an example architecture searching model in accordance with various embodiments of this disclosure"]; and 
assigning the first bit value and the second bit value to the portion of the floating point values and the other portion of the floating point values, respectively, based on the highest measured probability [page 3, section 3, “Normally 32-bit (full-precision) floating point numbers are used to represent weights and activations of neural nets. Quantization projects full-precision weights and activations to fixed-point numbers with lower bit-width, such as 8, 4, and 1 bit”; page 6, Fig. 2, “One layer of a super net for mixed precision quantization of a ConvNet. Nodes in the super net represent feature maps, edges represent convolution operators with different bit-widths”; Examiner's Note: Figure 2 of Wu shows the use of "Edge Probability" in mixed precision quantization. Wu teaches using edge probability to replace 32-bit floating point numbers with fixed-point numbers (integers). See also Fig. 1 of Wu and Fig. 9 of the present application, e.g., paragraph 0028, “FIG. 9 illustrates an example architecture searching model in accordance with various embodiments of this disclosure"].  

As per claim 5, Wu, Nachum and Saito teach the method of Claim 4.
Wu further teaches
the integer values corresponding to the second bit value are zeros [page 6, section 5, “mijk is the edge selection mask … Note that in a candidate architecture, mijk have binary values {0; 1}”; page 4, section 4.2, “we execute edge eijk when mijk is sampled to be 1”; page 3, section 2, “model compression through network pruning”; Examiner's Note: Edge selection mask is either 1, to permit execution, or 0, to prevent execution. Therefore, Wu teaches integer values corresponding to the second bit value are zeros. See also Fig. 1 of Wu and Fig. 9 of the present application, e.g., paragraph 0028, “FIG. 9 illustrates an example architecture searching model in accordance with various embodiments of this disclosure"].  

As per claim 6, Wu, Nachum and Saito teach the method of Claim 5.
Wu further teaches
the size of the machine learning model is further reduced by changing one or more parameters of at least one of the groups into zeros in parallel with searching for the respective quantization bits [page 6, section 5, “mijk is the edge selection mask … Note that in a candidate architecture, mijk have binary values {0; 1}”; page 4, section 4.2, “we execute edge eijk when mijk is sampled to be 1”; page 3, section 2, “model compression through network pruning”; page 7, section 6.2, “Similar to the CIFARlO experiments, we conduct mixed precision search at the block level”; Examiner's Note: The size of the trained machine learning model is reduced by setting edge selection bits to zero untied to quantization, thus capable of being performed in parallel. See also Fig. 1 of Wu and Fig. 9 of the present application, e.g., paragraph 0028, “FIG. 9 illustrates an example architecture searching model in accordance with various embodiments of this disclosure"].  

As per claim 7, Wu, Nachum and Saito teach the method of Claim 2.
Wu further teaches
each layer of the machine learning model comprises a plurality of edges [page 2, section 1, “Each layer of the super net contains several parallel edges representing convolution operators with quantized weights and activations with different precisions”; Examiner's Note: Wu teaches each layer of the model comprises a plurality of edges. See also Fig. 1 of Wu and Fig. 9 of the present application, e.g., paragraph 0028, “FIG. 9 illustrates an example architecture searching model in accordance with various embodiments of this disclosure"]; and  
for each group, searching for the respective quantization bit comprises:
identifying, using back propagation, an edge from among the plurality of edges in one of the layers of the machine learning model, wherein the identified edge is associated with the highest probability [page 2, section 1, “We show that using DNAS to search for layer-wise precision assignments for Res Net models on ClFAR1O and ImageNet, we surpass the state-of-the-art compression”; page 5, section 4.4, “Therefore, we train the architecture parameter ϴ, to increase the probability to sample those edges with better performance, and to suppress those with worse performance”; page 2, section 1, “We solve for the optimal architecture parameter ϴ by training the stochastic super net with SGD with respect to both the network's weights and the architecture parameter 8. To compute the gradient of ϴ, we need to back propagate gradients through discrete random variables that control the stochastic edge execution. To address this, we use the Gumbel Soft Max function (Jang et al. (2016)) to "soft-control" the edges”; Examiner's Note: Wu teaches back propagation for identifying an edge having a highest probability of execution. See also Fig. 1 of Wu and Fig. 9 of the present application, e.g., paragraph 0028, “FIG. 9 illustrates an example architecture searching model in accordance with various embodiments of this disclosure"]; and 
selecting the identified edge for an associated group, wherein the respective quantization bit comprises a bit value associated with the selected identified edge [page 2, section 1, “And for a pair of nodes vi; vj that are connected by Kij candidate edges, we only select one edge”; page 3, section 3, “Normally 32-bit (full-precision) floating point numbers are used to represent weights and activations of neural nets. Quantization projects full-precision weights and activations to fixed-point numbers with lower bit-width, such as 8, 4, and 1 bit”; page 6, Fig. 2, “One layer of a super net for mixed precision quantization of a ConvNet. Nodes in the super net represent feature maps, edges represent convolution operators with different bit-widths”; Examiner's Note: Wu teaches selecting an edge in a layer having quantization. See also Fig. 1 of Wu and Fig. 9 of the present application, e.g., paragraph 0028, “FIG. 9 illustrates an example architecture searching model in accordance with various embodiments of this disclosure"]. 

As per claim 8, Wu, Nachum and Saito teach the method of Claim 1.
Nachum further teaches
the constraints imposed by the electronic device and included in the loss function include two or more of: a size constraint, an inference speed constraint, and an accuracy constraint [paragraphs 0011-0013, “adaptively adjusting weights of one or more terms of the shrinking engine loss function that penalize active neurons of the neural network during training comprises: determining that a constraint is not met at a particular training iteration; paragraph 0018, “the set of one or more constraints include one or more of: a constraint on a maximum number of active neurons in the neural network; a constraint on a maximum inference latency of the neural network; a constraint on maximum power consumption of the neural network; a constraint on a maximum memory footprint of the neural network”; paragraph 0028, “The training system as described in this specification enables the structure and parameters of neural networks to be determined in accordance with performance constraints (e.g. accuracy) and cost constraints (e.g. memory constraints and/or inference latency constraints)”; examiner interprets the constraint on a maximum number of active neurons in the neural network] as a size constraint; and
one of the size constraint, the inference speed constraint, or the accuracy constraint is prioritized over others of the constraints [paragraph 0072, “in response to determining that a particular constraint of the set of constraints is not satisfied by the neural network during the training iteration, the engine may increase the values of the particular A-factors of the Joss function that penalize active neurons of the neural network. For example, in response to determining that a constraint on the maximum number of active neurons in the neural network is not met (e.g., because the number of active neurons in the neural network exceeds the maximum number specified by the constraint), the engine may uniformly increase (e.g., multiply by a fixed scaling factor) the values of the particular A-factors. As another example, in response to determining that a constraint on the maximum inference latency of the neural network is not met (e.g., because the inference latency of the neural network exceeds the maximum inference latency specified by the constraint), the engine may increase the values of the particular A-factors such that the increase in the value of the A-factor of a term is based on the number of operations induced by the neuron corresponding to the term. Specifically, the increase in the values of the A-factors of terms that correspond to neurons which induce a larger number of operations may be greater than the increase in the A-factors of terms that correspond to neurons which induce a smaller number of operations. Since the number of operations induced by the neurons of the neural network is directly related to the inference latency of the neural network, adaptively increasing the particular A-factors in this manner may contribute to the neural network satisfying the maximum inference latency constraint in future training iterations.”; It can be seen that the constraint on the latency is not met because the system needs to perform a large number of operations induce by a large number of neurons (for example, the number of active neurons exceeds the maximum number specified by the constraint), therefore, the constraint on the maximum number of active neurons is prioritized over the constraint on the maximum inference latency, because, limiting the number of active neurons in the neural network would reduce the number of operations induce by the neurons which could result in increasing the speed of training the network]. 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have modified the quantization method of Wu to include the constraints in the loss function include two or more of: a size constraint, an inference speed constraint, and an accuracy constraint, and at least one of the constraints included in the loss function is prioritized over at least one other of the constraints included in the loss function of Nachum. Doing so would help adjusting the parameters associated with the loss function that penalizing the neurons/nodes of the neural network (Nachum, 0011).

As per claim 9, Wu teaches an electronic device comprising: at least one memory configured to store a machine learning model; and at least one processor coupled to the at least one memory, the at least one processor configured to [page 1, abstract, “Recent work in network quantization has substantially reduced the time and space complexity of neural network inference, enabling their deployment on embedded and mobile devices with limited computational and memory resources”; Examiner's Note: Neural networks are interpreted as a type of machine learning that executes on the recited electronic devices with limited computational and memory resources. Also, see Fig. 1 of Wu and Fig. 9 of the present application, e.g., paragraph 0028, “FIG. 9 illustrates an example architecture searching model in accordance with various embodiments of this disclosure"]: 
receive an inference request [Fig. 1, page 3, section 4.2, “The operator takes the data tensor at vi as its input”; Examiner's Note: Wu recites a data tensor input that Examiner interprets as an inference request because input to the machine learning model will generate an inference]; 
determine, using the machine learning model, an inference result for the inference request using a selected inference path in the machine learning model [Fig. 1, "Executed edges" and "Not executed edges"; page 3, section 4.2, “The operator takes the data tensor at vi as its input and computes its output as eijk (vi; wijk)”; Examiner's Note: Wu teaches "Executed edges" that is considered to correspond to using a selected inference path. Also, see Fig. 1 of Wu and Fig. 9 of the present application, e.g., paragraph 0028, “FIG. 9 illustrates an example architecture searching model in accordance with various embodiments of this disclosure"], wherein:
the selected inference path is selected based on a highest probability that a result is accurate for each layer of the machine learning model [Figs. 1 and 2, “Edge Probability Pϴ1,2”; page 2, “the probability of execution is parameterized by some architecture parameters ϴ”; page 4, section 4.3, “To address this issue, we use Gumbel Softmax proposed by Jang et al. (2016); Maddison et al. (2016) to control the edge selection”; page 5, section 4.4, “we train the architecture parameter ϴ, to increase the probability to sample those edges with better performance, and to suppress those with worse performance”; Examiner's Note: As taught in Equation 4 on page 4 of Wu, edge probability Pϴij is calculated with the Soft max function performed on ϴ to "control edge selection" in order to increase the probability to sample edges with better performance, and to suppress those with worse performance based on a highest probability for each layer. See also Fig. 1 of Wu and Fig. 9 of the present application, e.g., paragraph 0028, "FIG. 9 illustrates an example architecture searching model in accordance with various embodiments of this disclosure"; paragraph [0098] of the present applicant recites, "The processor, may select the path having the highest probability" and recites using the Gumbel Soft max function previously disclosed by Wu]; and
a size of the machine learning model is reduced corresponding to constraints imposed by the electronic device [page 1, abstract, “Recent work in network quantization has substantially reduced the time and space complexity of neural network inference, enabling their deployment on embedded and mobile devices with limited computational and memory resources”; page 6, section 6.1, “We start by focusing on reducing model size, since smaller models require less storage and communication cost, which is important for mobile and embedded devices”; page 5, section 4.4, “Therefore, we train the architecture parameter ϴ, to increase the probability to sample those edges with better performance, and to suppress those with worse performance”; Examiner's Note: Wu is considered to suppress its edges with worse performance to effectively prune its model to enable it to run on an electronic device. By pruning the machine learning model, its size is reduced corresponding to the constraints imposed by the electron device so it can run on the electronic device. See also Fig. 1 of Wu and Fig. 9 of the present application, e.g., paragraph 0028, "FIG. 9 illustrates an example architecture searching model in accordance with various embodiments of this disclosure"; Paragraph [0095] “In some embodiments, paths or edges that are not selected can be pruned from the model to further decrease the size and increase the speed of the model”]; and
execute an action in response to the inference result [page 1, section 1.1, “ConvNets have become the de-facto method in a wide range of computer vision tasks”; Examiner's Note: Thus, one action executed in response to the inference result after performing a computer vision application, such as pattern recognition for identification of objects, is displaying the results of the pattern recognition. See also Fig. 1 of Wu and Fig. 9 of the present application, e.g., paragraph 0028, "FIG. 9 illustrates an example architecture searching model in accordance with various embodiments of this disclosure"];
wherein, to train the machine learning model, the at least one processor of the electronic device or another electronic device is configured to:
split parameters of the machine learning model into groups, wherein each group is associated with a specified layer of the machine learning model [page 3, section 3, “Normally 32-bit (full-precision) floating point numbers are used to represent weights and activations of neural nets. Quantization projects full-precision weights and activations to fixed-point numbers with lower bit-width, such as 8, 4, and 1 bit. We follow DoReFa-Net (Zhou et al. (2016)) to quantize weights and PACT (Choi et al. (2018)) to quantize activations”; page 6, section 6.1, “We start by focusing on reducing model size, since smaller models require less storage and communication cost, which is important for mobile and embedded devices”; Examiner's Note: As shown in Figure 1 of Wu, each layer is associated with a separate group of para meters e.g., v, e, and P; page 2, Fig. 1 discloses “Each layer of the super net contains several parallel edges representing convolution operators with quantized weights and activations with different precisions”; See aIso Fig. 1 of Wu and Fig. 9 of the present application, e.g., paragraph 0028, " FIG. 9 illustrates an example architecture searching model in accordance with various embodiments of this disclosure"];
for each group, search for a respective quantization bit providing a highest measured probability [page 3, section 3, “Normally 32-bit (full-precision) floating point numbers are used to represent weights and activations of neural nets. Quantization projects full-precision weights and activations to fixed-point numbers with lower bit-width, such as 8, 4, and 1 bit. We follow DoReFa-Net (Zhou et al. (2016)) to quantize weights and PACT (Choi et al. (2018)) to quantize activations”; page 5, section 5, We use the DNAS framework to solve the mixed precision quantization problem – deciding the optimal layer-wise precision assignment; Fig. 2, One layer of a super net for mixed precision quantization of a ConvNet. Nodes in the super net represent feature maps, edges represent convolution operators with different bit-widths; Examiner's Note: Figure 2 of Wu shows the use of "Edge Probability" in mixed precision quantization. Wu teaches using edge probability to replace 32-bit floating point numbers with fixed-point numbers (integers). See also Fig. 1 of Wu and Fig. 9 of the present application, e.g., paragraph 0028, " FIG. 9 illustrates an example architecture searching model in accordance with various embodiments of this disclosure"]; and
including the constraints within a loss function for backpropagation through the machine learning model [page 2, “we need to back propagate gradients through discrete random variables that control the stochastic edge execution”; page 4, section 4.3, “the gradient of the loss function with respect to ϴ can be computed as 

    PNG
    media_image1.png
    52
    546
    media_image1.png
    Greyscale
            (8)”; 
the equation (8) above discloses the loss function L (ma, wa); 
wherein, page 5, section 5, “we define the loss function as 
L(a, wa) = CrossEntropy(a) x C(Cost(a))”, 
and, page 6, section 5, recite “

    PNG
    media_image2.png
    46
    430
    media_image2.png
    Greyscale

where #FLOP(.) denotes the number of floating point operations (speed constraint, for example, less operations result to increasing the speed of the model)
and, 
“To compress the model size, we define the cost as

    PNG
    media_image3.png
    48
    358
    media_image3.png
    Greyscale
”
	Where, #PARAM(.) denotes the number of parameters of a convolution operator and weight-bit(.) denotes the bit-width of the weight.”
Based on the citing above, it can be seen that the loss function including the constraints such as size and/or speed constraints, etc.,
See also Fig. 1 of Wu and Fig. 9 of the present application, e.g., paragraph 0028, " FIG. 9 illustrates an example architecture searching model in accordance with various embodiments of this disclosure"]. 
Paragraph 0095 of the specification of the current Application recites “because of the fully back propagating nature of the model 900, one or more constraints can be added into the loss function such as a size constraint, an accuracy constraint, and/or an inference speed constraint … As a particular example, a loss function with an added size constraint and inference speed constraint might be expressed as follows:

    PNG
    media_image4.png
    30
    256
    media_image4.png
    Greyscale

FLOPs is the measurement of how many calculations are needed for an inference.
Wu does not teach
at least one of the constraints included in the loss function is prioritized over at least one other of the constraints included in the loss function, and wherein the at least one constraint is prioritized based on available resources for the electronic device via an initial setting of one or more values of the constraints in relation to each other within the loss function. 
Nachum teaches
at least one of the constraints included in the loss function is prioritized over at least one other of the constraints included in the loss function, and wherein the at least one constraint is prioritized based on available resources for the electronic device [paragraphs 0011-0013, “adaptively adjusting weights of one or more terms of the shrinking engine loss function that penalize active neurons of the neural network during training comprises: determining that a constraint is not met at a particular training iteration; paragraph 0018, “the set of one or more constraints include one or more of: a constraint on a maximum number of active neurons in the neural network; a constraint on a maximum inference latency of the neural network; a constraint on maximum power consumption of the neural network; a constraint on a maximum memory footprint of the neural network”; paragraph 0072, “in response to determining that a particular constraint of the set of constraints is not satisfied by the neural network during the training iteration, the engine may increase the values of the particular A-factors of the Joss function that penalize active neurons of the neural network. For example, in response to determining that a constraint on the maximum number of active neurons in the neural network is not met (e.g., because the number of active neurons in the neural network exceeds the maximum number specified by the constraint), the engine may uniformly increase (e.g., multiply by a fixed scaling factor) the values of the particular A-factors. As another example, in response to determining that a constraint on the maximum inference latency of the neural network is not met (e.g., because the inference latency of the neural network exceeds the maximum inference latency specified by the constraint), the engine may increase the values of the particular A-factors such that the increase in the value of the A-factor of a term is based on the number of operations induced by the neuron corresponding to the term. Specifically, the increase in the values of the A-factors of terms that correspond to neurons which induce a larger number of operations may be greater than the increase in the A-factors of terms that correspond to neurons which induce a smaller number of operations. Since the number of operations induced by the neurons of the neural network is directly related to the inference latency of the neural network, adaptively increasing the particular A-factors in this manner may contribute to the neural network satisfying the maximum inference latency constraint in future training iterations.”; It can be seen that the constraint on the latency is not met because the system needs to perform a large number of operations induce by a large number of neurons (for example, the number of active neurons exceeds the maximum number specified by the constraint). So, when either a constraint on the maximum number of active neurons in the neural network or a constraint on the maximum inference latency of the neural network is not met, the number of active neurons is adjusted (adjusting the weight factor that penalize active neurons/setting the weight to zero), and therefore, the constraint on the maximum number of active neurons is prioritized over the constraint on the maximum inference latency, because, limiting the number of active neurons in the neural network would reduce the number of operations induce by the neurons which could result in increasing the speed of training the network];
Wu in abstract, introduction and pages 2-4, teaches the method for finding an optimal assignment of precisions to reduce the model size, Wu also teaches a loss function comprises the constraints such as size constraint or speed constraint, etc., Wu, however, is silent of “at least one of the constraints included in the loss function is prioritized over at least one other of the constraints included in the loss function”. Nachum is added to fill in this missing element.
Nachum in Fig. 3, paragraphs 0011-0013, 0018 and 0072 teaches the loss function comprising the constraints, wherein, "the constraint is on a maximum number of active neurons in the neural network, the constraint is on a maximum inference latency of the neural network, a constraint on maximum power consumption of the neural network, etc.,"
Nachum further teaches when “a constraint on the maximum number of active neurons in the neural network is not met (e.g., because the number of active neurons in the neural network exceeds the maximum number specified by the constraint)”, or when “a constraint on the maximum inference latency of the neural network is not met”, the number of active neurons, which is one of the constraints, is adjusted, because the latency depends on the number of neurons, reducing the number of active neurons would help solving both the maximum number of active neurons constraint and the maximum inference latency constraint, thus, the number of active neurons constraint is prioritized over other constraints (inference latency constraint for example)
The combination of Wu and Nachum teaches the loss function comprises multiple constraints, wherein, one of the constraints included in the loss function (the number of active neurons constraint) is prioritized over other constraints (adjusting the number of active neurons when a certain constraint is not met) based on available resources for the electronic device such as model size or storage space.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have modified the quantization method of Wu to include at least one of the constraints included in the loss function is prioritized over at least one other of the constraints via a modification of one or more values of the constraints in relation to each other within the loss function of Nachum. Doing so would help adjusting the parameters associated with the loss function so that a set of one or more constraints are satisfied (Nachum, 0073).
Wu and Nachum do not teach
at least one of the constraints is prioritized over at least one other of the constraints, wherein the at least one constraint is prioritized via an initial setting of one or more values of the constraints in relation to each other.
Saito teaches
at least one of the constraints is prioritized over at least one other of the constraints, wherein the at least one constraint is prioritized via an initial setting of one or more values of the constraints in relation to each other [paragraph 0007, “in the technical field of embedded computers, it is essential to build a system in a short term while satisfying constraints such as circuit area size, timing, performance, and power consumption”; paragraph 0065, “accepts an input by the user concerning priorities of the respective constraints such as the power consumption, the circuit area size, and the processing time. Specifically, in the case where there are plural constraints, setting priorities to the respective constraints … if there are two constraints i.e. the processing time and the power consumption, the priority of the processing time can be set higher than the priority of the power consumption. If there are constraints that the processing time is 0.5 second or shorter, and the power consumption is 2W or less, the judger 17 judges that the target system satisfies the constraint, despite that the specification value calculator 16 calculates that the processing time required for the entirety of the target system is 0.4 second, and the power consumption is 3 W. In other words, since the priority of the processing time is higher than the priority of the power consumption, the judger 17 judges that the target system satisfies the constraint, as far as the calculated processing time satisfies the constraint concerning the processing time, although the calculated power consumption does not satisfy the constraint concerning the power consumption”; It can be seen that the processing time constraint is prioritized over the power consumption constraint, and the initial setting of the constraints are .4 second and 2W respectively].
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have modified the quantization method of Wu to include at least one of the constraints is prioritized over at least one other of the constraints, wherein the at least one constraint is prioritized via an initial setting of one or more values of the constraints of Saito. Doing so would help determining whether the constraints are satisfied based on the priority and the setting values of the constraints (Saito, 0065).

Claim 10 has substantially the same limitations as claim 2 and the same analysis applies, thus, claim 10 is anticipated by Wu in view of Nachum and further in view of Saito.
Claim 11 has substantially the same limitations as claim 3 and the same analysis applies, thus, claim 11 is anticipated by Wu in view of Nachum and further in view of Saito.
Claim 12 has substantially the same limitations as claim 4 and the same analysis applies, thus, claim 12 is anticipated by Wu in view of Nachum and further in view of Saito.
Claim 13 has substantially the same limitations as claim 5 and the same analysis applies, thus, claim 13 is anticipated by Wu in view of Nachum and further in view of Saito.
Claim 14 has substantially the same limitations as claim 6 and the same analysis applies, thus, claim 14 is anticipated by Wu in view of Nachum and further in view of Saito.
Claim 15 has substantially the same limitations as claim 7 and the same analysis applies, thus, claim 15 is anticipated by Wu in view of Nachum and further in view of Saito.
Claim 16 has substantially the same limitations as claim 8 and the same analysis applies, thus, claim 16 is anticipated by Wu in view of Nachum and further in view of Saito.

As per claim 17, Wu teaches a non-transitory computer readable medium embodying a computer program, the computer program comprising instructions that when executed cause at least one processor of an electronic device to [page 1, abstract, “Recent work in network quantization has substantially reduced the time and space complexity of neural network inference, enabling their deployment on embedded and mobile devices with limited computational and memory resources”; Examiner's Note: Neural networks are interpreted as a type of machine learning that executes on the recited electronic devices with limited computational and memory resources. Also, see Fig. 1 of Wu and Fig. 9 of the present application, e.g., paragraph 0028, “FIG. 9 illustrates an example architecture searching model in accordance with various embodiments of this disclosure"]: 
receive an inference request [Fig. 1, page 3, section 4.2, “The operator takes the data tensor at vi as its input”; Examiner's Note: Wu recites a data tensor input that Examiner interprets as an inference request because input to the machine learning model will generate an inference]; 
determine, using the machine learning model, an inference result for the inference request using a selected inference path in the machine learning model [Fig. 1, "Executed edges" and "Not executed edges"; page 3, section 4.2, “The operator takes the data tensor at vi as its input and computes its output as eijk (vi; wijk)”; Examiner's Note: Wu teaches "Executed edges" that is considered to correspond to using a selected inference path. Also, see Fig. 1 of Wu and Fig. 9 of the present application, e.g., paragraph 0028, “FIG. 9 illustrates an example architecture searching model in accordance with various embodiments of this disclosure"], wherein:
the selected inference path is selected based on a highest probability that a result is accurate for each layer of the machine learning model [Figs. 1 and 2, “Edge Probability Pϴ1,2”; page 2, “the probability of execution is parameterized by some architecture parameters ϴ”; page 4, section 4.3, “To address this issue, we use Gumbel Softmax proposed by Jang et al. (2016); Maddison et al. (2016) to control the edge selection”; page 5, section 4.4, “we train the architecture parameter ϴ, to increase the probability to sample those edges with better performance, and to suppress those with worse performance”; Examiner's Note: As taught in Equation 4 on page 4 of Wu, edge probability Pϴij is calculated with the Soft max function performed on ϴ to "control edge selection" in order to increase the probability to sample edges with better performance, and to suppress those with worse performance based on a highest probability for each layer. See also Fig. 1 of Wu and Fig. 9 of the present application, e.g., paragraph 0028, "FIG. 9 illustrates an example architecture searching model in accordance with various embodiments of this disclosure"; paragraph [0098] of the present applicant recites, "The processor, may select the path having the highest probability" and recites using the Gumbel Soft max function previously disclosed by Wu]; and
a size of the machine learning model is reduced corresponding to constraints imposed by the electronic device [page 1, abstract, “Recent work in network quantization has substantially reduced the time and space complexity of neural network inference, enabling their deployment on embedded and mobile devices with limited computational and memory resources”; page 6, section 6.1, “We start by focusing on reducing model size, since smaller models require less storage and communication cost, which is important for mobile and embedded devices”; page 5, section 4.4, “Therefore, we train the architecture parameter ϴ, to increase the probability to sample those edges with better performance, and to suppress those with worse performance”; Examiner's Note: Wu is considered to suppress its edges with worse performance to effectively prune its model to enable it to run on an electronic device. By pruning the machine learning model, its size is reduced corresponding to the constraints imposed by the electron device so it can run on the electronic device. See also Fig. 1 of Wu and Fig. 9 of the present application, e.g., paragraph 0028, "FIG. 9 illustrates an example architecture searching model in accordance with various embodiments of this disclosure"; Paragraph [0095] “In some embodiments, paths or edges that are not selected can be pruned from the model to further decrease the size and increase the speed of the model”]; and
execute an action in response to the inference result [page 1, section 1.1, “ConvNets have become the de-facto method in a wide range of computer vision tasks”; Examiner's Note: Thus, one action executed in response to the inference result after performing a computer vision application, such as pattern recognition for identification of objects, is displaying the results of the pattern recognition. See also Fig. 1 of Wu and Fig. 9 of the present application, e.g., paragraph 0028, "FIG. 9 illustrates an example architecture searching model in accordance with various embodiments of this disclosure"];
wherein the machine learning model is trained by:
split parameters of the machine learning model into groups, wherein each group is associated with a specified layer of the machine learning model [page 3, section 3, “Normally 32-bit (full-precision) floating point numbers are used to represent weights and activations of neural nets. Quantization projects full-precision weights and activations to fixed-point numbers with lower bit-width, such as 8, 4, and 1 bit. We follow DoReFa-Net (Zhou et al. (2016)) to quantize weights and PACT (Choi et al. (2018)) to quantize activations”; page 6, section 6.1, “We start by focusing on reducing model size, since smaller models require less storage and communication cost, which is important for mobile and embedded devices”; Examiner's Note: As shown in Figure 1 of Wu, each layer is associated with a separate group of para meters e.g., v, e, and P; page 2, Fig. 1 discloses “Each layer of the super net contains several parallel edges representing convolution operators with quantized weights and activations with different precisions”; See aIso Fig. 1 of Wu and Fig. 9 of the present application, e.g., paragraph 0028, " FIG. 9 illustrates an example architecture searching model in accordance with various embodiments of this disclosure"];
for each group, search for a respective quantization bit providing a highest measured probability [page 3, section 3, “Normally 32-bit (full-precision) floating point numbers are used to represent weights and activations of neural nets. Quantization projects full-precision weights and activations to fixed-point numbers with lower bit-width, such as 8, 4, and 1 bit. We follow DoReFa-Net (Zhou et al. (2016)) to quantize weights and PACT (Choi et al. (2018)) to quantize activations”; page 5, section 5, We use the DNAS framework to solve the mixed precision quantization problem – deciding the optimal layer-wise precision assignment; Fig. 2, One layer of a super net for mixed precision quantization of a ConvNet. Nodes in the super net represent feature maps, edges represent convolution operators with different bit-widths; Examiner's Note: Figure 2 of Wu shows the use of "Edge Probability" in mixed precision quantization. Wu teaches using edge probability to replace 32-bit floating point numbers with fixed-point numbers (integers). See also Fig. 1 of Wu and Fig. 9 of the present application, e.g., paragraph 0028, " FIG. 9 illustrates an example architecture searching model in accordance with various embodiments of this disclosure"]; and
including the constraints within a loss function for backpropagation through the machine learning model [page 2, “we need to back propagate gradients through discrete random variables that control the stochastic edge execution”; page 4, section 4.3, “the gradient of the loss function with respect to ϴ can be computed as 

    PNG
    media_image1.png
    52
    546
    media_image1.png
    Greyscale
            (8)”; 
the equation (8) above discloses the loss function L (ma, wa); 
wherein, page 5, section 5, “we define the loss function as 
L(a, wa) = CrossEntropy(a) x C(Cost(a))”, 
and, page 6, section 5, recite “

    PNG
    media_image2.png
    46
    430
    media_image2.png
    Greyscale

where #FLOP(.) denotes the number of floating point operations (speed constraint, for example, less operations result to increasing the speed of the model)
and, 
“To compress the model size, we define the cost as

    PNG
    media_image3.png
    48
    358
    media_image3.png
    Greyscale
”
	Where, #PARAM(.) denotes the number of parameters of a convolution operator and weight-bit(.) denotes the bit-width of the weight.”
Based on the citing above, it can be seen that the loss function including the constraints such as size and/or speed constraints, etc.,
See also Fig. 1 of Wu and Fig. 9 of the present application, e.g., paragraph 0028, " FIG. 9 illustrates an example architecture searching model in accordance with various embodiments of this disclosure"]. 
Paragraph 0095 of the specification of the current Application recites “because of the fully back propagating nature of the model 900, one or more constraints can be added into the loss function such as a size constraint, an accuracy constraint, and/or an inference speed constraint … As a particular example, a loss function with an added size constraint and inference speed constraint might be expressed as follows:


    PNG
    media_image4.png
    30
    256
    media_image4.png
    Greyscale

FLOPs is the measurement of how many calculations are needed for an inference.
Wu does not teach
at least one of the constraints included in the loss function is prioritized over at least one other of the constraints included in the loss function, and wherein the at least one constraint is prioritized based on available resources for the electronic device via an initial setting of one or more values of the constraints in relation to each other within the loss function. 
Nachum teaches
at least one of the constraints included in the loss function is prioritized over at least one other of the constraints included in the loss function, and wherein the at least one constraint is prioritized based on available resources for the electronic device [paragraphs 0011-0013, “adaptively adjusting weights of one or more terms of the shrinking engine loss function that penalize active neurons of the neural network during training comprises: determining that a constraint is not met at a particular training iteration; paragraph 0018, “the set of one or more constraints include one or more of: a constraint on a maximum number of active neurons in the neural network; a constraint on a maximum inference latency of the neural network; a constraint on maximum power consumption of the neural network; a constraint on a maximum memory footprint of the neural network”; paragraph 0072, “in response to determining that a particular constraint of the set of constraints is not satisfied by the neural network during the training iteration, the engine may increase the values of the particular A-factors of the Joss function that penalize active neurons of the neural network. For example, in response to determining that a constraint on the maximum number of active neurons in the neural network is not met (e.g., because the number of active neurons in the neural network exceeds the maximum number specified by the constraint), the engine may uniformly increase (e.g., multiply by a fixed scaling factor) the values of the particular A-factors. As another example, in response to determining that a constraint on the maximum inference latency of the neural network is not met (e.g., because the inference latency of the neural network exceeds the maximum inference latency specified by the constraint), the engine may increase the values of the particular A-factors such that the increase in the value of the A-factor of a term is based on the number of operations induced by the neuron corresponding to the term. Specifically, the increase in the values of the A-factors of terms that correspond to neurons which induce a larger number of operations may be greater than the increase in the A-factors of terms that correspond to neurons which induce a smaller number of operations. Since the number of operations induced by the neurons of the neural network is directly related to the inference latency of the neural network, adaptively increasing the particular A-factors in this manner may contribute to the neural network satisfying the maximum inference latency constraint in future training iterations.”; It can be seen that the constraint on the latency is not met because the system needs to perform a large number of operations induce by a large number of neurons (for example, the number of active neurons exceeds the maximum number specified by the constraint). So, when either a constraint on the maximum number of active neurons in the neural network or a constraint on the maximum inference latency of the neural network is not met, the number of active neurons is adjusted (adjusting the weight factor that penalize active neurons/setting the weight to zero), and therefore, the constraint on the maximum number of active neurons is prioritized over the constraint on the maximum inference latency, because, limiting the number of active neurons in the neural network would reduce the number of operations induce by the neurons which could result in increasing the speed of training the network];
Wu in abstract, introduction and pages 2-4, teaches the method for finding an optimal assignment of precisions to reduce the model size, Wu also teaches a loss function comprises the constraints such as size constraint or speed constraint, etc., Wu, however, is silent of “at least one of the constraints included in the loss function is prioritized over at least one other of the constraints included in the loss function”. Nachum is added to fill in this missing element.
Nachum in Fig. 3, paragraphs 0011-0013, 0018 and 0072 teaches the loss function comprising the constraints, wherein, "the constraint is on a maximum number of active neurons in the neural network, the constraint is on a maximum inference latency of the neural network, a constraint on maximum power consumption of the neural network, etc.,"
Nachum further teaches when “a constraint on the maximum number of active neurons in the neural network is not met (e.g., because the number of active neurons in the neural network exceeds the maximum number specified by the constraint)”, or when “a constraint on the maximum inference latency of the neural network is not met”, the number of active neurons, which is one of the constraints, is adjusted, because the latency depends on the number of neurons, reducing the number of active neurons would help solving both the maximum number of active neurons constraint and the maximum inference latency constraint, thus, the number of active neurons constraint is prioritized over other constraints (inference latency constraint for example)
The combination of Wu and Nachum teaches the loss function comprises multiple constraints, wherein, one of the constraints included in the loss function (the number of active neurons constraint) is prioritized over other constraints (adjusting the number of active neurons when a certain constraint is not met) based on available resources for the electronic device such as model size or storage space.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have modified the quantization method of Wu to include at least one of the constraints included in the loss function is prioritized over at least one other of the constraints via a modification of one or more values of the constraints in relation to each other within the loss function of Nachum. Doing so would help adjusting the parameters associated with the loss function so that a set of one or more constraints are satisfied (Nachum, 0073).
Wu and Nachum do not teach
at least one of the constraints is prioritized over at least one other of the constraints, wherein the at least one constraint is prioritized via an initial setting of one or more values of the constraints in relation to each other.
Saito teaches
at least one of the constraints is prioritized over at least one other of the constraints, wherein the at least one constraint is prioritized via an initial setting of one or more values of the constraints in relation to each other [paragraph 0007, “in the technical field of embedded computers, it is essential to build a system in a short term while satisfying constraints such as circuit area size, timing, performance, and power consumption”; paragraph 0065, “accepts an input by the user concerning priorities of the respective constraints such as the power consumption, the circuit area size, and the processing time. Specifically, in the case where there are plural constraints, setting priorities to the respective constraints … if there are two constraints i.e. the processing time and the power consumption, the priority of the processing time can be set higher than the priority of the power consumption. If there are constraints that the processing time is 0.5 second or shorter, and the power consumption is 2W or less, the judger 17 judges that the target system satisfies the constraint, despite that the specification value calculator 16 calculates that the processing time required for the entirety of the target system is 0.4 second, and the power consumption is 3 W. In other words, since the priority of the processing time is higher than the priority of the power consumption, the judger 17 judges that the target system satisfies the constraint, as far as the calculated processing time satisfies the constraint concerning the processing time, although the calculated power consumption does not satisfy the constraint concerning the power consumption”; It can be seen that the processing time constraint is prioritized over the power consumption constraint, and the initial setting of the constraints are .4 second and 2W respectively].
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have modified the quantization method of Wu to include at least one of the constraints is prioritized over at least one other of the constraints, wherein the at least one constraint is prioritized via an initial setting of one or more values of the constraints of Saito. Doing so would help determining whether the constraints are satisfied based on the priority and the setting values of the constraints (Saito, 0065).

Claim 18 has substantially the same limitations as claim 2 and the same analysis applies, thus, claim 18 is anticipated by Wu in view of Nachum and further in view of Saito.
Claim 19 has substantially the same limitations as claim 3 and the same analysis applies, thus, claim 19 is anticipated by Wu in view of Nachum and further in view of Saito.
Claim 20 has substantially the same limitations as claim 6 and the same analysis applies, thus, claim 14 is anticipated by Wu in view of Nachum and further in view of Saito.


Prior Art
The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure.
Corley et al. (US Pub. 2008/0134193) describes a method of allocating resources comprising an objective function and a set of constraints describing feasible allocations of the resources.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to TRI T NGUYEN whose telephone number is 571-272-0103. The examiner can normally be reached M-F, 8 AM-5 PM, (CT).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, OMAR FERNANDEZ can be reached at 571-272-2589. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/TRI T NGUYEN/Examiner, Art Unit 2128               

/RYAN C VAUGHN/Primary Examiner, Art Unit 2125
Read full office action
Prosecution Timeline

Show 15 earlier events
Oct 07, 2025
Interview Requested
Nov 11, 2025
Response after Non-Final Action
Dec 22, 2025
Request for Continued Examination
Jan 15, 2026
Response after Non-Final Action
Jan 23, 2026
Non-Final Rejection — §103
Mar 30, 2026
Interview Requested
Apr 09, 2026
Applicant Interview (Telephonic)
Apr 09, 2026
Examiner Interview Summary
Precedent Cases

Applications granted by this same examiner with similar technology

17/086,271
Patent 12608586
PARAMETER SHARING DECODER PAIR FOR AUTO COMPOSING
5y 5m to grant Granted Apr 21, 2026
17/011,734
Patent 12572820
METHODS AND SYSTEMS FOR GENERATING KNOWLEDGE GRAPHS FROM PROGRAM SOURCE CODE
5y 6m to grant Granted Mar 10, 2026
16/976,398
Patent 12536418
PERTURBATIVE NEURAL NETWORK
5y 5m to grant Granted Jan 27, 2026
17/083,367
Patent 12524662
BLOCKCHAIN FOR ARTIFICIAL INTELLIGENCE TRAINING
5y 2m to grant Granted Jan 13, 2026
17/277,118
Patent 12493963
JOINT UNSUPERVISED OBJECT SEGMENTATION AND INPAINTING
4y 8m to grant Granted Dec 09, 2025
Study what changed to get past this examiner. Based on 5 most recent grants.
Strategy Recommendation AI-generated — please review before filing

Get a prosecution strategy drawn from examiner precedents, rejection analysis, and claim mapping.
Typically takes 5-10 seconds — AI-generated, attorney review required before filing
Prosecution Projections

5-6
Expected OA Rounds
68%
Grant Probability
82%
With Interview (+13.2%)
3y 10m (~0m remaining)
Median Time to Grant
High
PTA Risk
Based on 183 resolved cases by this examiner. Grant probability derived from career allowance rate.