Prosecution Insights
Last updated: April 19, 2026
Application No. 18/171,433

METHOD OF TRAINING BINARIZED NEURAL NETWORK WITH PARAMETERIZED WEIGHT CLIPPING AND MEMORY DEVICE USING THE SAME

Status: Final Rejection (§§ 102, 103, 112)
Filed: Feb 20, 2023
Examiner: BOSTWICK, SIDNEY VINCENT
Art Unit: 2124
Tech Center: 2100 (Computer Architecture & Software)
Assignee: Samsung Electronics Co., Ltd.
OA Round: 2 (Final)

Predictions: grant probability 52% (Moderate); 3-4 OA rounds; 4y 7m to grant; 90% grant probability with interview.

Examiner Intelligence

Career allow rate: 52% of resolved cases (71 granted / 136 resolved; -2.8% vs TC average)
Interview lift: strong, +38.2% on resolved cases with an interview
Typical timeline: 4y 7m average prosecution; 68 applications currently pending
Career history: 204 total applications across all art units

Statute-Specific Performance

§101: 24.4% (-15.6% vs TC avg)
§103: 40.9% (+0.9% vs TC avg)
§102: 12.0% (-28.0% vs TC avg)
§112: 21.9% (-18.1% vs TC avg)

Tech Center averages are estimates. Based on career data from 136 resolved cases.

Office Action

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Remarks

This Office Action is responsive to Applicant's Amendment filed on January 6, 2026, in which claims 1, 3, 8, and 16 are currently amended. Claims 1-20 are currently pending.

Information Disclosure Statement

The information disclosure statement (IDS) submitted on January 6, 2026 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Response to Arguments

The rejections of claims 1-20 under 35 U.S.C. § 101 are hereby withdrawn, as necessitated by Applicant's amendments and remarks directed to those rejections. Applicant's arguments with respect to the rejection of claims 1-20 under 35 U.S.C. § 103, based on the amendment, have been considered but are not persuasive. After further consideration, the Examiner notes that the instant claims recite the first and second threshold values only as broadly defined compound terms, such that interpreting Xu's tau as a threshold value is reasonable ([p. 7] "given an initial tau_s and end threshold tau_e, tau_i at the ith training epoch"). More specifically, Xu explicitly defines a tau for each epoch of training (tau_i), such that the value at the initial epoch (tau_s) is interpreted as the first threshold value and the value at a second epoch (tau_i for i=2) is interpreted as the second threshold value. Xu further discloses (as would be immediately understood by one of ordinary skill in the art) that each epoch of training involves computation of a gradient ([p. 6] "we complete the binary convolution using Eq. (3) for the forward propagation. During backpropagation, we derive the gradients w.r.t. W and A using Eq. (5) and Eq. (6), respectively, and update W using the stochastic gradient descent (SGD) described in Sec. 5.1."), where one epoch involves iterative forward propagation followed by backpropagation for each batch, with a gradient for each respective batch (Xu discloses a batch size of 256 on CIFAR-10, whose 50k training images yield roughly 196 iterations and corresponding gradients per epoch). In other words, tau_i corresponds to a gradient computed for the i-th epoch, such that said gradient is interpreted as a gradient of the threshold value, and there are explicitly more than two epochs (Xu uses 600 epochs). The instant claims do not explicitly limit the terms "a gradient of the first threshold value" or "a gradient of the second threshold value" and only require, broadly, "using" said broadly defined terms. For these reasons the Examiner asserts that this interpretation is reasonable and that the rejection in view of Xu should be maintained.

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b):

(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 1-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or, for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.

Regarding claim 3, "wherein the changing range of the clipping function comprises a subtraction" is indefinite.
It would be unclear to one of ordinary skill in the art how a range can comprise a subtraction, and the instant specification does not disclose how a changing range can comprise a subtraction. In the interest of further examination, the claim limitation is interpreted as "wherein the changing range of the clipping function is based on a subtraction".

Claim Rejections - 35 USC § 102

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless – (a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1-5 and 9-15 are rejected under 35 U.S.C. § 102(a)(1) as being anticipated by Xu ("ReCU: Reviving the Dead Weights in Binary Neural Networks", 2021).

Regarding claim 1, Xu teaches A method of training a binarized neural network (BNN), the method comprising:

generating a binarized weight set by applying a clipping function to a weight set ([Abstract] "In this paper, for the first time, we explore the influence of 'dead weights' which refer to a group of weights that are barely updated during the training of BNNs, and then introduce rectified clamp unit (ReCU) to revive the 'dead weights' for updating"; the clamp unit is interpreted as the clipping function);

generating output data by sequentially performing a forward computation on the binarized neural network based on input data and the binarized weight set ([p. 6 §4.3] "in the forward propagation, we first standardize W and revive the 'dead weights' using Eq. (22) and ReCU of Eq. (8), respectively. Then, we compute the scaling factor α using Eq. (13), and binarize the inputs and the revived weights using the sign function of Eq. (2). Finally, we complete the binary convolution using Eq. (3) for the forward propagation");

generating a gradient of the weight set by sequentially performing a backward computation on the binarized neural network based on loss calculated from the output data ([p. 6 §4.3] "During backpropagation, we derive the gradients w.r.t. W and A using Eq. (5) and Eq. (6), respectively, and update W using the stochastic gradient descent (SGD) described in Sec. 5.1."); and

training the binarized neural network ([p. 2 §1] "In this paper, we present a novel perspective to improve the effectiveness and training efficiency of BNN"; [p. 6 §5.1] "Training Details. Our network is trained from scratch without depending on a pre-trained model. For all experiments, we use SGD for optimization with a momentum of 0.9 and the weight-decay is set to 5e-4"), comprising:

updating the weight set based on the gradient of the weight set ([p. 6 §4.3] "During backpropagation, we derive the gradients w.r.t. W and A using Eq. (5) and Eq. (6), respectively, and update W using the stochastic gradient descent (SGD) described in Sec. 5.1"); and

changing a range of the clipping function ([p. 4 §4.1] "in Sec. 5, we introduce an adaptive exponential scheduler to identify the range of the 'dead weights' in order to seek a balance between the quantization error and the information entropy"; see Eqn. 8. [p. 7 §5.2.1] "we further propose an exponential scheduler for adapting τ along the network training. Our motivation lies in that τ should start with a value falling within [0.85, 0.94] to pursue a good accuracy, and then gradually go to the interval [0.96, 1.00] to stabilize the variance of performance. Based on this, given an initial τs and an end threshold τe, τi at the i-th training epoch is calculated as follows"; see Eqn. 24. Tau is the parameter that controls the clipping range.);

wherein the clipping function includes a first threshold value and a second threshold value, and wherein changing the range of the clipping function comprises using a gradient of the first threshold value of the clipping function (same passages of Xu; see Eqns. 8 and 24. Tau is the parameter that controls the clipping range; tau in the first epoch is interpreted as the first threshold value, and the gradient computed in that epoch is interpreted as the gradient of the first threshold (tau). Tau is changed from a first value to a second value to a third value.).

Regarding claim 2, Xu teaches The method of claim 1, wherein the range of the clipping function is adaptively changed based on a magnitude of the weight set (Xu [p. 5 §4.3] "Accordingly, the information entropy of W after applying ReCU can be computed by [see Eqn. 17] which is a function of τ by substituting ˆb in Eq. (11) for b. [...] the mean of the absolute values of the weights after standardization is [see Eqn. 20]"; Eqn. 20 explicitly computes a magnitude of the weight set, and the information entropy of the weights in Eqn. 17 varies with b and tau, which are adaptively changed based on said magnitude.).
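The § 102 mapping turns on Xu's per-epoch schedule for τ (Eq. 24). For orientation, a minimal sketch of such an exponential scheduler; the geometric interpolation below is an assumption, since Eq. (24) itself is not reproduced in the record, but it matches the described endpoints and shape:

```python
def tau_schedule(i, total_epochs, tau_s=0.85, tau_e=0.99):
    """Per-epoch clipping threshold tau_i, interpolated exponentially
    from tau_s to tau_e over training.

    Assumed form: geometric interpolation; Xu's exact Eq. (24) may differ.
    """
    return tau_s * (tau_e / tau_s) ** (i / total_epochs)

# Xu trains for 600 epochs; tau rises monotonically from 0.85 toward 0.99.
taus = [tau_schedule(i, 600) for i in (0, 150, 300, 450, 600)]
```

Under this reading, each epoch has its own τi, which is the basis for the Examiner's interpretation of τ at successive epochs as the first and second threshold values.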
Regarding claim 3, Xu teaches wherein the changing range of the clipping function comprises a subtraction (Xu [p. 4 §4.1] and [p. 7 §5.2.1], quoted above regarding claim 1; see Eqns. 8 and 24. As noted in the § 112 rejection, the claim is interpreted as "wherein the changing range of the clipping function is based on a subtraction". Tau is the parameter that controls the clipping range; tau in the first epoch is interpreted as the first threshold value, and the gradient computed in that epoch is interpreted as the gradient of the first threshold (tau). See also Eqn. 4, which involves a subtraction to determine the quantization error that is then used to compute said epoch gradient.) and a gradient of the second threshold value of the clipping function (same passages; see Eqn. 24. Tau in the second epoch is interpreted as the second threshold value, and the gradient computed in that epoch is interpreted as the gradient of the second threshold (tau).).

Regarding claim 4, Xu teaches The method of claim 1, wherein changing the range of the clipping function comprises changing at least one of a first threshold value (Xu [p. 4 §4.1] and [p. 7 §5.2.1], quoted above regarding claim 1; see Eqns. 8 and 24. Tau is the parameter that controls the clipping range.) and a second threshold value of the clipping function (Xu [p. 4 §4.2] "Obviously, α is a function of τ after replacing b with the estimation ˆb in Eq"; alpha is interpreted as the second threshold value, which is explicitly varied with the first threshold value tau.).

Regarding claim 5, Xu teaches The method of claim 1, wherein generating the binarized weight set comprises: obtaining a clipped weight set by applying the clipping function to a plurality of weight elements included in the weight set (Xu [p. 6 §4.3] "we first standardize W and revive the 'dead weights' using Eq. (22) and ReCU of Eq. (8), respectively"); and obtaining the binarized weight set by applying a scaled sign function to a plurality of clipped weight elements included in the clipped weight set (Xu [p. 6 §4.3] "we compute the scaling factor α using Eq. (13), and binarize the inputs and the revived weights using the sign function of Eq. (2)").

Regarding claim 9, Xu teaches The method of claim 1, further comprising: iteratively performing the operations of generating the binarized weight set, generating the output data, generating the gradient of the weight set, updating the weight set, and changing the range of the clipping function (Xu [p. 7 §5.2.1] "Our motivation lies in that τ should start with a value falling within [0.85, 0.94] to pursue a good accuracy, and then gradually go to the interval [0.96, 1.00] to stabilize the variance of performance. Based on this, given an initial τs and an end threshold τe, τi at the i-th training epoch is calculated as follows [see Eqn. 24] where I denotes the total number of training epochs"; an epoch is interpreted as synonymous with an iteration).

Regarding claim 10, Xu teaches The method of claim 9, further comprising: storing a result of the training operation after iteratively performing the operations of generating the binarized weight set, generating the output data, generating the gradient of the weight set, updating the weight set, and changing the range of the clipping function for a predetermined number of iterations (Xu: see Table 6, where results of the training operation after performing said operations for a predetermined number of iterations are stored/logged).

Regarding claim 12, Xu teaches The method of claim 10, wherein storing the result of the training operation comprises: storing the binarized weight set (Xu [pp. 15-16] "Quantized model link" provides links to stored binarized weight sets produced by the method).

Regarding claim 13, Xu teaches The method of claim 1, wherein the binarized neural network is a binarized convolutional neural network (BCNN) (Xu [p. 6 §5.1] "Network Structures. On CIFAR-10, we evaluate ReCU with ResNet-18/20 [21] and VGG-Small [51]. Following the compared methods, we binarize all convolutional and fully-connected layers except the first and the last ones").

Regarding claim 14, Xu teaches The method of claim 13, wherein the binarized convolutional neural network includes a plurality of convolutional layers, and wherein the operations of generating the binarized weight set, generating the output data, generating the gradient of the weight set, updating the weight set, and changing the range of the clipping function are performed on at least one of remaining convolutional layers other than a first convolutional layer among the plurality of convolutional layers (Xu [p. 6 §5.1], quoted above regarding claim 13).

Regarding claim 15, Xu teaches The method of claim 13, wherein the binarized convolutional neural network includes a plurality of layers, wherein the plurality of layers include at least one fully connected layer, and wherein the operations of generating the binarized weight set, generating the output data, generating the gradient of the weight set, updating the weight set, and changing the range of the clipping function are performed on at least one of remaining layers other than the fully connected layer among the plurality of layers (Xu [p. 6 §5.1], quoted above regarding claim 13).

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 6, 7, and 20 are rejected under 35 U.S.C. § 103 as being unpatentable over Xu as evidenced by He ("Deep Residual Learning for Image Recognition", 2016).

Regarding claim 6, Xu teaches The method of claim 1, wherein generating the output data comprises: performing a convolution operation on the input data using the binarized weight set (Xu [p. 6 §5.1] "we binarize all convolutional and fully-connected layers"; see also the performance comparisons in Tables 5 and 6, which report output accuracies compared to expected results when using the binarization method).
However, Xu does not explicitly teach performing a pooling operation on a result of the convolution operation; performing a batch normalization on a result of the pooling operation; and obtaining the output data by applying an activation function to a result of the batch normalization.

He, in the same field of endeavor, teaches performing a pooling operation on a result of the convolution operation ([p. 776 §4.2] "The network ends with a global average pooling"; see also the ResNet-34 architecture in FIG. 3, which shows pooling applied at the first and last convolutions); performing a batch normalization on a result of the pooling operation ([p. 4 §3.4] "We adopt batch normalization (BN) [16] right after each convolution and before activation"); and obtaining the output data by applying an activation function to a result of the batch normalization (same passage of He: BN is applied before activation, so the activation is applied to the result of the batch normalization; see also FIG. 3).

He is merely introduced as a teaching reference to explicitly reinforce the model structure used by Xu. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention that the ResNet model explicitly used by Xu has a known structure, because that model was first described by He.

Regarding claim 7, Xu teaches The method of claim 1, wherein generating the gradient of the weight set comprises: reversely applying an activation function to loss input data calculated from the output data (Xu [p. 6 §5.1] "we binarize all convolutional and fully-connected layers"; [p. 6 §4.3] "update W using the stochastic gradient descent (SGD)"; see also the performance comparisons in Tables 5 and 6. Backpropagation with stochastic gradient descent (SGD) by definition applies the loss calculated from the output data to the network in reverse.); obtaining a gradient of a clipped weight set by reversely applying a scaled sign function to the gradient of the binarized weight set (Xu [p. 2] "Since the quantization function in the forward propagation of BNNs has zero gradient almost everywhere, an approximate gradient function is required to enable the network to update. A typical example is the straight through estimator (STE) [...] The experimental results show that ReCU achieves state-of-the-art performance, as well as faster training convergence even with the simple STE [4] as our weight gradient approximation"; Xu explicitly uses the STE to reversely apply a gradient through the sign function); and obtaining the gradient of the weight set, a gradient of a first threshold value of the clipping function (Xu [p. 4 §4.1] and [p. 7 §5.2.1], quoted above regarding claim 1; see Eqns. 8 and 24. Tau is the parameter that controls the clipping range; tau in the first epoch is interpreted as the first threshold value, and the gradient computed in that epoch is interpreted as the gradient of the first threshold (tau).), and a gradient of a second threshold value of the clipping function by reversely applying the clipping function to the gradient of the clipped weight set (same passages; tau in the second epoch is interpreted as the second threshold value, and the gradient computed in that epoch is interpreted as the gradient of the second threshold (tau).).

However, Xu does not explicitly teach reversely performing a batch normalization on a result of reversely applying the activation function; reversely performing a pooling operation on a result of reversely performing the batch normalization; and obtaining loss output data and a gradient of the binarized weight set by reversely performing a convolution operation on a result of reversely performing the pooling operation.

He, in the same field of endeavor, teaches reversely performing a batch normalization on a result of reversely applying the activation function ([p. 4 §3.4] "We adopt batch normalization (BN) [16] right after each convolution and before activation"); reversely performing a pooling operation on a result of reversely performing the batch normalization ([p. 776 §4.2] "The network ends with a global average pooling"; see also the ResNet-34 architecture in FIG. 3); and obtaining loss output data and a gradient of the binarized weight set by reversely performing a convolution operation on a result of reversely performing the pooling operation (see the ResNet-34 architecture in FIG. 3, which shows pooling applied at the first and last convolutions).

He is merely introduced as a teaching reference to explicitly reinforce the model structure used by Xu. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention that the ResNet model explicitly used by Xu has a known structure, because that model was first described by He.

Regarding claim 20, Xu teaches A method of training a binarized neural network (BNN), the method comprising: obtaining a clipped weight set by applying a clipping function to a weight set ([Abstract] "In this paper, for the first time, we explore the influence of 'dead weights' which refer to a group of weights that are barely updated during the training of BNNs, and then introduce rectified clamp unit (ReCU) to revive the 'dead weights' for updating"; the clamp unit is interpreted as the clipping function); obtaining a binarized weight set by applying a scaled sign function to the clipped weight set ([p. 6 §4.3] "we compute the scaling factor α using Eq. (13), and binarize the inputs and the revived weights using the sign function of Eq. (2)"); generating output data by sequentially performing a forward computation on the binarized neural network based on input data and the binarized weight set ([p. 6 §4.3] "in the forward propagation, we first standardize W and revive the 'dead weights' using Eq. (22) and ReCU of Eq. (8), respectively. Then, we compute the scaling factor α using Eq. (13), and binarize the inputs and the revived weights using the sign function of Eq. (2). Finally, we complete the binary convolution using Eq. (3) for the forward propagation"); generating a gradient of the weight set, a gradient of a first threshold value of the clipping function, and a gradient of a second threshold value of the clipping function by sequentially performing a backward computation on the binarized neural network based on loss calculated from the output data ([p. 6 §4.3] "During backpropagation, we derive the gradients w.r.t. W and A using Eq. (5) and Eq. (6), respectively, and update W using the stochastic gradient descent (SGD) described in Sec. 5.1."); updating the weight set based on the gradient of the weight set (same passage of Xu); updating the first threshold value of the clipping function based on the gradient of the first threshold value of the clipping function ([p. 4 §4.1] and [p. 7 §5.2.1], quoted above regarding claim 1; see Eqns. 8 and 24. Tau is the parameter that controls the clipping range; tau in the first epoch is interpreted as the first threshold value, and the gradient computed in that epoch is interpreted as the gradient of the first threshold (tau). Tau is changed from a first value to a second value to a third value.)
updating the second threshold value of the clipping function based on the gradient of the second threshold value of the clipping function,([p. 4 4.1] "in Sec. 5, we introduce an adaptive exponential scheduler to identify the range of the “dead weights” in order to seek a balance between the quantization error and the information entropy" See Eqn. 8. [p. 7 §5.2.1] "we further propose an exponential scheduler for adapting τ along the network training. Our motivation lies in that τ should start with a value falling within [0.85, 0.94] to pursue a good accuracy, and then gradually go to the interval [0.96, 1.00] to stabilize the variance of performance. Based on this, given an initial τs and an end threshold τe, τi at the i-th training epoch is calculated as follows" See Eqn. 24. Tau is the parameter that controls the clipping range, tau in the second epoch is interpreted as the second threshold value, the gradient computed this epoch is interpreted as the gradient of the second threshold (tau).) wherein: the binarized neural network is a binarized convolutional neural network (BCNN),([p. 6 §5.1] "Network Structures. On CIFAR-10, we evaluate ReCU with ResNet-18/20 [21] and VGG-Small [51]. Following the compared methods, we binarize all convolutional and fully-connected layers except the first and the last ones") the scaled sign function, and the clipping function,([p. 2] "Since the quantization function in the forward propagation of BNNs has zero gradient almost everywhere, an approximate gradient function is required to enable the network to update. A typical example is the straight through estimator (STE) [...] The experimental results show that ReCU achieves state-of-the-art performance, as well as faster training convergence even with the simple STE [4] as our weight gradient approximation" Xu explicitly uses STE to reversely apply a gradient to the sign function.) 
a range of the clipping function is adaptively changed based on a magnitude of the weight set, and([p. 5 §4.3] "Accordingly, the information entropy of W after applying ReCU can be computed by [See Eqn. 17] which is a function of τ by substituting ˆb in Eq. (11) for b. [...] the mean of the absolute values of the weights after standardization is [See Eqn. 20]" Eqn. 20 explicitly computes a magnitude of the weight set, where the information entropy of weights in Eqn. 17 varies with b and tau which are adaptively changed based on said magnitude.) the range of the clipping function is changed using the gradient of the first threshold value of the clipping function ([p. 4 4.1] "in Sec. 5, we introduce an adaptive exponential scheduler to identify the range of the “dead weights” in order to seek a balance between the quantization error and the information entropy" See Eqn. 8. [p. 7 §5.2.1] "we further propose an exponential scheduler for adapting τ along the network training. Our motivation lies in that τ should start with a value falling within [0.85, 0.94] to pursue a good accuracy, and then gradually go to the interval [0.96, 1.00] to stabilize the variance of performance. Based on this, given an initial τs and an end threshold τe, τi at the i-th training epoch is calculated as follows" See Eqn. 24. Tau is the parameter that controls the clipping range, tau in the first epoch is interpreted as the first threshold value, the gradient computed this epoch is interpreted as the gradient of the first threshold (tau). Tau is changed from first value to second value to third value.) and the gradient of the second threshold value of the clipping function.([p. 4 4.1] "in Sec. 5, we introduce an adaptive exponential scheduler to identify the range of the “dead weights” in order to seek a balance between the quantization error and the information entropy" See Eqn. 8. [p. 7 §5.2.1] "we further propose an exponential scheduler for adapting τ along the network training. 
Our motivation lies in that τ should start with a value falling within [0.85, 0.94] to pursue a good accuracy, and then gradually go to the interval [0.96, 1.00] to stabilize the variance of performance. Based on this, given an initial τs and an end threshold τe, τi at the i-th training epoch is calculated as follows" See Eqn. 24. Tau is the parameter that controls the clipping range, tau in the second epoch is interpreted as the second threshold value, the gradient computed this epoch is interpreted as the gradient of the second threshold (tau).).

However, Xu does not explicitly teach the forward computation is performed in an order of a convolution operation, a pooling operation, a batch normalization, and an activation function, the backward computation is performed in an order of the activation function, the batch normalization, the pooling operation, the convolution operation. He, in the same field of endeavor, teaches the forward computation is performed in an order of a convolution operation, a pooling operation, a batch normalization, and an activation function,([p. 4 3.4] "We adopt batch normalization (BN) [16] right after each convolution and before activation" See also FIG. 3) the backward computation is performed in an order of the activation function, the batch normalization, the pooling operation, the convolution operation,([p. 4 3.4] "We adopt batch normalization (BN) [16] right after each convolution and before activation" See also FIG. 3). He is merely introduced as a teaching reference to explicitly reinforce the model structure used by Xu. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention that the ResNet model explicitly used by Xu, first described by He, has a known structure.

Claim 8 is rejected under 35 U.S.C. §103 as being unpatentable over the combination of Xu and Stanford (“CS231n Convolutional Neural Networks for Visual Recognition”, 2021).
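The forward ordering attributed to He (convolution, then pooling, then batch normalization, then activation) can be made concrete with a small 1-D NumPy sketch. The shapes, kernel, and pooling window here are hypothetical, chosen only to illustrate the ordering; the claimed backward computation would traverse the same stages in reverse.

```python
import numpy as np

def forward(x, w, gamma=1.0, beta=0.0, eps=1e-5):
    # 1. Convolution (cross-correlation, valid padding, stride 1).
    conv = np.convolve(x, w[::-1], mode="valid")
    # 2. Pooling: non-overlapping max pooling with window 2.
    n = len(conv) - len(conv) % 2
    pooled = conv[:n].reshape(-1, 2).max(axis=1)
    # 3. Batch normalization over the pooled features.
    norm = gamma * (pooled - pooled.mean()) / np.sqrt(pooled.var() + eps) + beta
    # 4. Activation (ReLU).
    return np.maximum(norm, 0.0)

x = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 0.0, 6.0, 1.0])
y = forward(x, np.array([1.0, 1.0]))  # 8 inputs -> 7 conv outputs -> 3 pooled features
```

Backpropagation through this pipeline applies the chain rule stage by stage in the opposite order: activation, batch normalization, pooling, convolution.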
Regarding claim 8, Xu teaches The method of claim 1 and wherein changing the range of the clipping function comprises: updating a first threshold value of the clipping function based on a gradient of the first threshold value of the clipping function; and (Xu [p. 4 4.1] "in Sec. 5, we introduce an adaptive exponential scheduler to identify the range of the “dead weights” in order to seek a balance between the quantization error and the information entropy" See Eqn. 8. [p. 7 §5.2.1] "we further propose an exponential scheduler for adapting τ along the network training. Our motivation lies in that τ should start with a value falling within [0.85, 0.94] to pursue a good accuracy, and then gradually go to the interval [0.96, 1.00] to stabilize the variance of performance. Based on this, given an initial τs and an end threshold τe, τi at the i-th training epoch is calculated as follows" See Eqn. 24. Tau is the parameter that controls the clipping range, tau in the first epoch is interpreted as the first threshold value, the gradient computed this epoch is interpreted as the gradient of the first threshold (tau).) updating a second threshold value of the clipping function based on a gradient of the second threshold value of the clipping function.(Xu [p. 4 4.1] "in Sec. 5, we introduce an adaptive exponential scheduler to identify the range of the “dead weights” in order to seek a balance between the quantization error and the information entropy" See Eqn. 8. [p. 7 §5.2.1] "we further propose an exponential scheduler for adapting τ along the network training. Our motivation lies in that τ should start with a value falling within [0.85, 0.94] to pursue a good accuracy, and then gradually go to the interval [0.96, 1.00] to stabilize the variance of performance. Based on this, given an initial τs and an end threshold τe, τi at the i-th training epoch is calculated as follows" See Eqn. 24. 
Tau is the parameter that controls the clipping range, tau in the second epoch is interpreted as the second threshold value, the gradient computed this epoch is interpreted as the gradient of the second threshold (tau).).

However, Xu does not explicitly teach updating the weight set comprises: updating the weight set by subtracting the gradient from the weight set. Stanford, in the same field of endeavor, teaches The method of claim 1, updating the weight set comprises: updating the weight set by subtracting the gradient from the weight set; ([p. 9] "W_new = W - step_size * df [...] Update in negative gradient direction. In the code above, notice that to compute W_new we are making an update in the negative direction of the gradient df since we wish our loss function to decrease, not increase." stochastic gradient descent by definition updates the weights by subtracting the gradient from the weight set.). Stanford is introduced merely to reinforce the obviousness of the claim and to show that the training in Xu, which is explicitly backpropagation with stochastic gradient descent, by definition requires updating the weight set by subtracting the gradient from the weight set.

Claims 11, 16, 17, 18, and 19 are rejected under 35 U.S.C. §103 as being unpatentable over the combination of Xu and Guo (US20200082264A1).

Regarding claim 11, Xu teaches The method of claim 10, wherein storing the result of the training operation comprises: storing the weight set, (Xu [pp. 15-16] "Quantized model link" provides links to stored binarized weight sets produced by the method.). However, Xu does not explicitly teach a first threshold value of the clipping function, and a second threshold value of the clipping function. Guo, in the same field of endeavor, teaches a first threshold value of the clipping function, and a second threshold value of the clipping function. ([¶0212] "a first goal is to find a binary expansion of W that approximates it well (as illustrated in FIG.
16), which means W ≈ ⟨B, a⟩ = Σ_{j=0}^{m−1} a_j B_j, in which B ∈ {+1, −1}^{c×w×h×m} and a ∈ R^m are the concatenations of m binary tensors {B_0, …, B_{m−1}} and the same number of scale factors {a_0, …, a_{m−1}}, respectively. The appropriate choice of B and a with a fixed m can be investigated. In particular, FIG. 16 shows approximating the real-valued weight tensor with a sum of binary scaled tensors" [¶0388] "apparatus comprises a memory to store input initial, intermediate, and final results" Guo explicitly teaches that the weight tensors comprise the binary tensors as well as scale factors (interpreted as threshold values of the clipping function), which are explicitly stored in memory.).

Xu as well as Guo are directed towards range-based neural network quantization. Therefore, Xu as well as Guo are analogous art in the same field of endeavor. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of Xu with the teachings of Guo by applying the binarization on a memory system and storing the weights (variables) in memory. While it would be obvious to one of ordinary skill in the art to store variables in memory on a computer, this is reinforced by Guo, who provides additional motivation for the combination ([¶0204] "The disclosed examples and embodiments introduce network sketching as a new way of pursuing binary-weight CNNs [...] the disclosed examples and embodiments propose two theoretical grounded algorithms, making it possible to regulate the precision of sketching for more accurate inference. Moreover, to further improve the efficiency of generated models (a.k.a., sketches), the disclosed examples and embodiments propose an algorithm to associatively implement binary tensor convolutions, with which the required number of floating-point additions and subtractions (FADDs) is likewise reduced.").
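The Stanford passage quoted above reduces to a one-line update. A hedged NumPy sketch follows; the momentum term is set to 0.9 to match Xu's stated training details, though the exact optimizer configuration is an assumption, and the learning rate here is arbitrary.

```python
import numpy as np

def sgd_step(w, grad, lr=0.1, momentum=0.9, velocity=None):
    # Move against the gradient: W_new = W - step_size * df,
    # with an optional momentum accumulator as in classical SGD.
    if velocity is None:
        velocity = np.zeros_like(w)
    velocity = momentum * velocity + grad
    return w - lr * velocity, velocity

w = np.array([1.0, -2.0])
g = np.array([0.5, -0.5])
w1, v1 = sgd_step(w, g)                # first step: plain w - lr * g
w2, v2 = sgd_step(w1, g, velocity=v1)  # second step: momentum accumulates
```

With zero initial velocity, the first step is exactly the quoted subtraction of the scaled gradient; later steps fold in the running velocity.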
Regarding claim 16, Xu teaches generating a binarized weight set by applying a clipping function to a weight set; ([Abstract] "In this paper, for the first time, we explore the influence of “dead weights” which refer to a group of weights that are barely updated during the training of BNNs, and then introduce rectified clamp unit (ReCU) to revive the “dead weights” for updating" clamp unit interpreted as clipping function.) generating output data by sequentially performing a forward computation on a binarized neural network based on input data and the binarized weight set; ([p. 6 §4.3] "in the forward propagation, we first standardize W and revive the “dead weights” using Eq. (22) and ReCU of Eq. (8), respectively. Then, we compute the scaling factor α using Eq. (13), and binarize the inputs and the revived weights using the sign function of Eq. (2). Finally, we complete the binary convolution using Eq. (3) for the forward propagation") generating a gradient of the weight set by sequentially performing a backward computation on the binarized neural network based on loss calculated from the output data; and ([p. 6 §4.3] "During backpropagation, we derive the gradients w.r.t. W and A using Eq. (5) and Eq. (6), respectively, and update W using the stochastic gradient descent (SGD) described in Sec. 5.1.") training the binarized neural network, comprising: ([p. 2 §1] "In this paper, we present a novel perspective to improve the effectiveness and training efficiency of BNN" [p. 6 §5.1] "Training Details. Our network is trained from scratch without depending on a pre-trained model. For all experiments, we use SGD for optimization with a momentum of 0.9 and the weight-decay is set to 5e-4") updating the weight set based on the gradient of the weight set; and ([p. 6 §4.3] "During backpropagation, we derive the gradients w.r.t. W and A using Eq. (5) and Eq. (6), respectively, and update W using the stochastic gradient descent (SGD) described in Sec. 
5.1") changing a range of the clipping function. ([p. 4 4.1] "in Sec. 5, we introduce an adaptive exponential scheduler to identify the range of the “dead weights” in order to seek a balance between the quantization error and the information entropy" See Eqn. 8. [p. 7 §5.2.1] "we further propose an exponential scheduler for adapting τ along the network training. Our motivation lies in that τ should start with a value falling within [0.85, 0.94] to pursue a good accuracy, and then gradually go to the interval [0.96, 1.00] to stabilize the variance of performance. Based on this, given an initial τs and an end threshold τe, τi at the i-th training epoch is calculated as follows" See Eqn. 24. Tau is the parameter that controls the clipping range.) wherein the clipping function includes a first threshold value and a second threshold value, and wherein changing the range of the clipping function comprises using a gradient of the first threshold value of the clipping function ([p. 4 4.1] "in Sec. 5, we introduce an adaptive exponential scheduler to identify the range of the “dead weights” in order to seek a balance between the quantization error and the information entropy" See Eqn. 8. [p. 7 §5.2.1] "we further propose an exponential scheduler for adapting τ along the network training. Our motivation lies in that τ should start with a value falling within [0.85, 0.94] to pursue a good accuracy, and then gradually go to the interval [0.96, 1.00] to stabilize the variance of performance. Based on this, given an initial τs and an end threshold τe, τi at the i-th training epoch is calculated as follows" See Eqn. 24. Tau is the parameter that controls the clipping range, tau in the first epoch is interpreted as the first threshold value, the gradient computed this epoch is interpreted as the gradient of the first threshold (tau). Tau is changed from first value to second value to third value.).
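The mapping above, a clipping (clamp) step followed by a scaled sign function, with the threshold τ moved along an exponential schedule, can be sketched in NumPy. The clamp, the mean-absolute-value scaling factor, and the interpolation formula are illustrative stand-ins, not Xu's exact Eq. (8), (13), or (24).

```python
import numpy as np

def clip_and_binarize(w, tau):
    # Clip weights to [-tau, tau], then binarize with a scaled sign function.
    clipped = np.clip(w, -tau, tau)
    alpha = np.mean(np.abs(clipped))  # per-tensor scaling factor
    return alpha * np.sign(clipped)

def tau_schedule(i, n_epochs, tau_s=0.85, tau_e=0.99):
    # Hypothetical exponential interpolation from tau_s toward tau_e;
    # the decay constant 5.0 is an arbitrary illustrative choice.
    return tau_e - (tau_e - tau_s) * np.exp(-5.0 * i / n_epochs)

w = np.array([-2.0, -0.5, 0.1, 1.5])
b0 = clip_and_binarize(w, tau_schedule(0, 100))  # epoch 0: tau equals tau_s
```

At epoch 0 the threshold is τs; it then grows monotonically toward τe, widening the clipping range as training proceeds, which is the behavior the examiner reads onto the first and second threshold values.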
However, Xu does not explicitly teach A memory device comprising: processing logic; and a memory core including a plurality of memory cells and comprising data embodied in the memory cells that is executable by the processing logic to perform operations comprising:. Guo, in the same field of endeavor, teaches A memory device comprising: processing logic; and ([¶0062] "a local instance of the parallel processor memory 222 may be excluded in favor of a unified memory design that utilizes system memory in conjunction with local cache memory.") a memory core including a plurality of memory cells and comprising data embodied in the memory cells that is executable by the processing logic to perform operations comprising:([¶0062] "the memory units 224A-224N can include various types of memory devices, including dynamic random access memory (DRAM) or graphics random access memory, such as synchronous graphics random access memory (SGRAM), including graphics double data rate (GDDR) memory. In one embodiment, the memory units 224A-224N may also include 3D stacked memory, including but not limited to high bandwidth memory (HBM). Persons skilled in the art will appreciate that the specific implementation of the memory units 224A-224N can vary, and can be selected from one of various conventional designs").

Xu as well as Guo are directed towards range-based neural network quantization. Therefore, Xu as well as Guo are analogous art in the same field of endeavor. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of Xu with the teachings of Guo by applying the binarization on a memory system and storing the weights (variables) in memory.
While it would be obvious to one of ordinary skill in the art to store variables in memory on a computer, this is reinforced by Guo, who provides additional motivation for the combination ([¶0204] "The disclosed examples and embodiments introduce network sketching as a new way of pursuing binary-weight CNNs [...] the disclosed examples and embodiments propose two theoretical grounded algorithms, making it possible to regulate the precision of sketching for more accurate inference. Moreover, to further improve the efficiency of generated models (a.k.a., sketches), the disclosed examples and embodiments propose an algorithm to associatively implement binary tensor convolutions, with which the required number of floating-point additions and subtractions (FADDs) is likewise reduced.").

Regarding claim 17, the combination of Xu and Guo teaches The memory device of claim 16, wherein the operations further comprise: storing the weight set(Xu [pp. 15-16] "Quantized model link" provides links to stored binarized weight sets produced by the method.) a first threshold value of the clipping function, and a second threshold value of the clipping function as a result of training the binarized neural network;(Guo [¶0212] "a first goal is to find a binary expansion of W that approximates it well (as illustrated in FIG.
16), which means W ≈ ⟨B, a⟩ = Σ_{j=0}^{m−1} a_j B_j, in which B ∈ {+1, −1}^{c×w×h×m} and a ∈ R^m are the concatenations of m binary tensors {B_0, …, B_{m−1}} and the same number of scale factors {a_0, …, a_{m−1}}, respectively. The appropriate choice of B and a with a fixed m can be investigated. In particular, FIG. 16 shows approximating the real-valued weight tensor with a sum of binary scaled tensors" [¶0388] "apparatus comprises a memory to store input initial, intermediate, and final results" Guo explicitly teaches that the weight tensors comprise the binary tensors as well as scale factors (interpreted as threshold values of the clipping function), which are explicitly stored in memory.) generating the binarized weight set based on the stored weight set, (Xu [p. 6 §5.1] "Network Structures. On CIFAR-10, we evaluate ReCU with ResNet-18/20 [21] and VGG-Small [51]. Following the compared methods, we binarize all convolutional and fully-connected layers except the first and the last ones" [pp. 15-16] "Quantized model link" provides links to stored binarized weight sets produced by the method.) the stored first threshold value of the clipping function, and the stored second threshold value of the clipping function; and(Guo [¶0212] "As described above, a first goal is to find a binary expansion of W that approximates it well (as illustrated in FIG. 16), which means W ≈ ⟨B, a⟩ = Σ_{j=0}^{m−1} a_j B_j, in which B ∈ {+1, −1}^{c×w×h×m} and a ∈ R^m are the concatenations of m binary tensors {B_0, …, B_{m−1}} and the same number of scale factors {a_0, …, a_{m−1}}, respectively. The appropriate choice of B and a with a fixed m can be investigated. In particular, FIG. 16 shows approximating the real-valued weight tensor with a sum of binary scaled tensors.") operating the binarized neural network in inference mode using the generated binarized weight set.(Xu [p. 6 §4.3] "in the forward propagation, we first standardize W and revive the “dead weights” using Eq. (22) and ReCU of Eq. (8), respectively. Then, we compute the scaling factor α using Eq. (13), and binarize the inputs and the revived weights using the sign function of Eq. (2). Finally, we complete the binary convolution using Eq. (3) for the forward propagation").
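Guo's binary expansion W ≈ Σ_j a_j B_j can be illustrated with a greedy residual-fitting sketch. The fitting procedure below is an assumption for illustration; Guo's actual sketching algorithms are more elaborate.

```python
import numpy as np

def sketch(w, m=3):
    # Greedily approximate w as a sum of m scaled binary tensors a_j * B_j,
    # fitting each term to the residual left by the previous ones.
    residual = w.astype(float).copy()
    Bs, As = [], []
    for _ in range(m):
        B = np.where(residual >= 0, 1.0, -1.0)  # binary tensor in {-1, +1}
        a = np.mean(np.abs(residual))           # least-squares scale for this B
        Bs.append(B)
        As.append(a)
        residual -= a * B
    return Bs, As

w = np.array([0.9, -0.4, 0.1, -1.2])
Bs, As = sketch(w)
approx = sum(a * B for a, B in zip(As, Bs))  # reconstruction from the expansion
```

For a fixed B = sign(r), the scale a = mean(|r|) minimizes the squared residual, so each added term shrinks the approximation error; storing only the binary tensors and their scale factors is what the examiner reads onto storing the threshold values in memory.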
Regarding claim 18, the combination of Xu and Guo teaches The memory device of claim 16, wherein the operations further comprise: storing the binarized weight set as a result of training the binarized neural network; and(Guo [¶0377] "approximating trained filters of the trained CNN by determining a basis of binary tensors and a series of scale factors." [¶0212] "a first goal is to find a binary expansion of W that approximates it well (as illustrated in FIG. 16), which means W ≈ ⟨B, a⟩ = Σ_{j=0}^{m−1} a_j B_j, in which B ∈ {+1, −1}^{c×w×h×m} and a ∈ R^m are the concatenations of m binary tensors {B_0, …, B_{m−1}} and the same number of scale factors {a_0, …, a_{m−1}}, respectively. The appropriate choice of B and a with a fixed m can be investigated. In particular, FIG. 16 shows approximating the real-valued weight tensor with a sum of binary scaled tensors" [¶0388] "apparatus comprises a memory to store input initial, intermediate, and final results") operating the binarized neural network in inference mode using the stored binarized weight set.(Xu [p. 6 §4.3] "in the forward propagation, we first standardize W and revive the “dead weights” using Eq. (22) and ReCU of Eq. (8), respectively. Then, we compute the scaling factor α using Eq. (13), and binarize the inputs and the revived weights using the sign function of Eq. (2). Finally, we complete the binary convolution using Eq. (3) for the forward propagation").

Regarding claim 19, the combination of Xu and Guo teaches The memory device of claim 16, wherein the binarized neural network is trained using the memory device or a binarized neural network training device located outside the memory device.(Guo [¶0018] "FIG. 11 illustrates exemplary embodiment of training and deployment of a deep neural network."
[¶0168] "The training framework 604 can hook into an untrained neural network 1106 and enable the untrained neural net to be trained using the parallel processing resources described herein to generate a trained neural net 1108.").

Conclusion

THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to SIDNEY VINCENT BOSTWICK whose telephone number is (571)272-4720. The examiner can normally be reached M-F 7:30am-5:00pm EST. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang, can be reached on (571)270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users.
To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /SIDNEY VINCENT BOSTWICK/Examiner, Art Unit 2124 /MIRANDA M HUANG/Supervisory Patent Examiner, Art Unit 2124

Prosecution Timeline

Feb 20, 2023
Application Filed
Oct 03, 2025
Non-Final Rejection — §102, §103, §112
Nov 07, 2025
Interview Requested
Nov 13, 2025
Examiner Interview Summary
Nov 13, 2025
Applicant Interview (Telephonic)
Jan 06, 2026
Response Filed
Feb 24, 2026
Final Rejection — §102, §103, §112 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12561604
SYSTEM AND METHOD FOR ITERATIVE DATA CLUSTERING USING MACHINE LEARNING
2y 5m to grant • Granted Feb 24, 2026
Patent 12547878
Highly Efficient Convolutional Neural Networks
2y 5m to grant • Granted Feb 10, 2026
Patent 12536426
Smooth Continuous Piecewise Constructed Activation Functions
2y 5m to grant • Granted Jan 27, 2026
Patent 12518143
FEEDFORWARD GENERATIVE NEURAL NETWORKS
2y 5m to grant • Granted Jan 06, 2026
Patent 12505340
STASH BALANCING IN MODEL PARALLELISM
2y 5m to grant • Granted Dec 23, 2025
Based on this examiner's 5 most recent grants.


Prosecution Projections

3-4
Expected OA Rounds
52%
Grant Probability
90%
With Interview (+38.2%)
4y 7m
Median Time to Grant
Moderate
PTA Risk
Based on 136 resolved cases by this examiner. Grant probability derived from career allow rate.
