Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 10/31/2025 has been entered.
Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.
The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.
The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is invoked.
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph:
(A) the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function;
(B) the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and
(C) the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function.
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function.
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function.
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Each of claims 1-19 has the form “Apparatus … configured to” followed by two or more limitations recited in functional language. As such, all the limitations in claims 1-19 are being interpreted under 35 U.S.C. 112(f).
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph.
Response to Arguments
Regarding the rejection of claims under 35 U.S.C. 103, Applicant submits that amended claim 1 is not taught by the combination of Wang and Chen. In particular, Applicant submits that Chen differs from the claimed invention in how relevance scores are distributed among the nodes in the neural network, stating in part, on pages 10-11:
In the prior art referred to by the Examiner, namely Cheng, the back propagation of the relevance scores is done according to these weights. That is, considering the wording in the independent claims, the dispute is about "further fractions at which activations of the predecessor nodes contribute to an activation of the predetermined node in the one or more influences." The phrase in question has actually been formulated with the understanding that the term "(further) fraction" in connection with "contribute" indicates that these "further fractions" relate to, i.e. are fractions of, the activation of the "Y" node, which finally results from the inbound activation contributions from the predecessor nodes ("G"), i.e. with the understanding that an accumulation of all these further fractions yields one. The Examiner, however, interprets the phrase differently or broadly to the extent that the phrase only defines that the activation of each predecessor node ("G") contributes to the "Y" node to a certain extent (further fraction), i.e. a fraction relating to, or of, its own activation (of the corresponding "G" node).
In support, Applicant cites paragraph 0023 of the specification:
In particular, in accordance with an embodiment, the relevance score determination is done by back propagating an initial relevance score at an output node of the ML predictor which may be set depending on an output activation manifesting itself at the output node of the ML predictor in one or more inferences towards input nodes of the ML predictor such as to a scaled version of that output activation, or which may be set to a default output value. In doing so, a relevance score at a predetermined node of the ML predictor is distributed onto predecessor nodes of the predetermined node according to fractions which correspond to further fractions at which activations of the predecessor nodes contribute to an activation of the predetermined node in the one or more inferences.
The relevant language of amended claim 1 is:
determine the relevance scores for the nodes and/or the node interconnections of the ML predictor by
back propagating an initial relevance score at an output of the ML predictor by
distributing a relevance score at a predetermined node of the ML predictor onto predecessor nodes of the predetermined node by, for each predecessor node,
determining a fraction based on a product between
an activations of the respective predecessor nodes contributing, in the one or more inferences, to an activation of the predetermined node by weighing the activation of the respective predecessor node with a weight of the ML predictor between the respective predecessor node and the predetermined node, and
the weight of the ML predictor between the respective predecessor node and the predetermined node, and
distributing the fraction of the relevance score at the predetermined node to the respective predecessor node.
Examiner objects to this wording below, but identifies “the respective predecessor node” in “the weight of the ML predictor between the respective predecessor node and the predetermined node” as the “predecessor node” in “distributing a relevance score at a predetermined node of the ML predictor onto predecessor nodes of the predetermined node by, for each predecessor node.”
This process is taught by Chen. Chen teaches the backpropagation of relevance scores (identified as importance scores) in Chen, paragraph 0027, “Accordingly, DCNN optimization component can perform Importance Score Back Propagation (ISBP) for optimizing a CNN. In one embodiment, the DCNN optimization component 140 can apply feature ranking on higher level features of the CNN, e.g. the inputs of the classifier. The DCNN optimization component 140 can then back propagate the importance scores to the lower layers of the CNN. Through the use of ISBP, the DCNN optimization component 140 can efficiently measure the importance of feature extractors of an entire deep neural network and can do so consistently across the network,” which teaches “back propagating an initial relevance score at an output of the ML predictor by distributing a relevance score at a predetermined node of the ML predictor onto predecessor nodes of the predetermined node.”
Chen teaches that these relevance scores are distributed proportionally according to the weights, shown in Chen, paragraph 0029, “For a neuron A, we back propagate the neuron's importance score to the neurons in the previous layer that are either fully connected (FC) layers or locally connected (convolutional layers) to A, proportionally to the weights of the connections,” which teaches “determining a fraction based on a product between an activations of the respective predecessor nodes contributing, in the one or more inferences, to an activation of the predetermined node by weighing the activation of the respective predecessor node with a weight of the ML predictor between the respective predecessor node and the predetermined node, and the weight of the ML predictor between the respective predecessor node and the predetermined node.“
The relevance score is backpropagated layer by layer, such that the relevance score at earlier layers is dependent on the accumulated fractions of relevance from later layers, as shown in Chen, paragraph 0034, “Generally, the DCNN optimization component 140 can use importance score back propagation (ISBP) to transfer the importance from downstream kernels and neurons of the CNN to upstream kernels and neurons of the CNN, based on learned weights in CNN model. For example, given the importance score of a specific neuron, the DCNN optimization component 140 can identify the neurons in the previously used to calculate the activation of neuron, then back propagate the importance to the neurons proportionally to the weights corresponding to the operation of that layer,” which teaches “distributing the fraction of the relevance score at the predetermined node to the respective predecessor node.”
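For illustration only, the following non-limiting sketch shows the weight-proportional back propagation that Examiner reads in Chen, paragraphs 0029 and 0034; the layer sizes, weights, and importance scores are hypothetical values, not values taken from Chen.

```python
import numpy as np

# Hypothetical fully connected layer: three predecessor nodes feeding two
# successor nodes.  W[i, j] is the weight from predecessor i to successor j,
# and R_succ holds the importance (relevance) scores already assigned to the
# successor nodes.
W = np.array([[0.5, 1.0],
              [0.3, 0.2],
              [0.8, 0.4]])
R_succ = np.array([1.0, 0.5])

# Chen-style importance score back propagation as Examiner reads it: each
# successor's score is split among its predecessors proportionally to the
# weights of the connections alone.
R_pred = np.zeros(W.shape[0])
for j in range(W.shape[1]):
    fractions = W[:, j] / W[:, j].sum()   # weight-proportional split
    R_pred += fractions * R_succ[j]

print(R_pred)  # importance scores transferred to the predecessor nodes
```

Under Applicant's reading, the split would instead use the products of the predecessor activations and the weights, normalized so that the fractions accumulate to one, as in the sketch given under the Claim Objections section below.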
The argument is therefore found unpersuasive.
Claim Objections
Claims 1-3 and 5-21 objected to because of the following informality. These claims recite “distributing a relevance score at a predetermined node of the ML predictor onto predecessor nodes of the predetermined node by, for each predecessor node, determining a fraction based on a product between an activations of the respective predecessor nodes contributing, in the one or more inferences, to an activation of the predetermined node by weighing the activation of the respective predecessor node with a weight of the ML predictor between the respective predecessor node and the predetermined node, and the weight of the ML predictor between the respective predecessor node and the predetermined node.”
In this limitation, Examiner identifies “the respective predecessor node” in “the weight of the ML predictor between the respective predecessor node and the predetermined node” with “predecessor node” in “distributing a relevance score at a predetermined node of the ML predictor onto predecessor nodes of the predetermined node by, for each predecessor node […]” That is, the “fraction” is the ratio of the product of the activation and the weight for a particular predecessor node (the particular predecessor node for which the relevance is being determined) to the sum of the corresponding activation-weight products over all the predecessor nodes of the predetermined node (as described on pages 13-14 of the originally filed specification, “That is, the relevance fraction back propagated from a certain node such as node 20 to a certain predecessor node thereof may be computed by multiplying the relevance of that node with a factor depending on a ratio between the activation received from that predecessor node times the weight using which the activation has contributed to the aforementioned sum of the respective node, divided by a value depending on a sum of all products between the activations of the predecessor nodes and the weights at which these activations have contributed to the weighted sum of the current node the relevance of which is to be back propagated”).
However, Examiner considers that the wording of this limitation, and in particular the use of “respective” in “the weight of the ML predictor between the respective predecessor node and the predetermined node,” renders this identification unclear. Examiner recommends rewriting this limitation for clarity.
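As a non-limiting aid to this discussion, the following sketch implements the fraction described on pages 13-14 of the originally filed specification as Examiner understands it; the function and variable names are hypothetical.

```python
import numpy as np

def distribute_relevance(a_pred, w, R_node, eps=1e-9):
    """Distribute the relevance R_node of a predetermined node onto its
    predecessor nodes.  a_pred[i] is the activation of predecessor i in the
    one or more inferences; w[i] is the weight between predecessor i and the
    predetermined node.  Each fraction is the activation-weight product of the
    respective predecessor divided by the sum of all such products, so the
    fractions accumulate to one."""
    products = a_pred * w
    fractions = products / (products.sum() + eps)
    return fractions * R_node

# Hypothetical example with three predecessor nodes.
a_pred = np.array([0.2, 0.7, 0.1])
w = np.array([0.5, 0.3, 0.8])
print(distribute_relevance(a_pred, w, R_node=1.0))
```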
Appropriate correction is required.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-3, 5-8, 10, and 20-21 are rejected under 35 U.S.C. 103 as being unpatentable over Wang et al., US Pre-Grant Publication No. 2022/0083866 (hereafter Wang), in view of Chen et al., US Pre-Grant Publication No. 2019/0122113 (hereafter Chen).
Regarding claim 1 and analogous claims 20-21:
Wang teaches:
“Apparatus for pruning and/or quantizing of a machine learning (ML) predictor”: Wang, paragraph 0005, “According to a first aspect, there is provided an apparatus [Apparatus] comprising means for performing: training a neural network by applying an optimization loss function, wherein the optimization loss function considers empirical errors and model redundancy; pruning a trained neural network [pruning and/or quantizing of a machine learning (ML) predictor] by removing one or more filters that have insignificant contributions from a set of filters; and providing the pruned neural network for transmission.”
“the apparatus being configured to determine relevance scores for portions of the ML predictor on the basis of an activation of the portions of the ML predictor manifesting itself in one or more inferences performed by the ML predictor”: Wang, paragraph 0057, “As another example, scaling factor based pruning may be applied. The filters of the set of filters may be ranked based on importance scaling factors. For example, a BatchNormalization (BN) based scaling factor may be used to quantify the importance of different filters [determine relevance scores for portions of the ML predictor].”
“prune and/or quantize the ML predictor using the relevance scores”: Wang, paragraph 0057, “The filters may be arranged in descending order of the scaling factor, e.g. the BN-based scaling factor. The filters that are below a threshold percentile p % of the ranked filters may be pruned [prune and/or quantize the ML predictor using the relevance scores].”
“wherein the ML predictor comprises nodes and node interconnections”: Wang, paragraph 0030, “A neural network (NN) is a computation graph comprising several layers of computation. Each layer comprises one or more units, where each unit performs an elementary computation. A unit [nodes] is connected to one or more other units [node interconnections], and the connection may have associated a weight.”
“and the apparatus is configured to determine the relevance scores for the nodes and/or the node interconnections of the ML predictor“: Wang, paragraph 0056, “For example, in diversity based pruning, the filters of the set of filters may be ranked based on column-wise summation of the diversity matrix (1). These summations may be used to quantify the diversity of a given filter with regard to other filters in the set of filters [determine the relevance scores for the nodes and/or the node interconnections].”
Wang does not explicitly teach
“by back propagating an initial relevance score at an output of the ML predictor by distributing a relevance score at a predetermined node of the ML predictor onto predecessor nodes of the predetermined node”
“by, for each predecessor node, determining a fraction based on a product between an activations of the respective predecessor nodes contributing, in the one or more inferences, to an activation of the predetermined node by weighing the activation of the respective predecessor node with a weight of the ML predictor between the respective predecessor node and the predetermined node, and the weight of the ML predictor between the respective predecessor node and the predetermined node”
“and distributing the fraction of the relevance score at the predetermined node to the respective predecessor node.”
Chen teaches:
“by back propagating an initial relevance score at an output of the ML predictor by distributing a relevance score at a predetermined node of the ML predictor onto predecessor nodes of the predetermined node”: Chen, paragraph 0027, “Accordingly, DCNN optimization component can perform Importance Score Back Propagation (ISBP) for optimizing a CNN. In one embodiment, the DCNN optimization component 140 can apply feature ranking on higher level features of the CNN, e.g. the inputs of the classifier. The DCNN optimization component 140 can then back propagate the importance scores to the lower layers of the CNN [back propagating an initial relevance score at an output of the ML predictor by distributing a relevance score at a predetermined node of the ML predictor onto predecessor nodes of the predetermined node]. Through the use of ISBP, the DCNN optimization component 140 can efficiently measure the importance of feature extractors of an entire deep neural network and can do so consistently across the network,”
“by, for each predecessor node, determining a fraction based on a product between an activations of the respective predecessor nodes contributing, in the one or more inferences, to an activation of the predetermined node by weighing the activation of the respective predecessor node with a weight of the ML predictor between the respective predecessor node and the predetermined node, and the weight of the ML predictor between the respective predecessor node and the predetermined node”: Chen, paragraph 0029, “For a neuron A, we back propagate the neuron's importance score to the neurons in the previous layer that are either fully connected (FC) layers or locally connected (convolutional layers) to A, proportionally to the weights of the connections”; Chen, paragraph 0034, “Generally, the DCNN optimization component 140 can use importance score back propagation (ISBP) to transfer the importance from downstream kernels and neurons of the CNN to upstream kernels and neurons of the CNN, based on learned weights in CNN model. For example, given the importance score of a specific neuron [predetermined node], the DCNN optimization component 140 can identify the neurons in the previously used to calculate the activation of neuron [predecessor nodes], then back propagate the importance to the neurons proportionally to the weights corresponding to the operation of that layer [by, for each predecessor node, determining a fraction based on a product between an activations of the respective predecessor nodes contributing, in the one or more inferences, to an activation of the predetermined node by weighing the activation of the respective predecessor node with a weight of the ML predictor between the respective predecessor node and the predetermined node, and the weight of the ML predictor between the respective predecessor node and the predetermined node]”
“and distributing the fraction of the relevance score at the predetermined node to the respective predecessor node”: Chen, paragraph 0034, “Generally, the DCNN optimization component 140 can use importance score back propagation (ISBP) to transfer the importance from downstream kernels and neurons of the CNN to upstream kernels and neurons of the CNN, based on learned weights in CNN model. For example, given the importance score of a specific neuron, the DCNN optimization component 140 can identify the neurons in the previously used to calculate the activation of neuron, then back propagate the importance to the neurons [distributing the fraction of the relevance score at the predetermined node to the respective predecessor node] proportionally to the weights corresponding to the operation of that layer.”
Chen and Wang are both related to the same field of endeavor (pruning of neural networks). It would have been obvious to a person having ordinary skill in the art prior to the effective filing date of the claimed invention to have combined the importance backpropagation of Chen to the teachings of Wang to arrive at the present invention, in order to better determine the importance of individual network components in selecting components to prune, as stated in Chen, paragraph 0015, “As such, embodiments described herein provide general processing flow that includes quantifying the importance of the features of convolutional kernels and neurons for each layer of a pre-trained CNN model, and further includes pruning the less important extractors to obtain a compact CNN structure.”
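Solely to illustrate the combination articulated above, the following hedged sketch back propagates relevance scores layer by layer and then prunes the nodes falling below a threshold percentile, in the manner Examiner reads in Chen, paragraph 0034, and Wang, paragraph 0057; the network dimensions, activations, weights, and percentile are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical network with layer sizes 4 -> 3 -> 2.
weights = [rng.normal(size=(4, 3)), rng.normal(size=(3, 2))]
activations = [rng.random(4), rng.random(3)]  # per-layer activations from an inference

# Initial relevance score at the output of the ML predictor.
relevance = np.ones(2)

# Back propagate relevance toward the input, distributing each node's score
# onto its predecessors according to their activation-weight contributions.
for W, a in zip(reversed(weights), reversed(activations)):
    contrib = np.abs(a[:, None] * W)
    fractions = contrib / (contrib.sum(axis=0, keepdims=True) + 1e-9)
    relevance = fractions @ relevance

# Wang-style pruning: nodes below a threshold percentile p% are pruned.
p = 25.0
threshold = np.percentile(relevance, p)
keep = relevance > threshold
print("input-node relevance:", relevance)
print("nodes kept after pruning:", keep)
```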
Regarding claim 2:
Wang as modified by Chen teaches the apparatus of claim 1.
Wang further teaches “configured to use a pruned and/or quantized version of the ML predictor which results from the pruning and/or quantizing, to perform one or more further inferences, and recursively repeat the determining the relevance scores and the pruning and/or quantizing on the basis of an activation of the portions of the ML predictor manifesting itself in the one or more further inferences“: Wang, paragraph 0095, “The apparatus may be further caused to, for minibatches of a training stage [one or more further inferences]: rank filters of the set of filters according to scaling factors; select the filters that are below a threshold percentile of the ranked filters; prune the selected filters temporarily during optimization of one of the minibatches; iteratively repeat the ranking, selecting and pruning for the mini-batches [recursively repeat the determining the relevance scores and the pruning and/or quantizing on the basis of an activation of the portions of the ML predictor manifesting itself in the one or more further inferences].”
Regarding claim 3:
Wang as modified by Chen teaches the apparatus of claim 1.
Wang further teaches “configured to subject a non-pruned-away and/or not quantized to zero portion of the ML predictor which results from pruning and/or quantizing, to training using training data“: Wang, paragraph 0095, “The apparatus may be further caused to, for minibatches of a training stage: rank filters of the set of filters according to scaling factors; select the filters that are below a threshold percentile of the ranked filters; prune the selected filters temporarily during optimization of one of the minibatches; iteratively repeat the ranking, selecting and pruning for the mini-batches [showing pruning occurs during training, hence, a non-pruned-away and/or not quantized to zero portion of the ML predictor which results from pruning and/or quantizing is subject to training using training data].”
Regarding claim 5:
Wang as modified by Chen teaches the apparatus of claim 1.
Wang further teaches “configured to determine the relevance score for a predetermined portion of the ML predictor, composed of more than one node and/or node inter connection of the ML predictor by aggregating the relevance scores of the more than one node and/or node interconnection the predetermined portion is composed of”: Wang, paragraph 0056, “The trained neural network may be pruned by removing one or more filters that have insignificant contribution from a set of filters. There are alternative pruning schemes. For example, in diversity based pruning, the filters of the set of filters may be ranked based on column-wise summation of the diversity matrix (1) [the relevance score for a predetermined portion of the ML predictor, composed of more than one node and/or node inter connection of the ML predictor by aggregating the relevance scores of the more than one node and/or node interconnection the predetermined portion is composed of]. These summations may be used to quantify the diversity of a given filter with regard to other filters in the set of filters.”
Regarding claim 6:
Wang as modified by Chen teaches the apparatus of claim 5.
Wang further teaches “configured to determine the predetermined portion by analyzing the distribution of relevance scores over the ML predictor”: Wang, paragraph 0056, “The trained neural network may be pruned by removing one or more filters that have insignificant contribution from a set of filters. There are alternative pruning schemes. For example, in diversity based pruning, the filters of the set of filters may be ranked based on column-wise summation of the diversity matrix (1). These summations may be used to quantify the diversity of a given filter with regard to other filters in the set of filters [determine the predetermined portion by analyzing the distribution of relevance scores over the ML predictor]. The filters may be arranged in descending order of the column-wise summations of the diversities. The filters that are below a threshold percentile p % of the ranked filters may be pruned.”
Regarding claim 7:
Wang as modified by Chen teaches the apparatus of claim 1.
Wang further teaches “configured to, in pruning and/or quantizing the ML predictor using the relevance scores, prune away predetermined portions of the ML predictor whose relevance according to the relevance score determined for the predetermined portions is lower than a predetermined threshold”: Wang, paragraph 0057, “The filters may be arranged in descending order of the scaling factor, e.g. the BN-based scaling factor. The filters that are below a threshold percentile p % of the ranked filters may be pruned [according to the relevance score determined for the predetermined portions is lower than a predetermined threshold].”
Regarding claim 8:
Wang as modified by Chen teaches the apparatus of claim 1.
Wang further teaches “configured to, in pruning and/or quantizing the ML predictor using the relevance scores, prune away predetermined first portions of the ML predictor whose relevance according to the relevance score determined for the predetermined nodes fulfills a predetermined criterion, and second portions which contribute to an output of the ML predictor via the first portions exclusively”: Wang, paragraph 0077, “Clearly one can observe based on the line 430 that, once the pruning loss is incorporated into the optimization objective function, i.e. minimization objective function, scaling factors associated with pruned filters are significantly suppressed [hence, the scaling factors of second portions which contribute to an output of the ML predictor via the first portions exclusively are minimized] while scaling factors are enhanced for remaining filters”; Wang, paragraph 0095, “The apparatus may be further caused to, for mini-batches of a training stage: rank filters of the set of filters according to scaling factors; select the filters that are below a threshold percentile of the ranked filters; prune the selected filters temporarily during optimization of one of the mini batches [prune away predetermined first portions of the ML predictor whose relevance according to the relevance score determined for the predetermined nodes fulfills a predetermined criterion]; iteratively repeat the ranking, selecting and pruning for the mini-batches [prune away … second portions which contribute to an output of the ML predictor via the first portions exclusively].”
Regarding claim 10:
Wang as modified by Chen teaches the apparatus of claim 1.
Wang further teaches “configured to, in pruning and/or quantizing the ML predictor using the relevance scores, prune away predetermined portions of the ML predictor whose relevance according to the relevance score determined for the predetermined portions is lower than the relevance of more than a predetermined fraction of portions of the ML predictor”: Wang, paragraph 0056, “The trained neural network may be pruned by removing one or more filters that have insignificant contribution from a set of filters. There are alternative pruning schemes. For example, in diversity based pruning, the filters of the set of filters may be ranked based on column-wise summation of the diversity matrix (1). These summations may be used to quantify the diversity of a given filter with regard to other filters in the set of filters. The filters may be arranged in descending order of the column-wise summations of the diversities. The filters that are below a threshold percentile p % of the ranked filters may be pruned [prune away predetermined portions of the ML predictor whose relevance according to the relevance score determined for the predetermined portions is lower than the relevance of more than a predetermined fraction of portions of the ML predictor].”
Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Wang as modified by Chen, in view of He et al., “Channel Pruning for Accelerating Very Deep Neural Networks,” 2017, arXiv:1707.06168 (hereafter He).
Wang as modified by Chen teaches the apparatus of claim 7.
Wang further teaches (bold only) “configured to, in pruning and/or quantizing the ML predictor using the relevance scores, perform the pruning away so that the predetermined threshold decreases towards an output of the ML predictor”: Wang, paragraph 0057, “The filters may be arranged in descending order of the scaling factor, e.g. the BN-based scaling factor. The filters that are below a threshold percentile p % of the ranked filters may be pruned [in pruning and/or quantizing the ML predictor using the relevance scores].”
Wang as modified by Chen does not explicitly teach (bold only) “configured to, in pruning and/or quantizing the ML predictor using the relevance scores, perform the pruning away so that the predetermined threshold decreases towards an output of the ML predictor”
He teaches (bold only) “configured to, in pruning and/or quantizing the ML predictor using the relevance scores, perform the pruning away so that the predetermined threshold decreases towards an output of the ML predictor”: He, section 4.1.1, paragraph 3, “Also notice that channel pruning gradually becomes hard, from shallower to deeper layers. It indicates that shallower layers have much more redundancy, which is consistent with [52]. We could prune more aggressively on shallower layers in whole model acceleration [perform the pruning away so that the predetermined threshold decreases towards an output of the ML predictor].”
He and Wang are both related to the same field of endeavor (pruning of neural networks). Wang teaches the use of a relevance threshold in the determination of pruning model components. He teaches that model components may be more aggressively pruned in shallower layers (i.e., layers further from the output). It would have been obvious to a person having ordinary skill in the art prior to the effective filing date of the claimed invention to have combined the aggressive earlier-layer pruning of He to the threshold-based pruning of Wang to arrive at the present invention, in order to take advantage of shallow layer redundancy to prune more of the model, as stated in He, section 4.1.1, paragraph 3, “Also notice that channel pruning gradually becomes hard, from shallower to deeper layers. It indicates that shallower layers have much more redundancy, which is consistent with [52]. We could prune more aggressively on shallower layers in whole model acceleration.”
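To illustrate the depth-dependent pruning threshold addressed by this rejection, a brief hypothetical sketch follows; the layer count, scores, and threshold schedule are assumed values, not values taken from He or Wang.

```python
import numpy as np

# Hypothetical per-layer relevance scores, indexed from the input (layer 0)
# toward the output (last layer).
layer_scores = [np.array([0.10, 0.40, 0.05, 0.80]),
                np.array([0.30, 0.60, 0.20]),
                np.array([0.70, 0.90])]

# A pruning threshold that decreases toward the output: shallower layers are
# pruned more aggressively, deeper layers more conservatively.
thresholds = np.linspace(0.5, 0.1, num=len(layer_scores))

for depth, (scores, t) in enumerate(zip(layer_scores, thresholds)):
    keep = scores >= t
    print(f"layer {depth}: threshold {t:.2f}, keeping {keep.sum()} of {len(scores)} nodes")
```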
Claims 11-15 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Wang as modified by Chen, in view of Choi et al., US Pre-Grant Publication No. 2018/0107926 (hereafter Choi).
Regarding claim 11:
Wang as modified by Chen teaches the apparatus of claim 1.
Wang further teaches (bold only) “configured to prune and/or quantize the ML predictor using an optimization scheme with an objective function which depends on a weighted distance between quantized weights and unquantized weights of the ML predictor, weighted based on the relevance scores”: Wang, paragraph 0048, “In the method disclosed herein, the optimization loss function [an optimization scheme], i.e. the objective function [an objective function] of filter diversity enhanced NN learning may be formulated by:
[The diversity-enhanced optimization loss function is reproduced as an image (media_image1.png) in Wang and is not reproduced here.]
wherein λ is the parameter to control relative significance of the original task and the filter diversity enhancement term KΘ, and Θ is the parameter to measure filter diversities used in function K [weighted based on the relevance scores]. W* above represents the first loss function.”
Wang as modified by Chen does not explicitly teach (bold only) “configured to prune and/or quantize the ML predictor using an optimization scheme with an objective function which depends on a weighted distance between quantized weights and unquantized weights of the ML predictor, weighted based on the relevance scores.”
Choi teaches (bold only) “configured to prune and/or quantize the ML predictor using an optimization scheme with an objective function which depends on a weighted distance between quantized weights and unquantized weights of the ML predictor, weighted based on the relevance scores”: Choi, paragraph 0077, “In order to, inter alia, address these issues with conventional k-means clustering, the present disclosure describes (1) utilizing the second-order partial derivatives, i.e., the diagonal of the Hessian matrix, of the loss function with respect to the network parameters as a measure of the significance of different network parameters [weighted based on the relevance scores]; and (2) solving the network quantization problem under a constraint [an optimization scheme with an objective function] of the actual compression ratio resulting from the specific binary encoding scheme that was employed [which depends on a weighted distance between quantized weights and unquantized weights of the ML predictor]. Accordingly, the description below is broken into three sections: I. Network Quantization [configured to prune and/or quantize the ML predictor] using Hessian-Weight; II. Entropy-Constrained Network Quantization; and III. Experimental/Simulation Results.”
Choi and Wang are both related to the same field of endeavor (compression of neural networks). Wang teaches a neural network reduction method using relevancy-based pruning. Choi teaches a neural network quantization method using a k-means clustering algorithm incorporating relevancy. It would have been obvious to a person having ordinary skill in the art prior to the effective filing date of the claimed invention to have combined the quantization scheme of Choi to the pruning teachings of Wang to arrive at the present invention, in order to further reduce the complexity of a model, as stated in Choi, paragraphs 0006-0008, “As mentioned above, there may be hundreds of millions of network parameters/weights which require a significant amount of memory to be stored. Accordingly, although deep neural networks are extremely powerful, they also require a significant amount of resources to implement, particularly in terms of memory storage. […] This makes it difficult to deploy deep neural networks on devices with limited storage, such as mobile/portable devices. Accordingly, the present disclosure has been made to address at least the problems and/or disadvantages described herein and to provide at least the advantages described below.”
Regarding claim 12:
Wang as modified by Chen and Choi teaches the apparatus of claim 11.
Choi further teaches “wherein the objective function depends on a sum of the weighted distance and a code length of the quantized weights”: Choi, paragraphs 0072-0073: “Assuming k-means clustering is used with variable-length encoding, let Ci be the number of network parameters in cluster i, where 1<=i<=k, and bi is the number of bits of the codeword assigned for the network parameters in cluster i (i.e., the codeword for cluster i). Thus, the binary codewords are only Σi=1k |Ci|bi bits rather than Nb bits. For a lookup table that stores the k binary codewords (bi bits for 1<=i<=k) with their matching quantized network parameter values (b bits each), an additional Σi=1k bi+kb is needed. Thus, in Equation (6)(a):
[Equation (6)(a) is reproduced as an image (media_image2.png) in Choi and is not reproduced here.]
As seen above, in k-means clustering with variable-length encoding, the compression ratio depends not only on the number of clusters, i.e., k, but also on the sizes of the various clusters, i.e., |Ci|, and the assigned number of bits for each cluster codeword, i.e., bi for 1<=i<=k [hence, the distance between codewords, and therefore the quantization error, depends on a code length of the quantized weights]”; Choi, paragraph 0093, “Next, Equation (7)(c) is connected to the problem of network quantization by treating δw′i as the quantization error of network parameter wi at its local optimum wi = w′i, i.e., δw′i = ŵi − w′i, where ŵi is a quantized value of w′i. Thus, the local impact of quantization on the average loss function at w = w′ can be quantified approximately as Equation (7)(d) [the objective function depends on a sum of the weighted distance].”
Choi and Wang are combinable for the rationale given under claim 11.
Regarding claim 13:
Wang as modified by Chen and Choi teaches the apparatus of claim 11.
Choi further teaches “wherein the optimization scheme is a k-means clustering“: Choi, paragraph 0077, “In order to, inter alia, address these issues with conventional k-means clustering [wherein the optimization scheme is a k-means clustering], the present disclosure describes (1) utilizing the second-order partial derivatives, i.e., the diagonal of the Hessian matrix, of the loss function with respect to the network parameters as a measure of the significance of different network parameters; and (2) solving the network quantization problem under a constraint of the actual compression ratio resulting from the specific binary encoding scheme that was employed. Accordingly, the description below is broken into three sections: I. Network Quantization using Hessian-Weight; II. Entropy-Constrained Network Quantization; and III. Experimental/Simulation Results.”
Choi and Wang are combinable for the rationale given under claim 11.
Regarding claim 14:
Wang as modified by Chen and Choi teaches the apparatus of claim 13.
Choi further teaches:
“for each iteration of the k-means clustering, a cluster formation step associating each ML predictor node interconnection with one of a plurality of quantization values so as to reduce the optimization function”: Choi, paragraph 0104, “At 620, each network parameter is assigned to the cluster whose center is closest to the parameter value. This may be done by partitioning the points according to the Voronoi diagram generated by the cluster centers [a cluster formation step associating each ML predictor node interconnection with one of a plurality of quantization values so as to reduce the optimization function]. At 630, after all of the network parameters have been (re-)assigned in 620, a new set of Hessian-weighted cluster center means are calculated. At 640, the number of iterations is updated ("n= n+ 1 ") and, at 645, it is determined whether the iterative process has ended [for each iteration of the k-means clustering].”
“and a quantizer update step updating each quantization value using a weighted sum of the unquantized weights of the predictor node interconnection associated with the respective quantization value, weighted with a relevance score determined for the respective ML predictor node interconnection”: Choi, paragraph 0086, “In this section, the impact of quantization errors on the loss function of a neural network is analyzed and a Hessian-weight that can be used to quantify the significance of different network parameters [a relevance score] in quantization is derived”; Choi, paragraphs 0097-0100, “From Equation (7)(d), the optimal clustering that minimizes the Hessian-weighted distortion measure is given by Equation (8)(a):
[Equation (8)(a) is reproduced as an image (media_image3.png) in Choi and is not reproduced here.]
where hii is the second-order partial derivative of the loss function with respect to network parameter wi as shown in Equation (8)(b):
[Equation (8)(b) is reproduced as an image (media_image4.png) in Choi and is not reproduced here.]
and cj is the Hessian-weighted [weighted with a relevance score] mean of network parameters in cluster CJ as shown in Equation (8)(c) [using a weighted sum of the unquantized weights of the predictor node interconnection associated with the respective quantization value]:
[Equation (8)(c) is reproduced as an image (media_image5.png) in Choi and is not reproduced here.]”
Choi, paragraph 0104, “At 620, each network parameter is assigned to the cluster whose center is closest to the parameter value. This may be done by partitioning the points according to the Voronoi diagram generated by the cluster centers. At 630, after all of the network parameters have been (re-)assigned in 620, a new set of Hessian-weighted cluster center means are calculated [and a quantizer update step updating each quantization value]. At 640, the number of iterations is updated ("n= n+ 1 ") and, at 645, it is determined whether the iterative process has ended.”
Choi and Wang are combinable for the rationale given under claim 11.
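As a non-authoritative aid to the claim 14 mapping above, the following sketch follows the iteration pattern described in Choi, paragraphs 0097-0104: each parameter is assigned to the nearest cluster center (cluster formation step), and each center is then recomputed as the Hessian-weighted mean of its members (quantizer update step). The parameter values, Hessian diagonal, and cluster count are hypothetical.

```python
import numpy as np

def hessian_weighted_kmeans(w, h, k, iters=20, seed=0):
    """w: unquantized network parameters; h: diagonal Hessian entries used as
    significance (relevance) weights; k: number of quantization values."""
    rng = np.random.default_rng(seed)
    centers = rng.choice(w, size=k, replace=False).astype(float)
    for _ in range(iters):
        # Cluster formation step: assign each parameter to its nearest center.
        assign = np.argmin(np.abs(w[:, None] - centers[None, :]), axis=1)
        # Quantizer update step: Hessian-weighted mean of the members of each cluster.
        for j in range(k):
            members = assign == j
            if members.any():
                centers[j] = np.sum(h[members] * w[members]) / np.sum(h[members])
    return centers, assign

# Hypothetical parameters and Hessian diagonal.
w = np.array([0.11, 0.09, -0.42, -0.40, 0.75, 0.78, 0.02])
h = np.array([1.0, 0.5, 2.0, 0.3, 1.5, 0.2, 0.1])
centers, assign = hessian_weighted_kmeans(w, h, k=3)
print("quantization values:", centers)
print("cluster assignment:", assign)
```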
Regarding claim 15:
Wang as modified by Chen and Choi teaches the apparatus of claim 13.
Choi further teaches “configured to repeat performing the k-means clustering with, after each performance, accepting the quantized weights for predictor node interconnections for which the acceptance increases the optimization function less than a predetermined threshold or less than a predetermined fraction of remaining unquantized weights“: Choi, paragraph 0104, “At 620, each network parameter is assigned to the cluster whose center is closest to the parameter value. This may be done by partitioning the points according to the Voronoi diagram generated by the cluster centers. At 630, after all of the network parameters have been (re-)assigned in 620, a new set of Hessian-weighted cluster center means are calculated. At 640, the number of iterations is updated ("n= n+1 ") and, at 645, it is determined whether the iterative process has ended. More specifically, it is determined whether the number of iterations has exceeded a limit ("n>N") or there was no change in assignment at 620 (which means the algorithm has effectively converged) [after each performance, accepting the quantized weights for predictor node interconnections for which the acceptance increases the optimization function less than a predetermined threshold or less than a predetermined fraction of remaining unquantized weights]. If it is determined that the process has ended in 645, a codebook for quantized network parameters is generated at 650. If it is determined that the process has not ended in 645, the process repeats to assignment at 620.”
Choi and Wang are combinable for the rationale given under claim 11.
Regarding claim 17:
Wang as modified by Chen and Choi teaches the apparatus of claim 14.
Wang further teaches (bold only) “configured to re-train the ML predictor with respect to unquantized weights for which no quantized weight has yet been accepted”: Wang, paragraph 0066, “The method may comprise estimating accuracy of the network after pruning. For example, the accuracy of the image classification may be estimated using a known dataset. If the accuracy is below a threshold accuracy, the method may comprise retraining the pruned network. Then the accuracy may be estimated again, and the retraining may be repeated until the threshold accuracy is achieved [configured to re-train the ML predictor with respect to unquantized weights for which no quantized weight has yet been accepted].”
Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over Wang as modified by Chen and Choi, in view of Zeng et al., “Compressing Deep Neural Network for Facial Landmarks Detection,” 2016, Advances in Brain Inspired Cognitive Systems (hereafter Zeng).
Wang as modified by Chen and Choi teaches the apparatus of claim 15.
Wang further teaches (bold only) “configured to re-train the ML predictor with respect to unquantized weights for which no quantized weight has yet been accepted before any repetition of the k-means clustering”: Wang, paragraph 0066, “The method may comprise estimating accuracy of the network after pruning. For example, the accuracy of the image classification may be estimated using a known dataset. If the accuracy is below a threshold accuracy, the method may comprise retraining the pruned network. Then the accuracy may be estimated again, and the retraining may be repeated until the threshold accuracy is achieved [configured to re-train the ML predictor with respect to unquantized weights for which no quantized weight has yet been accepted].”
Choi further teaches (bold only) “configured to re-train the ML predictor with respect to unquantized weights for which no quantized weight has yet been accepted before any repetition of the k-means clustering”: Choi, paragraph 0104, “At 620, each network parameter is assigned to the cluster whose center is closest to the parameter value. This may be done by partitioning the points according to the Voronoi diagram generated by the cluster centers. At 630, after all of the network parameters have been (re-)assigned in 620, a new set of Hessian-weighted cluster center means are calculated. At 640, the number of iterations is updated ("n= n+1 ") and, at 645, it is determined whether the iterative process has ended. More specifically, it is determined whether the number of iterations has exceeded a limit ("n>N") or there was no change in assignment at 620 (which means the algorithm has effectively converged). If it is determined that the process has ended in 645, a codebook for quantized network parameters is generated at 650. If it is determined that the process has not ended in 645, the process repeats to assignment at 620 [any repetition of the k-means clustering].”
Choi and Wang are combinable for the rationale given under claim 11.
Wang as modified by Chen and Choi does not explicitly teach (bold only) “configured to re-train the ML predictor with respect to unquantized weights for which no quantized weight has yet been accepted before any repetition of the k-means clustering”
Zeng teaches (bold only) “configured to re-train the ML predictor with respect to unquantized weights for which no quantized weight has yet been accepted before any repetition of the k-means clustering”: Zeng, section 3.3, “We compress the network layer by layer from backward to forward. For every iteration, we use the previously trained model to calculate correlated coefficients, and then prune and quantify one additional layer [hence, after the first iteration, the network is a mix of quantized and unquantized weights]. To retrain the network, we use deep learning tools Caffe [23] and simply modify its convolutional and fully connected layers by adding another two blobs to store index and codes [re-train the ML predictor with respect to unquantized weights for which no quantized weight has yet been accepted before any repetition]. Each time before forward-propagation, we use index and codes to reconstruct the weights.”
Zeng and Wang as modified by Choi are both related to the same field of endeavor (compression of neural networks). Wang as modified by Choi teaches a method of reducing model complexity through an iterative k-means clustering-based quantization. Zeng teaches retraining the model during iteration of quantization. It would have been obvious to a person having ordinary skill in the art prior to the effective filing date of the claimed invention to have combined the model retraining of Zeng to the teachings of Wang as modified by Choi to arrive at the present invention, in order to reduce degradation of the network caused by the compression, as stated in Zeng, Abstract, “Network retraining: to reduce training difficulty and performance degradation, we iteratively retrain the network, compressing one layer at a time.”
Claims 18-19 are rejected under 35 U.S.C. 103 as being unpatentable over Wang as modified by Chen, in view of Sivaraman et al., US Pre-Grant Publication No. 2020/0342324 (hereafter Sivaraman).
Regarding claim 18:
Wang as modified by Chen teaches the apparatus of claim 1.
Wang further teaches (bold only) “replace the ML predictor with a pruned and/or quantized version of the ML predictor which results from the pruning and/or quantizing and apply the pruned and/or quantized version of the ML predictor onto further input data such as replenishments of the local input data to subject the further input data to inference”: Wang, paragraph 0066, “The method may comprise estimating accuracy of the network after pruning. For example, the accuracy of the image classification may be estimated using a known data set. If the accuracy is below a threshold accuracy, the method may comprise retraining the pruned network [a pruned and/or quantized version of the ML predictor which results from the pruning and/or quantizing and apply the pruned and/or quantized version of the ML predictor onto further input data]. Then the accuracy may be estimated again, and the retraining may be repeated until the threshold accuracy is achieved.”
Wang as modified by Chen does not explicitly teach:
“configured to retrieve a definition of the ML predictor from a server”
“apply the ML predictor onto local input data so as to make the ML predictor performing the one or more inferences“
(bold only) “replace the ML predictor with a pruned and/or quantized version of the ML predictor which results from the pruning and/or quantizing and apply the pruned and/or quantized version of the ML predictor onto further input data such as replenishments of the local input data to subject the further input data to inference”
Sivaraman teaches:
“configured to retrieve a definition of the ML predictor from a server”: Sivaraman, paragraph 0039, “As a result of the relatively limited computing power of the edge nodes 12 and as discussed in more detail below, the ANNs 24 in the disclosed embodiments are each a pruned version of a parent ANN 32 included by the centralized server 14 [configured to retrieve a definition of the ML predictor from a server]. To this end, the centralized server 14 may include a pruning application that is arranged for commissioning a pruned ANN to each of the edge nodes 12 and the ANNs 24 may be referred to as ‘pruned’. The parent ANN 32 can take any of the forms noted above, such as a CNN, or ANN according to another deep learning model that is capable of detecting and classifying objects in images.”
“apply the ML predictor onto local input data so as to make the ML predictor performing the one or more inferences“: Sivaraman, paragraph 0028, “In view of the foregoing, various embodiments and implementations are directed to systems and methods for enhancing pedestrian safety and/or conducting surveillance via object recognition using pruned neural networks, particularly on resource-constrained edge nodes of communication networks. The pruned neural networks are created by a centralized server of the communication network to be customized for each of the edge nodes on the server. The creation of the pruned neural networks includes identifying a subject of objects (or classes) of a parent neural network and selecting the filters that respond highly to those objects. The subset of objects is determined by analyzing reference images actually captured at each edge node, such that the pruned neural network is customized specifically for each particular edge node [apply the ML predictor onto local input data so as to make the ML predictor performing the one or more inferences]. Furthermore, analysis of the reference images occurs by the parent neural network, which is unpruned, fully trained, and comprehensive. Ultimately, this enables a smaller neural network to be built filter-by-filter from the results of the parent neural network.”
“replace the ML predictor with a pruned and/or quantized version of the ML predictor which results from the pruning and/or quantizing and apply the pruned and/or quantized version of the ML predictor onto further input data such as replenishments of the local input data to subject the further input data to inference”: Sivaraman, paragraph 0028, “In view of the foregoing, various embodiments and implementations are directed to systems and methods for enhancing pedestrian safety and/or conducting surveillance via object recognition using pruned neural networks, particularly on resource-constrained edge nodes of communication networks. The pruned neural networks are created by a centralized server of the communication network to be customized for each of the edge nodes on the server. The creation of the pruned neural networks includes identifying a subject of objects (or classes) of a parent neural network and selecting the filters that respond highly to those objects. The subset of objects is determined by analyzing reference images actually captured at each edge node, such that the pruned neural network is customized specifically for each particular edge node. Furthermore, analysis of the reference images occurs by the parent neural network, which is unpruned, fully trained, and comprehensive. Ultimately, this enables a smaller neural network to be built filter-by-filter from the results of the parent neural network”; Sivaraman, paragraph 0029, “By deploying the pruned neural network to each edge node that includes only the subset of objects which are relevant to each specific edge node, low-capacity or resource-constrained computing devices ( e.g., consumer electronics) can detect certain relevant objects with very high accuracy at fast speeds with minimal resources [replace the ML predictor with a pruned and/or quantized version of the ML predictor which results from the pruning and/or quantizing and apply the pruned and/or quantized version of the ML predictor onto further input data such as replenishments of the local input data to subject the further input data to inference].”
Sivaraman and Wang are both related to the same field of endeavor (neural network pruning). It would have been obvious to a person having ordinary skill in the art prior to the effective filing date of the claimed invention to have combined the server-based pruning techniques of Sivaraman to the pruning teachings of Wang to arrive at the present invention, in order to create pruned networks specialized to local data, as stated in Sivaraman, paragraph 0029, “By deploying the pruned neural network to each edge node that includes only the subset of objects which are relevant to each specific edge node, low-capacity or resource-constrained computing devices ( e.g., consumer electronics) can detect certain relevant objects with very high accuracy at fast speeds with minimal resources.”
Regarding claim 19:
Wang as modified by Chen teaches the apparatus of claim 1.
Wang further teaches (bold only) “removing portions of the general ML predictor exclusively interconnected to one or more predetermined uninterested outputs of the ML predictor to acquire the ML predictor”: Wang, paragraph 0057, “As another example, scaling factor based pruning may be applied. The filters of the set of filters may be ranked based on importance scaling factors. For example, a BatchNormalization (BN) based scaling factor may be used to quantify the importance of different filters. The scaling factor may be obtained from e.g. batch-normalization or additional scaling layer. The filters may be arranged in descending order of the scaling factor, e.g. the BN-based scaling factor. The filters that are below a threshold percentile p % of the ranked filters may be pruned [removing portions of the general ML predictor exclusively interconnected to one or more predetermined uninterested outputs of the ML predictor to acquire the ML predictor].”
Wang as modified by Chen does not explicitly teach “configured to retrieve a definition of a general ML predictor from a server.”
Sivaraman teaches “configured to retrieve a definition of a general ML predictor from a server”: Sivaraman, paragraph 0039, “As a result of the relatively limited computing power of the edge nodes 12 and as discussed in more detail below, the ANNs 24 in the disclosed embodiments are each a pruned version of a parent ANN 32 included by the centralized server 14 [configured to retrieve a definition of a general ML predictor from a server]. To this end, the centralized server 14 may include a pruning application that is arranged for commissioning a pruned ANN to each of the edge nodes 12 and the ANNs 24 may be referred to as ‘pruned’. The parent ANN 32 can take any of the forms noted above, such as a CNN, or ANN according to another deep learning model that is capable of detecting and classifying objects in images.”
Sivaraman and Wang are both related to the same field of endeavor (neural network pruning). It would have been obvious to a person having ordinary skill in the art prior to the effective filing date of the claimed invention to have combined the server-based pruning techniques of Sivaraman to the pruning teachings of Wang to arrive at the present invention, in order to create pruned networks specialized to local data, as stated in Sivaraman, paragraph 0029, “By deploying the pruned neural network to each edge node that includes only the subset of objects which are relevant to each specific edge node, low-capacity or resource-constrained computing devices ( e.g., consumer electronics) can detect certain relevant objects with very high accuracy at fast speeds with minimal resources.”
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Hongui et al., “A Novel Pruning Algorithm for Self-organizing Neural Network,” 2009, Proceedings of International Joint Conference on Neural Networks, discloses a method of pruning nodes based on a measure of sensitivity, which is described as quantification of the relevance of network parameters including nodes.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to VINCENT SPRAUL whose telephone number is (703)756-1511. The examiner can normally be reached M-F 9:00 am - 5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, MICHAEL HUNTLEY can be reached at (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/VAS/
Examiner, Art Unit 2129
/MICHAEL J HUNTLEY/Supervisory Patent Examiner, Art Unit 2129