Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Detailed Action
The following action is in response to the communication(s) received on 11/18/2025:
Claims 22, 37, 41, and 43 have been amended.
Claim 42 has been canceled.
Claim 45 has been added.
Claims 22-41 and 43-45 are pending.
Claims 22, 37, and 41 are independent claims.
Response to Arguments
Applicant’s arguments filed 11/18/2025 have been fully considered, but are not fully persuasive.
Applicant’s arguments regarding the 35 U.S.C. 101 rejection are persuasive. Thus, the eligibility rejections have been withdrawn.
Applicant’s arguments regarding the amended and new limitations have been considered for novelty and non-obviousness, but are unpersuasive.
Applicant asserts that Wiki does not teach the amended limitation “compute the diversity enhancement term as a sum of the diversities that are respectively computed for each layer of the neural network,” as Wiki only teaches computing the sum over pixel pairs. Examiner respectfully submits that Singh, not Wiki, teaches computing the correlation between filters of each layer Li (Singh, 3.4 Episode Selection); the summation involved in the expectation values of X and Y corresponds to the claimed sum.
Applicant further asserts that Singh does not teach the amended limitation “compute a respective diversity for each layer of the neural network, wherein the respective diversity for a layer of the neural network is computed based on a sum, over each filter pair within the layer, of a dot product of a normalized weight vector corresponding to a first filter of a filter pair and a normalized weight vector corresponding to a second filter of the filter pair,” as Singh merely teaches the expectation calculation, which involves the means of X and Y. Examiner respectfully submits that the expectation calculation in Singh involves calculating the sum over each filter pair and thus corresponds to the claim language. Wiki [p.6] furthermore explicitly teaches the cross-correlation equation, which involves the dot product of normalized weight vectors to measure the correlation.
Applicant further asserts that Singh does not teach computing the respective diversity for each layer of the neural network, wherein the respective diversity for a layer of the neural network is computed based on a sum over each filter pair within the layer. Examiner respectfully disagrees, as Singh does teach computing the sum over each filter pair within the layer (Singh [3.4. Episode Selection]: “In each layer Li, we find out the filter pairs that have the maximum correlation… Based on the magnitude of the correlation coefficient of each pair, filter pairs are ordered…”), wherein the summation involved in the expectation values of X and Y corresponds to the claimed sum.
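For clarity of the record, the following minimal Python sketch illustrates the examiner’s reading of the claimed per-layer diversity and its summation across layers. The sketch is the examiner’s own illustration, not Applicant’s or Singh’s implementation; all identifiers (layer_diversity, diversity_enhancement_term, filters) are hypothetical.

    import numpy as np

    def layer_diversity(filters):
        # filters: (num_filters, k) array, one flattened weight vector per
        # filter of a single layer (hypothetical layout).
        normed = filters / np.linalg.norm(filters, axis=1, keepdims=True)
        total = 0.0
        n = normed.shape[0]
        # Sum, over each filter pair within the layer, the dot product of the
        # two normalized weight vectors, as recited in the claim.
        for i in range(n):
            for j in range(i + 1, n):
                total += float(np.dot(normed[i], normed[j]))
        return total

    def diversity_enhancement_term(layers):
        # Sum of the diversities respectively computed for each layer.
        return sum(layer_diversity(f) for f in layers)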
Applicant further asserts that NIST does not teach the dot product of the normalized weight vector corresponding to the first filter of the filter pair and the normalized weight vector corresponding to the second filter of the filter pair. Examiner respectfully submits that the art rejection of claim 43 must be read in combination and not against each reference separately, and that this normalized weight vector is further taught by Wiki (p.6) with the ZNCC equation.
Applicant further asserts that the prior art does not disclose the features of the claims above. Examiner respectfully submits that each of these limitations is taught by its respective prior art reference as explained above. Applicant further asserts that a person having ordinary skill in the art would not be motivated to combine the references to arrive at Applicant’s claimed invention. Examiner respectfully submits that Applicant has not provided a reason why this would be the case, as the motivations to combine are identified in the art rejections below.
Claims 23-36, 38-40, and 43-44 remain rejected by virtue of their dependency from their respective parent claims.
Claim Objections
Claims 43 and 45 are objected to because of the following informalities:
Applicant appears to have overlooked the dependencies when amending claim 22 and canceling claim 42: claims 43 and 45 still depend from canceled claim 42. For purposes of examination, claims 43 and 45 are interpreted as depending from claim 22.
Appropriate correction is required.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 22-25 and 36-41 are rejected under 35 U.S.C. 103 as being unpatentable over Singh et al., “Leveraging Filter Correlations for Deep Model Compression” (hereinafter Singh), in view of Wiki, “Cross-correlation” (hereinafter Wiki), further in view of Liu et al., “Learning Efficient Convolutional Networks through Network Slimming” (hereinafter Liu), further in view of Huang et al., “CondenseNet: An Efficient DenseNet using Learned Group Convolutions” (hereinafter Huang), and further in view of Jeong et al., “IONN: Incremental Offloading of Neural Network Computations from Mobile Devices to Edge Servers” (hereinafter Jeong).
Regarding Claim 22, Singh teaches:
An apparatus comprising at least one processor; at least one memory including computer program code; and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to: (Singh [Experiments] Our experiments were done on TITAN GTX-1080 Ti GPU and i7-4770 CPU@3.40GHz.)
train a neural network by minimizing an optimization loss function, wherein the optimization loss function considers empirical errors, compression, and model redundancy; wherein the optimization loss function that is minimized to train the neural network comprises a combination of a task loss term..., and a diversity enhancement term, wherein the diversity enhancement term enhances diversity of filters in the neural network;
(Singh [p.4 right]
[media_image1.png: Singh Eq. (11)]
) (Note: the lambda in eq. (11) corresponds to the parameter for the diversity enhancement term; adjusting the diversity enhancement term changes the relative significance of the task loss term, thus corresponding to controlling the significance of the task loss term relative to a significance of the diversity enhancement term)
wherein the optimization loss function comprises a parameter to control a significance of the task loss term relative to a significance of the diversity enhancement term;
(Singh, p.4 right,
[media_image2.png: Singh loss equation with regularization hyper-parameter lambda]
) (Note: the lambda hyper-parameter controls the regularization term for the diversity enhancement term (Cst) within the optimization loss function, thus controlling the significance of the task loss term.)
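For illustration only, a minimal Python sketch of how a single hyper-parameter can control the significance of the task loss term relative to the diversity enhancement term; the function name, the value of lam, and the calling convention are the examiner’s hypothetical assumptions, not Singh’s code.

    def optimization_loss(task_loss, diversity_term, lam=0.1):
        # Raising lam increases the weight of the diversity enhancement term
        # and thereby lowers the relative significance of the task loss term.
        return task_loss + lam * diversity_term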
compute a respective diversity for each layer of the neural network… (Singh [3.4. Episode Selection] In each layer Li, we find out the filter pairs that have the maximum correlation… Based on the magnitude of the correlation coefficient of each pair, filter pairs are ordered
[3.4. Episode Selection] Now we can calculate the correlation coefficient for filter pair by using Eq. 7
[media_image3.png: Singh Eq. (7), correlation coefficient]
) (Note: the summation involved in the expectation value of X and Y corresponds to the sum)
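For illustration of the expectation-based correlation relied upon above, a minimal Python sketch of a Pearson correlation coefficient computed from expectations; this is the examiner’s own illustration (Eq. 7 of Singh is not reproduced here), and identifier names are hypothetical.

    import numpy as np

    def correlation_coefficient(x, y):
        # Pearson correlation via expectations: E[(X - E[X])(Y - E[Y])]
        # divided by the product of the standard deviations.
        x = np.asarray(x, dtype=float)
        y = np.asarray(y, dtype=float)
        cov = np.mean((x - x.mean()) * (y - y.mean()))
        return cov / (x.std() * y.std())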
Singh does not teach, but Wiki further teaches:
wherein the respective diversity for a layer of the neural network is computed based on a sum, over each filter pair within the layer, of a dot product of a normalized weight vector corresponding to a first filter of a filter pair and a normalized weight vector corresponding to a second filter of the filter pair; (Wiki [p.6]
[media_image4.png: Wiki p.6, zero-normalized cross-correlation (ZNCC) equation]
)
Wiki and Singh are analogous to the present invention because both are from the same field of endeavor of computing normalized cross-correlations among vectors. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to implement the dot product of normalized vectors from Wiki into Singh’s correlation coefficient. The motivation would be “for finding instances of a pattern or object within an image” (Wiki [p.6 ¶5]).
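For illustration of the relied-upon relationship, a minimal Python sketch of zero-normalized cross-correlation as the dot product of mean-subtracted, unit-norm vectors; the sketch is the examiner’s illustration of the general ZNCC form, not a reproduction of the Wiki equation, and identifier names are hypothetical.

    import numpy as np

    def zncc(u, v):
        # Subtract the mean and divide by the norm, then take the dot
        # product of the two normalized weight vectors.
        u = np.asarray(u, dtype=float) - np.mean(u)
        v = np.asarray(v, dtype=float) - np.mean(v)
        return float(np.dot(u / np.linalg.norm(u), v / np.linalg.norm(v)))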
Singh, via Singh/Wiki, further teaches:
and compute the diversity enhancement term as a sum of the diversities that are respectively computed for each layer of the neural network, (Singh [3.4. Episode Selection] In each layer Li, we find out the filter pairs that have the maximum correlation… Based on the magnitude of the correlation coefficient of each pair, filter pairs are ordered
[3.4. Episode Selection] Now we can calculate the correlation coefficient for filter pair by using Eq. 7
[media_image3.png: Singh Eq. (7), correlation coefficient]
) (Note: the summation involved in the expectation value of X and Y corresponds to the sum)
wherein the diversity enhancement term is computed to provide a measure of diversity between filters at a layer for a plurality of layers of the neural network;
(Singh [Abstract] We present a filter correlation based model compression approach for deep convolutional neural networks;
[media_image5.png: Singh Fig. 1, compression pipeline flowchart]
[media_image6.png: Singh Fig. 2, identification of correlated filter pairs]
) (Note: the entire figure 1 flowchart, not merely the rectangle labeled “optimization,” is interpreted as the optimization loss function as recited by Applicant; figure 2’s method of identifying correlated filters is interpreted as considering the model redundancy, because the filters are pruned due to being redundant, and provides details of the “optimization” rectangle of figure 1. Keeping only one of the highly correlated filters enhances the filter diversity in the layers; thus, the method in figure 2 corresponds to the diversity enhancement term. The decision diamond in figure 1 continues the loop until a threshold loss of the classification task is reached; thus, the figure 1 diamond corresponds to the task loss term.)
prune the trained neural network by removing one or more filters from the set of filters, wherein the one or more filters that are removed from the set of filters have a respective diversity that is lower than a respective diversity measurement of filters that are not among the one or more filters, as determined by the respective diversity measurements of the filters of the set of filters including the one or more filters; (Singh [3.4. Episode Selection] This set St is the collection of the ready-to-prune filter pairs from all the layers and used for further optimization such that both filters in each pair contain similar information after optimization. Therefore we can safely remove one filter from each pair.
…Now we can calculate the correlation coefficient for filter pair by using Eq. 7
[media_image7.png: Singh Eq. (7), correlation coefficient]
[media_image8.png: Singh episode selection procedure]
)
(Note: each “episode” is interpreted as the “set of filters” for which filter diversities are determined. Finding the largest correlation corresponds to finding the lowest diversity among the respective diversity measurements.)
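For illustration of the mapping above (largest correlation = lowest diversity), a minimal Python sketch that removes one filter of the most-correlated pair in a layer; this is the examiner’s hypothetical illustration, not Singh’s episode selection procedure.

    import numpy as np

    def prune_most_correlated(filters):
        # filters: (num_filters, k) flattened weights for one layer.
        n = filters.shape[0]
        best_r, drop = -1.0, None
        for i in range(n):
            for j in range(i + 1, n):
                # Magnitude of the pairwise correlation coefficient.
                r = abs(np.corrcoef(filters[i], filters[j])[0, 1])
                if r > best_r:
                    best_r, drop = r, j
        # Remove one filter of the pair with the largest correlation,
        # i.e., the filter with the lowest diversity.
        keep = [k for k in range(n) if k != drop]
        return filters[keep]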
Singh/Wiki does not teach, but Liu further teaches:
a compression loss term...
(Liu [p.3 right last ¶]
[media_image9.png: Liu Eq. (1), training objective with sparsity-induced penalty]
) (Note: the second term in (1) denotes the sparsity-induced penalty, which corresponds to the compression loss term.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply Liu’s loss function that includes the pruning loss estimation into Singh/Wiki’s filter pruning method. The motivation would be to “directly obtain a narrow network …without resorting to any special sparse computation packages.” (Liu [Scaling Factors and Sparsity-induced Penalty])
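For illustration of a sparsity-induced penalty of the kind relied upon as the compression loss term, a minimal Python sketch in the style of Liu’s Eq. (1); the function name and the value of lam are the examiner’s hypothetical assumptions.

    import numpy as np

    def training_objective(task_loss, bn_scales, lam=1e-4):
        # Task loss plus lam times the L1 norm of the channel-wise
        # scaling factors (the sparsity-induced penalty).
        return task_loss + lam * np.sum(np.abs(bn_scales))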
Singh/Wiki/Liu does not teach, but Huang further teaches:
and transmit the pruned neural network over a communication network from an endpoint device to another endpoint device…
(Huang [p.4 right ¶2] After training we remove the pruned weights and convert the sparsified model into a network with a regular connectivity pattern that can be efficiently deployed on devices with limited computational power.)
Huang and Singh/Wiki/Liu are analogous to the present invention because both are from the same field of endeavor of pruning neural networks. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to implement Huang’s method of deploying the pruned network to a device with limited computational power into Singh’s model pruning method. The motivation would be that “[a] typical set-up for deep learning on mobile devices is one where CNNs are trained on multi-GPU machines but deployed on devices with limited compute” (Huang [p.1 right ¶1]).
Singh/Wiki/Liu/Huang does not teach, but Jeong further teaches:
without transmitting the pruned neural network to a centralized server for processing. (Jeong [p.402 left ¶2] To solve this issue, we propose a new offloading approach, Incremental Offloading of Neural Network (IONN). IONN divides a client’s DNN model into several partitions and determines the order of uploading them to the server. The client uploads the partitions to the server one by one, instead of sending the entire DNN model at once. The server incrementally builds the DNN model as each DNN partition arrives, allowing the client to start offloading of DNN execution even before the entire DNN model is uploaded. That is, when there is a DNN query, the server will execute those partitions uploaded so far, while the client will execute the rest of the partitions, allowing collaborative execution. This incremental, partial DNN offloading enables mobile clients to use edge servers more quickly, improving the query performance.)
Jeong and Singh/Wiki/Liu/Huang are analogous to the present invention because both are from the same field of endeavor of efficient DNN model deployment. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to implement the edge-server computation from Jeong into Singh/Wiki/Liu/Huang’s pruning method. The motivation would be that “[t]his incremental, partial DNN offloading enables mobile clients to use edge servers more quickly, improving the query performance” (Jeong [p.402 left ¶2]).
Regarding Claim 23, the Singh/Wiki/Liu/Huang/Jeong combination of Claim 22 teaches the apparatus of Claim 22 (and thus the rejection of Claim 22 is incorporated). Singh, via Singh/Wiki/Liu/Huang/Jeong, further teaches:
the apparatus according to claim 22, wherein the apparatus is further caused to: determine filter diversities based on normalized cross correlations between weights of filters of the set of filters. (Singh [3.4. Episode Selection] Now we can calculate the correlation coefficient for filter pair by using Eq. 7
[media_image7.png: Singh Eq. (7), correlation coefficient]
[media_image8.png: Singh episode selection procedure]
)
(Note: each “episode” is interpreted as the “set of filters” for which filter diversities are determined.)
Regarding Claim 24, the Singh/Wiki/Liu/Huang/Jeong combination of Claim 22 teaches the apparatus of Claim 22 (and thus the rejection of Claim 22 is incorporated). Singh, via Singh/Wiki/Liu/Huang/Jeong, further teaches:
the apparatus according to claim 22, wherein the apparatus is further caused to: form a diversity matrix based on pair-wise normalized cross correlations quantified for a set of filter weights at layers of the neural network. (Singh [3.4. Episode Selection] In each layer Li, we find out the filter pairs that have the maximum correlation… Based on the magnitude of the correlation coefficient of each pair, filter pairs are ordered.) (Note: Forming a diversity matrix is interpreted as forming a collection of normalized cross-correlation performed on “a set of filter weights at layers of the neural network.” In addition, finding the “maximum correlation” for each layer means every possible filter correlation pair must be made, where the set of all pairs of correlations amounts to a diversity matrix.)
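For illustration of the diversity-matrix reading above, a minimal Python sketch that collects pairwise normalized cross-correlations of a layer’s filter weights into a matrix; this is the examiner’s hypothetical illustration, not Singh’s implementation.

    import numpy as np

    def diversity_matrix(filters):
        # filters: (num_filters, k) flattened weights for one layer.
        filters = np.asarray(filters, dtype=float)
        centered = filters - filters.mean(axis=1, keepdims=True)
        centered /= np.linalg.norm(centered, axis=1, keepdims=True)
        # Entry (i, j) is the normalized cross-correlation of filters i and j.
        return centered @ centered.T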
Regarding Claim 36, the Singh/Wiki/Liu/Huang/Jeong combination of Claim 22 teaches the apparatus of Claim 22 (and thus the rejection of Claim 22 is incorporated). Singh, via Singh/Wiki/Liu/Huang/Jeong, further teaches:
the apparatus according to claim 22, wherein to prune the trained neural network, the apparatus is further caused to: layer-wise prune and network-wise prune. (Singh [3.4. Episode Selection]
[media_image8.png: Singh episode selection procedure]
) (Note: R represents the layer-wise pruning, while S represents the network-wise pruning.)
Independent Claim 37 recites a method that performs precisely the operations of Claim 22. Thus, Claim 37 is rejected for the reasons set forth in Claim 22.
Claims 38-40, dependent on Claim 37, also recite precisely the subject matter of Claims 23, 26, and 27, respectively. Thus, Claims 38-40 are rejected for the reasons set forth in Claims 23, 26, and 27, respectively.
Independent Claim 41 recites a computer program comprising computer program code configured to, when executed on at least one processor, cause an apparatus (Singh [Experiments] Our experiments were done on TITAN GTX-1080 Ti GPU and i7-4770 CPU@3.40GHz.) to perform precisely the operations of Claim 22, and thus it is rejected for the reasons set forth in Claim 22.
Claim 25 is rejected under 35 U.S.C. 103 as being unpatentable over Singh/Wiki/Liu/Huang/Jeong in view of Joseph et al., “Demand Forecasting Using Automatic Machine-Learning Model Selection”, US 20200184494 A1 (hereinafter Joseph).
Regarding Claim 25, Singh, via Singh/Wiki/Liu/Huang/Jeong, teaches:
The apparatus according to claim 22, wherein the apparatus is further caused to: estimate accuracy of the pruned neural network; (Singh
[media_image10.png: Singh Fig. 1 (decision diamond)]
) (Note: the decision diamond corresponds to estimating accuracy.)
and retrain the pruned neural network. (Singh [2.2. Filter Pruning] After that, at each pruning step, re-training is needed to recover from the accuracy drop.)
Singh/Wiki/Liu/Huang/Jeong does not teach, but Joseph teaches:
retrain the pruned neural network when the accuracy of the pruned neural network is below a pre-defined threshold. (Joseph [0020] The system can be configured to determine whether to initiate a model retraining process or a model reselection process in response to changed conditions in the dataset, such as when the accuracy of the evaluated model has degraded below a threshold value (that may be user configurable), if new updates to the dataset are received over the network(s), or if new data sources become available on the network(s). Other criteria may be used for determining which machine model to select such as for example the model that is a best fit with a particular dataset based on other factors that may be uniquely configured for each different user environment.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply Joseph’s method of evaluating the accuracy of the model to decide whether to retrain the model into Singh/Wiki/Liu/Huang/Jeong’s pruning method. The motivation would be “to discover the machine learning model that… best fits with a particular dataset.” (Joseph [0049])
Claims 26-29 and 32 are rejected under 35 U.S.C. 103 as being unpatentable over the Singh/Wiki/Liu/Huang/Jeong combination as applied to claim 22 above.
Regarding Claim 26, Liu, via Singh/Wiki/Liu/Huang/Jeong, teaches:
The apparatus according to claim 22, wherein the compression loss comprises an estimated pruning loss, and wherein to train the neural network, the apparatus is further caused to minimize the pruning loss. (Liu
[media_image11.png: Liu training objective]
[media_image12.png: Liu Eq. (1), sparsity-induced penalty term]
)
(Note: “Sparsity-induced penalty on the scaling factors” is interpreted as the pruning loss calculation. The training objective is interpreted as a minimization problem, and equation (1) is interpreted as the minimization of the pruning loss.)
Regarding Claim 27, Singh/Wiki/Liu/Huang/Jeong does not teach, but Liu further teaches:
The apparatus according to claim 26, wherein the apparatus is further caused to: estimate the pruning loss, and wherein to estimate the pruning loss, the apparatus is further caused to: compute a first sum of scaling factors of the one or more filters to be removed from the set of filters after training; compute a second sum of scaling factors of the set of filters; and form a ratio of the first sum and the second sum. (Liu [Channel Pruning and Fine-tuning.] After training under channel-level sparsity-induced regularization, we obtain a model in which many scaling factors are near zero…Then we can prune channels with near-zero scaling factors, by removing all their incoming and outgoing connections and corresponding weights. We prune channels with a global threshold across all layers, which is defined as a certain percentile of all the scaling factor values. For instance, we prune 70% channels with lower scaling factors by choosing the percentile threshold as 70%. By doing so, we obtain a more compact network with less parameters and run-time memory, as well as less computing operations.) (Note: the defined certain percentile of all the scaling factor values is interpreted as forming a ratio of the first sum and the second sum of scaling factors.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply Liu’s estimation of the pruning loss into Singh/Huang’s filter pruning. The motivation would be to “obtain a more compact network with less parameters and run-time memory, as well as less computing operations.” (Liu [Channel Pruning and Fine-tuning])
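For illustration of the claimed ratio, a minimal Python sketch forming the ratio of the sum of scaling factors of the filters to be removed to the sum over the full set; identifier names are the examiner’s hypothetical assumptions.

    import numpy as np

    def estimated_pruning_loss(scales, prune_mask):
        # First sum: scaling factors of the filters slated for removal.
        # Second sum: scaling factors of the full set of filters.
        return np.sum(scales[prune_mask]) / np.sum(scales)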
Regarding Claim 28, Singh/Wiki/Liu/Huang/Jeong does not teach, but Liu further teaches:
The apparatus according to claim 26, wherein the apparatus is further caused to iteratively repeat the following for mini-batches of a training stage: rank filters of the set of filters according to scaling factors; (Liu [Leveraging the Scaling Factors in BN Layers.] It is common practice to insert a BN layer after a convolutional layer, with channel-wise scaling/shifting parameters. Therefore, we can directly leverage the γ parameters in BN layers as the scaling factors we need for network slimming. It has the great advantage of introducing no overhead to the network. In fact, this is perhaps also the most effective way we can learn meaningful scaling factors for channel pruning. 1), if we add scaling layers to a CNN without BN layer, the value of the scaling factors are not meaningful for evaluating the importance of a channel, because both convolution layers and scaling layers are linear transformations.) (Note: BN = batch normalization; the BN scaling factor γ is interpreted as the claimed scaling factor.)
select the filters that are below a threshold percentile of the ranked filters; and prune the selected filters temporarily during optimization of one of the mini-batches. (Liu [Introduction] Our approach imposes L1 regularization on the scaling factors in batch normalization (BN) layers, thus it is easy to implement without introducing any change to existing CNN architectures. Pushing the values of BN scaling factors towards zero with L1 regularization enables us to identify insignificant channels (or neurons), as each scaling factor corresponds to a specific convolutional channel (or a neuron in a fully-connected layer).) (Note: “a threshold percentile” is interpreted as scaling factors near zero. The spec reads: “Those selected filters, which are candidates to be removed after the training stage, may be switched off by enforcing their outputs to zero i.e. temporarily pruned during the optimization of one mini-batch.” The L1 regularization taught by Liu, which pushes the scaling factors towards zero, is interpreted as “pruning the selected filters temporarily during optimization of one of the mini-batches.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply Liu’s method of pruning the model based on a scaling factor and a set percentile threshold into the Singh/Wiki/Liu/Huang/Jeong combination. The motivation would be to “lower the number of computing operations, without compromising accuracy.” (Liu [Abstract])
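For illustration of ranking by scaling factor and selecting below a threshold percentile, a minimal Python sketch; the percentile value and identifier names are the examiner’s hypothetical assumptions, not Liu’s code.

    import numpy as np

    def temporary_prune_mask(bn_scales, percentile=30.0):
        # Filters whose BN scaling factors fall below the chosen percentile
        # are flagged; their outputs could be forced to zero for one
        # mini-batch (temporary pruning).
        threshold = np.percentile(bn_scales, percentile)
        return bn_scales < threshold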
Regarding Claim 29, the Singh/Wiki/Liu/Huang/Jeong combination of Claim 28 teaches the apparatus of Claim 28 (and thus the rejection of Claim 28 is incorporated). Liu, via Singh/Wiki/Liu/Huang/Jeong, further teaches:
The apparatus according to claim 28, wherein the threshold percentile is user specified and is fixed during training. (Liu [Pruning] When we prune the channels of models trained with sparsity, a pruning threshold on the scaling factors needs to be determined. Unlike in [23] where different layers are pruned by different ratios, we use a global pruning threshold for simplicity. The pruning threshold is determined by a percentile among all scaling factors, e.g., 40% or 60% channels are pruned)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply Liu’s method of pruning the model based on a set percentile threshold into the Singh/Wiki/Liu/Huang/Jeong combination. The motivation would be to “lower the number of computing operations, without compromising accuracy.” (Liu [Abstract])
Regarding Claim 32, the Singh/Wiki/Liu/Huang/Jeong combination of Claim 26 teaches the apparatus of Claim 26 (and thus the rejection of Claim 26 is incorporated). Huang, via Singh/Wiki/Liu/Huang/Jeong, further teaches:
The apparatus according to claim 26, wherein a sum of the model redundancy and the pruning loss is gradually switched off from the optimization loss function by multiplying with a factor changing from 1 to 0 during the training. (Huang [Learning rate.] We adopt the cosine shape learning rate schedule… which smoothly anneals the learning rate, and usually leads to improved accuracy … Figure 4 visualizes the learning rate as a function of training epoch (in magenta), and the corresponding training loss (blue curve) of a CondenseNet trained on the CIFAR-10 dataset... The abrupt increase in the loss at epoch 150 is caused by the final condensation operation, which removes half of the remaining weights. However, the plot shows that the model gradually recovers from this pruning step in the optimization stage.
[media_image13.png: Huang Fig. 4, cosine learning rate schedule and training loss]
)
(Note: the scale of the loss function factor is normalized from the initial training learning rate to achieve the scale of 1 to 0 during training.)
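For illustration of a factor that changes smoothly from 1 to 0 during training, a minimal Python sketch of a cosine-shaped schedule in the spirit of Huang’s learning-rate annealing; this is the examiner’s hypothetical illustration, not Huang’s code.

    import math

    def switch_off_factor(epoch, total_epochs):
        # Falls from 1 at epoch 0 to 0 at the final epoch; multiplying a
        # loss term by this factor gradually switches the term off.
        return 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))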
Claim 30 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Singh/Wiki/Liu/Huang/Jeong, and further in view of Li et al., “PRUNING FILTERS FOR EFFICIENT CONVNETS” (hereinafter Li).
Regarding Claim 30, the Singh/Wiki/Liu/Huang/Jeong combination of Claim 28 teaches the apparatus of Claim 28 (and thus the rejection of Claim 28 is incorporated). The Singh/Wiki/Liu/Huang/Jeong combination does not teach, but Li teaches:
The apparatus according to claim 28, wherein the threshold percentile is dynamically changed from 0 to a user specified target percentile. (Li [3.2] We empirically determine the number of filters to prune for each layer based on their sensitivity to pruning. For deep networks such as VGG-16 or ResNets, we observe that layers in the same stage (with the same feature map size) have a similar sensitivity to pruning. To avoid introducing layer-wise meta-parameters, we use the same pruning ratio for all layers in the same stage. For layers that are sensitive to pruning, we prune a smaller percentage of these layers or completely skip pruning them.) (Note: “dynamically changed from 0 to a user specified target percentile” is interpreted as picking a range from 0 to a user specified target percentile across the filter layer. Skipping pruning a layer is interpreted as dynamically changing the user-specified target percentile to 0 for that particular layer, thus falling within the range of the pruning percentile.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to implement Li’s dynamic threshold percentile into the Singh/Wiki/Liu/Huang/Jeong combination’s method of pruning filters from the layers. The motivation would be to account for layers that are “more sensitive to pruning than the first layers” (Li [4.3]).
Claim 31 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Singh/Wiki/Liu/Huang/Jeong, and further in view of Sivakumar et al., “Quantizing Neural Networks with Batch Normalization” (hereinafter Sivakumar), as supported by provisional application US 62/753,595 filed 10/31/2018.
Regarding Claim 31, the Singh/Wiki/Liu/Huang/Jeong combination of Claim 28 teaches the apparatus of Claim 28 (and thus the rejection of Claim 28 is incorporated). The Singh/Wiki/Liu/Huang/Jeong combination does not teach, but Sivakumar teaches:
The apparatus according to claim 28, wherein the filters are ranked according to a running average of scaling factors. (Sivakumar [Abstract] One of the methods includes receiving a first batch of training data; determining batch normalization statistics for the first batch of training data; determining a correction factor from the batch normalization statistics for the first batch of training data and the long-term moving averages of the batch normalization statistics; generating batch normalized weights from the floating point weights for the batch normalized first neural network layer, comprising applying the correction factor to the floating point weights of the batch normalized first neural network layer; quantizing the batch normalized weights; determining a gradient of an objective function; and updating the floating point weights using the gradient.) (Note: “the long-term moving averages of the batch normalization statistics” is interpreted as the claimed “running average of scaling factors.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply the running average taught by Sivakumar into the Singh/Wiki/Liu/Huang/Jeong combination’s method of ranking filters by scaling factor. The motivation would be to account for the fact that “[i]f there are outliers in the current batch, the current batch statistics will differ from the previous batch statistics and the long-term moving averages, causing undue jitter in the quantized weights” (Sivakumar ¶43) during Liu’s sparsity-induced regularization.
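For illustration of a running average of scaling factors, a minimal Python sketch of an exponential moving average; the momentum value and identifier names are the examiner’s hypothetical assumptions, not Sivakumar’s implementation.

    def update_running_average(avg, current, momentum=0.9):
        # Filters would be ranked by avg rather than by the noisy
        # per-batch scaling factors.
        return momentum * avg + (1.0 - momentum) * current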
Regarding Claim 44, the Singh/Wiki/Liu/Huang/Jeong combination of Claim 32 teaches the apparatus of Claim 32 (and thus the rejection of Claim 32 is incorporated). Singh, via Singh/Wiki/Liu/Huang/Jeong, further teaches:
The apparatus of claim 32, wherein the diversity enhancement term corresponds to the model redundancy
(Singh
[media_image6.png: Singh Fig. 2, identification of correlated filter pairs]
[media_image1.png: Singh Eq. (11)]
) (Note: Cst corresponds to the diversity enhancement term; as this helps remove highly correlated pairs, this enhances diversity, thus corresponding to model redundancy.)
Claim 33 is rejected under 35 U.S.C. 103 as being unpatentable over Singh/Wiki/Liu/Huang/Jeong in view of Signori, “CHAPTER 8: MULTICOLLINEARITY” (hereinafter Signori).
Regarding Claim 33, Singh/Wiki/Liu/Huang/Jeong does not teach, but Signori teaches:
The apparatus according to claim 22, wherein to prune the trained neural network, the apparatus is further caused to: rank filters of the set of filters based on column-wise summation of a diversity matrix. (Signori
[media_image14.png: Signori, OLS regression and multicollinearity discussion]
) (Note: OLS means ordinary least squares. The OLS regression is interpreted as being based on column-wise summation of a diversity matrix, as it considers all of the k “explanatory variables” from the regression equation.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply Signori’s multicollinearity test to Singh/Wiki/Liu/Huang/Jeong’s method of determining highly redundant filters. The motivation would be to calculate “the extent to which an explanatory variable can be explained by all the other explanatory variables in the equation.” (Signori [p.5])
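For illustration of ranking by column-wise summation of a diversity matrix, a minimal Python sketch; this is the examiner’s hypothetical illustration of the claim interpretation, not Signori’s computation.

    import numpy as np

    def rank_by_column_sum(div_matrix):
        # Sum each column; a filter whose column sums high correlates
        # strongly with the others and is a redundancy candidate.
        sums = np.sum(np.abs(div_matrix), axis=0)
        return np.argsort(sums)[::-1]  # most redundant first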
Claim 34 is rejected under 35 U.S.C. 103 as being unpatentable over Singh/Wiki/Liu/Huang/Jeong in view of Fan et al., “Response to Call for Evidence on Neural Network Compression” (hereinafter Fan).
Regarding Claim 34, Singh/Wiki/Liu/Huang/Jeong does not teach, but Fan teaches:
The apparatus according to claim 22, wherein to prune the trained neural network, the apparatus is further caused to: rank the filters of the set of filters based on an importance scaling factor; and prune the filters that are below a threshold percentile of the ranked filters. (Fan [2 Proposed evidence for NN compression] In this example method, the empirical error term in (1) is measured by cross-entropy of classification labels w.r.t. ground truth labels, and the model complexity is quantified by the channel-wise scaling factors specific to batch normalization layers. Any channels with scaling factors below a given threshold (or rank) are pruned to reduce the model complexity.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply the threshold and scaling factor taught from Fan into the pruning method of Singh/Wiki/Liu/Huang/Jeong. The motivation would be to “simultaneously learn to perform the original tasks (e.g., image classification) as well as to reduce the model size and computational complexity and memory footprints.” (Fan [2 Proposed evidence for NN compression])
Claim 35 is rejected under 35 U.S.C. 103 as being unpatentable over Singh/Wiki/Liu/Huang/Jeong in view of Fan, and further in view of Signori.
Regarding Claim 35, Singh/Wiki/Liu/Huang/Jeong does not teach, but Fan teaches:
The apparatus according to claim 22, wherein to prune the trained neural network, the apparatus is further caused to: rank the filters of the set of filters based on…an importance scaling factor; and prune the filters that are below a threshold percentile of the ranked filters. (Fan [2 Proposed evidence for NN compression] In this example method, the empirical error term in (1) is measured by cross-entropy of classification labels w.r.t. ground truth labels, and the model complexity is quantified by the channel-wise scaling factors specific to batch normalization layers. Any channels with scaling factors below a given threshold (or rank) are pruned to reduce the model complexity.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply the threshold and scaling factor taught from Fan into the pruning method of Singh/Wiki/Liu/Huang/Jeong. The motivation would be to “simultaneously learn to perform the original tasks (e.g., image classification) as well as to reduce the model size and computational complexity and memory footprints.” (Fan [2 Proposed evidence for NN compression])
Further, the combination of Singh/Wiki/Liu/Huang/Jeong/Fan does not teach, but Signori teaches:
rank the filters of the set of filters based on column-wise summation of a diversity matrix (Signori
[media_image14.png: Signori, OLS regression and multicollinearity discussion]
) (Note: OLS means ordinary least squares. The OLS regression is interpreted as being based on column-wise summation of a diversity matrix, as it considers all of the k “explanatory variables” from the regression equation.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply Signori’s multicollinearity test to the Singh/Wiki/Liu/Huang/Jeong/Fan combination’s method of determining highly redundant filters. The motivation would be to calculate “the extent to which an explanatory variable can be explained by all the other explanatory variables in the equation.” (Signori [p.5])
Claims 43 and 45 are rejected under 35 U.S.C. 103 as being unpatentable over Singh/Wiki/Liu/Huang/Jeong in view of NIST, "CORRELATION ABSOLUTE VALUE" (hereinafter NIST).
Regarding Claim 43, Singh/Wiki/Liu/Huang/Jeong teaches the apparatus of Claim 22 (and thus the rejection of Claim 22 is incorporated; Claim 43 is interpreted as depending from Claim 22, see the claim objections above). Wiki, via Singh/Wiki/Liu/Huang/Jeong, further teaches:
The apparatus of claim 42, wherein … the dot product of the normalized weight vector corresponding to the first filter of the filter pair and the normalized weight vector corresponding to the second filter of the filter pair… (Wiki [p.6]
[media_image4.png: Wiki p.6, zero-normalized cross-correlation (ZNCC) equation]
)
Singh/Wiki/Liu/Huang/Jeong does not teach, but NIST further teaches:
1 minus [above equation] is bounded to be greater than or equal to 0 (NIST, p.1, Description:
The correlation coefficient is a measure of the linear relationship between two variables…A perfect linear relationship yields a correlation coefficient of +1 (or -1 for a negative relationship) and no linear relationship yields a correlation coefficient of 0.
This command takes the absolute value of the correlation coefficient. That is, we are interested in the magnitude of the correlation… without regard to direction. For example, if we are screening a large number of pairwise correlations, we may want to identify correlations that exceed a threshold without taking into account the direction of the relationship.) (Note: a perfect linear coefficient of ±1 corresponds to 0 diversity; no linear relationship corresponds to 1 diversity; thus, 1 minus the absolute correlation is bounded between 0 and 1 in diversity.)
NIST and Singh/Wiki/Liu/Huang/Jeong are analogous to the present invention because both are from the same field of endeavor of measuring correlations. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to implement the absolute value calculation from NIST into Singh/Wiki/Liu/Huang/Jeong’s method of calculating the correlation coefficient between filters. The motivation would be that “we are interested in the magnitude of the correlation… without regard to direction. For example, if we are screening a large number of pairwise correlations, we may want to identify correlations that exceed a threshold without taking into account the direction of the relationship” (NIST, p.1).
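For illustration of the bounds discussed for claims 43 and 45, a minimal Python sketch; identifier names are the examiner’s hypothetical assumptions.

    def diversity(rho):
        # With |rho| in [0, 1], the result is bounded below by 0 (perfect
        # correlation, rho = +/-1) and above by 1 (no correlation, rho = 0).
        return 1.0 - abs(rho)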
Regarding Claim 45, Singh/Wiki/Liu/Huang/Jeong teaches the apparatus of Claim 22 (and thus the rejection of Claim 22 is incorporated; Claim 45 is interpreted as depending from Claim 22, see the claim objections above). Wiki, via Singh/Wiki/Liu/Huang/Jeong, further teaches:
The apparatus of claim 42, wherein …the dot product of the normalized weight vector corresponding to the first filter of the filter pair and the normalized weight vector corresponding to the second filter of the filter pair…(Wiki [p.6]
[media_image4.png: Wiki p.6, zero-normalized cross-correlation (ZNCC) equation]
)
Singh/Wiki/Liu/Huang/Jeong does not teach, but NIST further teaches:
…1 minus [above equation] is bounded to be less than or equal to 1. (NIST, p.1, Description:
The correlation coefficient is a measure of the linear relationship between two variables…A perfect linear relationship yields a correlation coefficient of +1 (or -1 for a negative relationship) and no linear relationship yields a correlation coefficient of 0.
This command takes the absolute value of the correlation coefficient. That is, we are interested in the magnitude of the correlation… without regard to direction. For example, if we are screening a large number of pairwise correlations, we may want to identify correlations that exceed a threshold without taking into account the direction of the relationship.) (Note: a perfect linear coefficient of ±1 corresponds to 0 diversity; no linear relationship corresponds to 1 diversity; thus, 1 minus the absolute correlation is bounded between 0 and 1 in diversity.)
NIST and Singh/Wiki/Liu/Huang/Jeong are analogous to the present invention because both are from the same field of endeavor of measuring correlations. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to implement the absolute value calculation from NIST into Singh/Wiki/Liu/Huang/Jeong’s method of calculating the correlation coefficient between filters. The motivation would be that “we are interested in the magnitude of the correlation… without regard to direction. For example, if we are screening a large number of pairwise correlations, we may want to identify correlations that exceed a threshold without taking into account the direction of the relationship” (NIST, p.1).
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JOSEP HAN whose telephone number is (703)756-1346. The examiner can normally be reached Mon-Fri 9am-5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki can be reached on (571) 272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/J.H./Examiner, Art Unit 2122
/KAKALI CHAKI/Supervisory Patent Examiner, Art Unit 2122