DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Applicant's arguments filed 11/21/2025 ("Arguments/Remarks") have been fully considered but they are not persuasive.
Argument – 1: (page: 10 – 11) Applicant contends:
“Claims 1 – 25 were rejected under 35 U.S.C. § 101 on the ground that the claims are directed towards an abstract idea without significantly more. Considering the Director Squires' Decision on Request for Rehearing in Ex parte Desjardins on September 26, 2025, Appeal 2024-000567, Application 16/319,040 (hereinafter "Director Squires' Decision"), claims 1-25 are eligible under 35 U.S.C. § 101.
Taking claim 1 for example, claim 1 is directed to compressing a DNN by generating a sequence of graph representations, clustering layers of the DNN based on the sequence of graph representations, determining a pruning ratio for a group of layers, and pruning filters in the group based on the pruning ratio to generate compressed layers, and replacing the layers in the group with the compressed layers. Such a claim is eligible in light of Director Squires' Decision.
… Similarly here, the Specification provides in paragraph [0017]: "[t]he present invention provides a learning framework that can be applied to a large variety of DNNs and provides an effective solution for group-wise filter pruning. The DNN compression method improves the efficiency of filter pruning." The Specification also provides in paragraph [0016]: "[c]ompared with filter pruning on a per-layer level, the filter pruning on a per-group level is more efficient as the number of filter groups is reduced. Accordingly, the group of layers as a whole achieves the desired sparsity level. The DNN is updated with the compressed layers in the group. The updated DNN has a higher sparsity level and smaller size." Therefore, the Specification identifies improvements in DNN efficiency, particularly in filter pruning.”
Regarding the above argument, the Examiner notes that in Ex parte Desjardins the claim is directed to a specific enhancement in how a machine learning model operates, rather than to an abstract idea or mathematical concept. There, the specification provides the required technical details explaining how model parameters are adjusted to optimize performance on a new task while protecting performance on a prior task, which addresses the technical problem of knowledge degradation, and the claim reflects the improvement disclosed in the specification. Unlike Ex parte Desjardins, in the instant application the paragraphs cited by the Applicant (¶[0016] and ¶[0017]) identify an intended improvement, namely increased pruning efficiency through group-wise pruning, but do not provide sufficient technical details explaining how that improvement is actually achieved. The disclosure merely asserts that pruning at the group level is more efficient than pruning at the layer level, without describing the specific technical mechanisms involved. As a result, the alleged improvement is presented in a conclusory manner rather than being supported by concrete technical implementation details. See MPEP 2106.04: the specification should be evaluated to determine if the disclosure provides sufficient details such that one of ordinary skill in the art would recognize the claimed invention as providing an improvement. The specification need not explicitly set forth the improvement, but it must describe the invention such that the improvement would be apparent to one of ordinary skill in the art. Conversely, if the specification explicitly sets forth an improvement but in a conclusory manner (i.e., a bare assertion of an improvement without the detail necessary to be apparent to a person of ordinary skill in the art), the Examiner should not determine the claim improves technology.
Argument – 2: (page: 12) Applicant contends:
“Taking claim 1 for example, the combination of the cited references fails to teach or render obvious "clustering the plurality of layers into groups of layers based on the sequence of graph representations, each of the groups of layers comprising a subset of the plurality of layers; determining a pruning ratio for a group of layers based on the graph representations of the layers in the group, the pruning ratio indicating a percentage of filters to be pruned from the layers in the group." The Office action refers to Coelho as disclosing the clustering step. Coelho teaches grouping layers of a neural network for quantizing inputs or outputs. However, Coelho is silent regarding grouping neural network layers for pruning filters of the layers in the group. The Office action refers to Li as teaching pruning filters. Li at best teaches a pruning ratio, like 90%, for filters of a neural network. However, Li is silent regarding any pruning ratio determination for a group of layers that is a subset of layers of the neural network. Hence, Li cannot remedy the deficiency of Coelho.”
Regarding the above argument, the Examiner respectfully disagrees with Applicant’s assertion that Coelho is silent regarding grouping neural network layers for pruning filters of the layers in the group. Read against the claim limitation, Coelho, ¶[0006], discloses grouping neural network layers, i.e., clustering the plurality of layers into groups based on the sequence of graph representations. Each group can include one or more adjacent layers, indicating that the specification recognizes layer grouping as a structural or organizational concept within the neural network. As stated by the Applicant, pruning filters based on grouped layers is explicitly taught by the Li reference (page 6). Although the Applicant asserts that Li does not teach any pruning ratio determination for a group of layers that is a subset of layers of the neural network, this concept is explicitly taught by Yu, page 3, which discloses using a graph-based representation of the network to determine pruning ratios for layer groups, which are then applied to selectively prune subsets of layers within the DNN.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claim(s) 1 – 25 are rejected under 35 U.S.C. 101 because the claimed invention is directed to a judicial exception (i.e., an abstract idea) without significantly more.
In Step 1 of the 101 analysis set forth in MPEP 2106, the Examiner has determined that the following limitations recite a process that, under the broadest reasonable interpretation, falls within one or more statutory categories (processes).
In Step 2A Prong 1 of the 101 analysis set forth in MPEP 2106, the Examiner has determined that the following limitations recite a process that, under the broadest reasonable interpretation, covers a mental process but for the recitation of generic computer components:
Regarding claim 1:
generating a sequence of graph representations based on attributes of the plurality of layers, each of the plurality of layers represented by a graph representation in the sequence; (i.e., under the broadest reasonable interpretation, the limitation recites an abstract idea, namely a mental process: it involves analyzing the attributes of the layers, recognizing structural patterns, and deciding how to generate a sequence of graphs based on the attributes).
clustering the plurality of layers into groups of layers based on the sequence of graph representations, each of the groups of layers comprising a subset of the plurality of layers;
(i.e., under the broadest reasonable interpretation, the limitation recites an abstract idea, namely a mental process: it involves analyzing the sequence of graph representations, recognizing patterns or similarities among layers, and deciding how to cluster them into groups).
determining a pruning ratio for a group of layers based on the graph representations of the layers in the group, the pruning ratio indicating a percentage of filters to be pruned from the layers in the group; (i.e., under the broadest reasonable interpretation, the limitation recites an abstract idea, namely a mental process: it involves analyzing the graph representations of the layers, observing patterns in connectivity and importance, evaluating the trade-offs between model complexity and performance, and making a judgment on the appropriate pruning ratio).
pruning filters of the layers in the group based on the pruning ratio to generate compressed layers; (i.e., under the broadest reasonable interpretation, the limitation recites an abstract idea, namely a mental process: it involves analyzing layer representations, organizing filters based on similarity, importance, or spatial relationships, and deciding a pruning ratio for compression).
If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the mental processes grouping of abstract ideas. Accordingly, the claim recites an abstract idea.
In Step 2A Prong 2 of the 101 analysis set forth in MPEP 2106, the Examiner has determined that the following additional elements do not integrate this judicial exception into a practical application, as evaluated below:
• The preamble is deemed insufficient to transform the judicial exception into a patentable invention because the preamble generally links the use of a judicial exception to a particular technological environment or field of use, see MPEP 2106.05(h).
accessing the DNN that has been trained, the DNN comprising a plurality of layers
(i.e., deemed insufficient to transform the judicial exception into a patentable invention because the limitation is directed to mere data gathering, which is considered insignificant extra-solution activity, see MPEP 2106.05(g)).
updating the DNN by replacing the layers in the group with the compressed layers.
(i.e., deemed insufficient to transform the judicial exception into a patentable invention because the limitation does not amount to more than a recitation of the words "apply it" (or an equivalent), such as mere instructions to implement an abstract idea on a computer, see MPEP 2106.05(f)).
In Step 2B of the 101 analysis set forth in the 2019 PEG, the Examiner has determined that the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception:
Regarding limitations (V and VI), which recite mere application of the abstract idea or mere instructions to implement an abstract idea on a computer, the limitations are deemed insufficient to transform the judicial exception into a patentable invention because they generally apply the use of a generic computer and/or process with the judicial exception, see MPEP 2106.05(f).
Regarding limitation (IV), the additional elements considered extra-solution or post-solution activity, as analyzed above, are activities that are well-understood, routine, and conventional; specifically, the courts have recognized the following computer functions as well-understood, routine, and conventional functions:
Receiving or transmitting data over a network, e.g., using the Internet to gather data, Symantec, 838 F.3d at 1321, 120 USPQ2d at 1362 (utilizing an intermediary computer to forward information); TLI Communications LLC v. AV Auto. LLC, 823 F.3d 607, 610, 118 USPQ2d 1744, 1745 (Fed. Cir. 2016) (using a telephone for image transmission); OIP Techs., Inc., v. Amazon.com, Inc., 788 F.3d 1359, 1363, 115 USPQ2d 1090, 1093 (Fed. Cir. 2015) (sending messages over a network); buySAFE, Inc. v. Google, Inc., 765 F.3d 1350, 1355, 112 USPQ2d 1093, 1096 (Fed. Cir. 2014) (computer receives and sends information over a network). See MPEP 2106.05(d)(II).
As analyzed above, the additional elements do not integrate the noted judicial exception into a practical application because they do not impose any meaningful limits on practicing the abstract idea. Therefore, the claim is directed to an abstract idea.
Regarding claim 11,
One or more non-transitory computer-readable media storing instructions executable to perform operations for compressing a deep neural network (DNN)
Deemed insufficient to transform the judicial exception into a patentable invention because the limitation is directed to mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea, and is considered to add the words “apply it” (or an equivalent) to the judicial exception, see MPEP 2106.05(f).
Limitations directed to using the computer as a tool for implementing an abstract idea cannot integrate a judicial exception into a practical application at Step 2A or provide an inventive concept in Step 2B.
The rest of the limitations recite subject matter analogous to claim 1, so they are rejected under a similar rationale.
Regarding claim 21,
a computer processor for executing computer program instructions; and a non-transitory computer-readable memory storing computer program instructions executable by the computer processor to perform operations
Deemed insufficient to transform the judicial exception into a patentable invention because the limitation is directed to mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea, and is considered to add the words “apply it” (or an equivalent) to the judicial exception, see MPEP 2106.05(f).
Limitations directed to using the computer as a tool for implementing an abstract idea cannot integrate a judicial exception into a practical application at Step 2A or provide an inventive concept in Step 2B.
The rest of the limitations recite subject matter analogous to claim 1, so they are rejected under a similar rationale.
Regarding claim 2, which depends from claim 1, the claim fails to resolve the deficiencies identified above by integrating the judicial exception into a practical application or introducing significantly more than the judicial exception. The claim recites:
inputting the graph representations of the layers in the group into a pre-trained graph representation neural network, the pre-trained graph representation neural network outputting the pruning ratio.
The additional limitation is directed to mere data gathering and is deemed insufficient to transform the judicial exception because the claimed elements are considered insignificant extra-solution activity that is well-understood, routine, and conventional (MPEP 2106.05(d)).
Receiving or transmitting data over a network, e.g., using the Internet to gather data, Symantec, 838 F.3d at 1321, 120 USPQ2d at 1362 (utilizing an intermediary computer to forward information); TLI Communications LLC v. AV Auto. LLC, 823 F.3d 607, 610, 118 USPQ2d 1744, 1745 (Fed. Cir. 2016) (using a telephone for image transmission); OIP Techs., Inc., v. Amazon.com, Inc., 788 F.3d 1359, 1363, 115 USPQ2d 1090, 1093 (Fed. Cir. 2015) (sending messages over a network); buySAFE, Inc. v. Google, Inc., 765 F.3d 1350, 1355, 112 USPQ2d 1093, 1096 (Fed. Cir. 2014) (computer receives and sends information over a network). See MPEP 2106.05(d)(II).
The additional limitations, as analyzed, fail to integrate the judicial exception into a practical application at Step 2A or provide an inventive concept in Step 2B, per the analysis above.
Claim(s) 12 and 22 recite similar subject matter as claim 2, so are rejected under the same rationale.
Regarding claim 3, which depends from claim 2, the claim fails to resolve the deficiencies identified above by integrating the judicial exception into a practical application or introducing significantly more than the judicial exception. The claim recites:
further training the graph representation neural network by using the pruning ratio and the graph representations of the layers in the group as a new training sample.
Deemed insufficient to transform the judicial exception into a patentable invention because the limitation is directed to mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea, and is considered to add the words “apply it” (or an equivalent) to the judicial exception, see MPEP 2106.05(f).
Limitations directed to using the computer as a tool for implementing an abstract idea cannot integrate a judicial exception into a practical application at Step 2A or provide an inventive concept in Step 2B.
Claim(s) 13 and 23 recite similar subject matter as claim 3, so are rejected under the same rationale.
Regarding claim 4, which depends from claim 3, the claim fails to resolve the deficiencies identified above by integrating the judicial exception into a practical application or introducing significantly more than the judicial exception. The claim recites:
determining whether an accuracy of the updated DNN is higher than a target accuracy
(i.e., under the broadest reasonable interpretation, the limitation recites an abstract idea, namely a mental process. It involves reviewing accuracy metrics, comparing them to the target, and making a determination based on an understanding of the model’s performance, see MPEP 2106.04).
in response to determining that the accuracy of the updated DNN is higher than the target accuracy, using the pruning ratio and the graph representations of the layers in the group as a positive training sample.
The recitation in the additional limitation simply links the judicial exception to a field of use and/or technology environment, see MPEP 2106.05(h).
Limitations directed to field of use cannot integrate a judicial exception into a practical application at Step 2A or provide an inventive concept in Step 2B.
Claim 14 recites similar subject matter as claim 4, so is rejected under the same rationale.
Regarding claim 5, which depends from claim 4, the claim fails to resolve the deficiencies identified above by integrating the judicial exception into a practical application or introducing significantly more than the judicial exception. The claim recites:
in response to determining that the accuracy of the updated DNN is lower than the target accuracy, using the pruning ratio and the graph representations of the layers in the group as a negative training sample.
The recitation in the additional limitation simply links the judicial exception to a field of use and/or technology environment, see MPEP 2106.05(h).
Limitations directed to field of use cannot integrate a judicial exception into a practical application at Step 2A or provide an inventive concept in Step 2B.
Claim 15 recites similar subject matter as claim 5, so is rejected under the same rationale.
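For illustration only, the labeling step recited in claims 4 and 5 can be sketched as follows. This is a minimal sketch and not the Applicant's implementation: the names `TrainingSample` and `label_sample` are hypothetical, and returning `None` when the accuracy equals the target is an assumption, since neither claim addresses that case.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class TrainingSample:
    """Hypothetical container for one training sample."""
    graph_representations: Tuple[str, ...]
    pruning_ratio: float
    label: str  # "positive" or "negative"


def label_sample(
    graph_representations: Sequence[str],
    pruning_ratio: float,
    accuracy: float,
    target_accuracy: float,
) -> Optional[TrainingSample]:
    """Label a (graph representations, pruning ratio) pair by comparing accuracies."""
    if accuracy > target_accuracy:
        label = "positive"  # claim 4: accuracy higher than the target
    elif accuracy < target_accuracy:
        label = "negative"  # claim 5: accuracy lower than the target
    else:
        return None  # assumption: the claims are silent on equality
    return TrainingSample(tuple(graph_representations), pruning_ratio, label)
```

Under these assumptions, an updated DNN at 0.93 accuracy against a 0.90 target yields a positive sample, and one at 0.85 yields a negative sample.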
Regarding claim 6, which depends from claim 1, the claim fails to resolve the deficiencies identified above by integrating the judicial exception into a practical application or introducing significantly more than the judicial exception. The claim recites:
wherein generating the sequence of graph representations comprises: for each respective layer of the plurality of layers: determining one or more attributes of the respective layer;
(i.e., under the broadest reasonable interpretation, the limitation recites an abstract idea, namely a mental process. It involves analyzing a layer’s characteristics, such as its number of neurons, activation functions, or weight distributions, and determining relevant attributes on that basis, see MPEP 2106.04).
generating a graph representation in the sequence based on the one or more attributes
(i.e., under the broadest reasonable interpretation, the limitation recites an abstract idea, namely a mental process. It involves analyzing the attributes of a layer, recognizing relationships among them, and constructing a visual representation of the data, see MPEP 2106.04).
Claim 16 recites similar subject matter as claim 6, so is rejected under the same rationale.
Regarding claim 7, which depends from claim 6, the claim fails to resolve the deficiencies identified above by integrating the judicial exception into a practical application or introducing significantly more than the judicial exception. The claim recites:
wherein the one or more attributes are selected from a group consisting of size of input data, size of output data, size of kernel, and some combination thereof.
The recitation in the additional limitation simply links the judicial exception to a field of use and/or technology environment, see MPEP 2106.05(h).
Limitations directed to field of use cannot integrate a judicial exception into a practical application at Step 2A or provide an inventive concept in Step 2B.
Claim 17 recites similar subject matter as claim 7, so is rejected under the same rationale.
Regarding claim 8, which depends from claim 6, the claim fails to resolve the deficiencies identified above by integrating the judicial exception into a practical application or introducing significantly more than the judicial exception. The claim recites:
wherein the DNN further comprises activations configured to apply activation functions on outputs of some of the plurality of layers; and
(i.e., under the broadest reasonable interpretation, the limitation recites an abstract idea, namely a mathematical concept: applying activation functions (e.g., ReLU, Sigmoid, Tanh) to layer outputs merely changes numerical outputs based on predefined formulas).
generating the sequence of graph representations further comprises:
(i.e., under the broadest reasonable interpretation, the limitation recites an abstract idea, namely a mental process. It involves analyzing the attributes of the layers, recognizing structural patterns, and deciding how to generate a sequence of graphs based on the attributes, see MPEP 2106.04).
for each respective activation of the activations: determining one or more attributes of the respective activation, and
(i.e., under the broadest reasonable interpretation, the limitation recites an abstract idea, namely a mental process. It involves analyzing an activation layer’s characteristics, such as its number of neurons, activation functions, or weight distributions, and determining relevant attributes on that basis).
generating a graph representation in the sequence based on the one or more attributes.
(i.e., under the broadest reasonable interpretation, the limitation recites an abstract idea, namely a mental process. It involves analyzing the attributes of the activation layers, recognizing structural patterns, and deciding how to generate a graph based on the activation layer’s attributes).
Claim(s) 18 and 24 recite similar subject matter as claim 8, so are rejected under the same rationale.
Regarding claim 9, which depends from claim 1, the claim fails to resolve the deficiencies identified above by integrating the judicial exception into a practical application or introducing significantly more than the judicial exception. The claim recites:
inputting the graphs of the layers and an evaluation metric into a graph pooling model, the graph pooling model outputting the groups,
The additional limitation is directed to mere data gathering and is deemed insufficient to transform the judicial exception because the claimed elements are considered insignificant extra-solution activity that is well-understood, routine, and conventional (MPEP 2106.05(d)).
Receiving or transmitting data over a network, e.g., using the Internet to gather data, Symantec, 838 F.3d at 1321, 120 USPQ2d at 1362 (utilizing an intermediary computer to forward information); TLI Communications LLC v. AV Auto. LLC, 823 F.3d 607, 610, 118 USPQ2d 1744, 1745 (Fed. Cir. 2016) (using a telephone for image transmission); OIP Techs., Inc., v. Amazon.com, Inc., 788 F.3d 1359, 1363, 115 USPQ2d 1090, 1093 (Fed. Cir. 2015) (sending messages over a network); buySAFE, Inc. v. Google, Inc., 765 F.3d 1350, 1355, 112 USPQ2d 1093, 1096 (Fed. Cir. 2014) (computer receives and sends information over a network). See MPEP 2106.05(d)(II).
The additional limitations, as analyzed, fail to integrate the judicial exception into a practical application at Step 2A or provide an inventive concept in Step 2B, per the analysis above.
wherein the evaluation metric comprises a target accuracy of the updated DNN.
The recitation in the additional limitation simply links the judicial exception to a field of use and/or technology environment, see MPEP 2106.05(h).
Limitations directed to field of use cannot integrate a judicial exception into a practical application at Step 2A or provide an inventive concept in Step 2B.
Claim 19 recites similar subject matter as claim 9, so is rejected under the same rationale.
Regarding claim 10, which depends from claim 1, the claim fails to resolve the deficiencies identified above by integrating the judicial exception into a practical application or introducing significantly more than the judicial exception. The claim recites:
ranking the filters based on magnitudes of weights in the filters;
(i.e., under the broadest reasonable interpretation, the limitation recites an abstract idea, namely a mental process. It involves analyzing the weights, comparing their magnitudes, and ranking the filters accordingly).
selecting one or more filters from the filters based on the ranking and the pruning ratio; and
(i.e., under the broadest reasonable interpretation, the limitation recites an abstract idea, namely a mental process. It involves reviewing the ranked filters, assessing their importance relative to the pruning goal, and selecting filters based on the ranking and the required pruning ratio).
changing magnitudes of weights in the one or more filters to zero.
(i.e., under the broadest reasonable interpretation, the limitation recites an abstract idea, namely a mental process. It involves analyzing the weight values of the filters, deciding which ones should be set to zero, and adjusting them accordingly).
Claim(s) 20 and 25 recite similar subject matter as claim 10, so are rejected under the same rationale.
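For illustration only, the three steps recited in claim 10 (ranking, selecting, and zeroing) can be sketched as follows. This is a minimal sketch under stated assumptions rather than the claimed implementation: the function name `prune_by_magnitude` is hypothetical, a filter's magnitude is taken to be the sum of the absolute values of its weights, the smallest-magnitude filters are the ones selected, and the number of selected filters is rounded down.

```python
from typing import List, Tuple


def prune_by_magnitude(
    filters: List[List[float]], pruning_ratio_percent: int
) -> Tuple[List[List[float]], List[int]]:
    """Zero out the lowest-magnitude filters; return new filters and pruned indices."""
    if not 0 <= pruning_ratio_percent <= 100:
        raise ValueError("pruning ratio must be between 0 and 100 percent")

    # Step 1: rank the filters by magnitude (assumed: sum of absolute weights).
    magnitudes = [sum(abs(w) for w in f) for f in filters]
    ranking = sorted(range(len(filters)), key=lambda i: magnitudes[i])

    # Step 2: select filters based on the ranking and the pruning ratio
    # (assumed: smallest first, count rounded down).
    num_to_prune = len(filters) * pruning_ratio_percent // 100
    selected = set(ranking[:num_to_prune])

    # Step 3: change the weights of the selected filters to zero.
    pruned = [
        [0.0] * len(f) if i in selected else list(f)
        for i, f in enumerate(filters)
    ]
    return pruned, sorted(selected)
```

With four filters and a 50% pruning ratio, the two filters with the smallest magnitudes are zeroed and the other two are left unchanged.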
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim(s) 1 – 3, 6 – 7, 11 – 13, 16 – 17 and 21 – 23 are rejected under 35 U.S.C. 103 as being unpatentable over Yu et al., “GNN-RL Compression: Topology-Aware Network Pruning using Multi-stage Graph Embedding and Reinforcement Learning” (hereafter Yu), in view of Li et al., “Pruning filters for efficient convnets” (hereafter Li), and Coelho et al., Pub. No. US20240220867A1 (hereafter Coelho).
Regarding claim 1, Yu teaches, A method for compressing a deep neural network (DNN), the method comprising:
(Yu, page: 2, “In a nutshell, we model a given DNN as hierarchical computational graphs [a deep neural network (DNN)] and propose multi-stage graph neural networks (m-GNN) to embed DNNs. Additionally, we equipped m-GNN with a reinforcement learning agent (GNN-RL) to automatically search for the compression [A method for compressing] policy (e.g., pruning ratios). To avoid tiny compression ratios due to the negative correlation between the compression ratio and RL agent’s reward, we created a DNN-Graph environment for the GNN-RL agent”)
generating a sequence of graph representations based on attributes of the plurality of layers, each of the plurality of layers represented by a graph representation in the sequence;
(Yu, Fig. 2(a – b)) [generating a sequence of graph representations based on attributes of the plurality of layers]
(Yu, page: 4, “Figure 2. A two-level hierarchical computational graph and m-GNN. The sub-graphs [each of the plurality of layers represented by a graph representation in the sequence] are painted with red, blue, and green colors”.)
accessing the DNN that has been trained, the DNN comprising a plurality of layers;
(Yu, page: 3, “We formally model a given DNN [accessing the DNN that has been trained] as an $l$-level hierarchical computational graph, such that at the $l$-th level (the top level), we would have the hierarchical computational graph set $\mathcal{G}^{l} = \{G^{l}\}$, where each item is a computational graph $G^{l} = (V^{l}, E^{l}, \mathcal{G}^{l-1})$. $V^{l}$ is the graph nodes corresponding to hidden states. $E^{l}$ is the set of directed edges with a specific edge type associated with the operations. Lastly, $\mathcal{G}^{l-1} = \{G^{l-1}_{0}, G^{l-1}_{1}, \ldots\}$ is the computational graph set at the $(l-1)$-level as well as the operation set at layer $l$ [the DNN comprising a plurality of layers].”)
determining a pruning ratio for a group of layers based on the graph representations of the layers in the group
(Yu, page: 3, “To prune a given DNN, the user provides the model size constraint (i.e., FLOPs constraint). Although we perform FLOPs-constraint filter pruning, our method is not limited to FLOPs-constraint and can be easily extended to latency, MACs, or sparsity constraint compression. Figure 1 illustrates the DNN-Graph search episode, which is essentially a model compression iteration. Red arrows show that the process starts from the original DNN. The model size evaluator first evaluates the size of the DNN. If the size is not satisfied, the graph generator converts the DNN into a hierarchical computational graph. Then, the GNN-RL agent leverages m-GNN to learn pruning ratios [determining a pruning ratio for a group of layers] from the graph [based on the graph representations of the layers in the group]. The pruner prunes the DNN with the pruning ratios.”)
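For illustration only, the search episode described in the quoted passage (evaluate size, convert to a graph, obtain pruning ratios, prune) can be sketched as a loop. This is a toy sketch and not Yu's implementation: the function name `compress_until_constraint`, the callables `size_of`, `to_graph`, and `choose_ratios_percent`, and the `max_episodes` guard are all hypothetical stand-ins, and the DNN is reduced to a list of per-layer filter counts.

```python
from typing import Callable, List, Sequence


def compress_until_constraint(
    filter_counts: Sequence[int],
    size_constraint: int,
    size_of: Callable[[List[int]], int],
    to_graph: Callable[[List[int]], tuple],
    choose_ratios_percent: Callable[[tuple], List[int]],
    max_episodes: int = 10,
) -> List[int]:
    """Repeat evaluate -> graph -> ratios -> prune until the size constraint is met."""
    model = list(filter_counts)
    for _ in range(max_episodes):
        # Model size evaluator: stop once the constraint is satisfied.
        if size_of(model) <= size_constraint:
            return model
        # Graph generator: stand-in for the hierarchical computational graph.
        graph = to_graph(model)
        # Agent: stand-in for the learned policy that outputs pruning ratios.
        ratios = choose_ratios_percent(graph)
        # Pruner: remove the given percentage of filters from each layer.
        model = [n - n * r // 100 for n, r in zip(model, ratios)]
    # Assumption: return the last model even if the constraint was not met.
    return model
```

In this toy setting, a fixed policy that always returns 25% for every layer stands in for the learned agent; a real agent would choose ratios from the graph.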
Yu does not teach:
clustering the plurality of layers into groups of layers based on the sequence of graph representations, each of the groups of layers comprising a subset of the plurality of layers;
the pruning ratio indicating a percentage of filters to be pruned from the layers in the group;
pruning filters of the layers in the group based on the pruning ratio to generate compressed layers; and updating the DNN by replacing the layers in the group with the compressed layers
Coelho teaches:
clustering the plurality of layers into groups of layers based on the sequence of graph representations,
(Coelho, “[0006] In general, one innovative aspect of the subject matter described in this specification can be embodied in methods that include the actions of receiving data representing a neural network comprising a plurality of layers arranged in a sequence, selecting one or more groups of layers from the plurality of layers, each group of layers comprising one or more layers adjacent to each other in the sequence [clustering the plurality of layers into groups of layers based on the sequence of graph representations], generating a new machine learning model that corresponds to the neural network.”)
each of the groups of layers comprising a subset of the plurality of layers;
(Coelho, page: 2, “[0007] The generation of the new machine learning model includes, for each group of layers, selecting a respective decision tree that replaces the group of layers [each of the groups of layers comprising a subset of the plurality of layers]. The respective decision tree receives as input a quantized version of the inputs to a respective first layer in the group and generates as output a quantized version of the outputs of a respective last layer in the group. The tree depth of the respective decision tree is based at least in part on a number of layers of the group.”)
Coelho and Yu are related to the same field of endeavor (i.e.: neural network compression). It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to combine the teachings of Coelho with the teachings of Yu by replacing groups of adjacent layers with quantized decision trees, thereby reducing computation and memory while preserving model functionality (Coelho, Abstract).
Yu in view of Coelho does not teach:
the pruning ratio indicating a percentage of filters to be pruned from the layers in the group;
pruning filters of the layers in the group based on the pruning ratio to generate compressed layers; and updating the DNN by replacing the layers in the group with the compressed layers
Li teaches:
the pruning ratio indicating a percentage of filters to be pruned from the layers in the group;
(Li, page: 9, “4.4 COMPARISON WITH PRUNING RANDOM FILTERS AND LARGEST FILTERS: We compare our approach with pruning random filters and largest filters. As shown in Figure 8, pruning the smallest filters outperforms pruning random filters for most of the layers at different pruning ratios. For example, smallest filter pruning has better accuracy than random filter pruning for all layers with the pruning ratio of 90% [the pruning ratio indicating a percentage of filters to be pruned from the layers in the group] (i.e.: pruning ratio of 90% means that 90% of filters were removed, leaving only 10% of the original filters). The accuracy of pruning filters with the largest $\ell_1$-norms drops quickly as the pruning ratio increases, which indicates the importance of filters with larger $\ell_1$-norms.”)
pruning filters of the layers in the group based on the pruning ratio to generate compressed layers; and updating the DNN by replacing the layers in the group with the compressed layers.
(Li, page: 6, “We implement our filter pruning method in Torch7 (Collobert et al. (2011)). When filters are pruned [pruning filters of the layers in the group based on the pruning ratio], a new model with fewer filters is created [updating the DNN by replacing the layers in the group with the compressed layers] (i.e.: after pruning, the updated DNN includes the modified (compressed) layers, replacing the original unpruned layers) and the remaining parameters of the modified layers as well as the unaffected layers are copied into the new model [to generate compressed layers]. Furthermore, if a convolutional layer is pruned, the weights of the subsequent batch normalization layer are also removed. To get the baseline accuracies for each network, we train each model from scratch and follow the same pre-processing and hyper-parameters as ResNet (He et al. (2016)).”)
Li, Yu and Coelho are related to the same field of endeavor (i.e.: neural network compression). It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to combine the teachings of Li with the teachings of Yu and Coelho to add a layer-specific sensitivity analysis to the pruning process, allowing for more targeted pruning based on the accuracy impact of each layer (Li, Abstract).
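For illustration only, the magnitude-based filter pruning described in the cited passages of Li can be sketched as follows. This is a minimal sketch under stated assumptions, not Li's Torch7 implementation: the function name `prune_filters_by_l1`, the flat-list filter layout, and the sample weights are all hypothetical. It shows that a pruning ratio of 90% removes 90% of a layer's filters (those with the smallest sums of absolute weights) and that a new, smaller layer is built from the filters that remain.

```python
import math

def prune_filters_by_l1(filters, pruning_ratio):
    """Return the filters that survive pruning, in their original order.

    filters: list of filters, each a flat list of weights (hypothetical layout).
    pruning_ratio: fraction of filters to remove, in [0, 1).
    """
    if not 0.0 <= pruning_ratio < 1.0:
        raise ValueError("pruning_ratio must be in [0, 1)")
    n_prune = math.floor(len(filters) * pruning_ratio)
    # L1 norm of each filter: the sum of its absolute weights.
    norms = [sum(abs(w) for w in f) for f in filters]
    # Filter indices ordered from smallest to largest L1 norm.
    order = sorted(range(len(filters)), key=lambda i: norms[i])
    pruned = set(order[:n_prune])
    # The compressed layer is a new list holding only the remaining filters.
    return [f for i, f in enumerate(filters) if i not in pruned]

# Ten hypothetical two-weight filters; a 90% ratio leaves one filter.
layer = [[0.1, -0.1], [2.0, 1.0], [0.0, 0.05], [-1.5, 0.5], [0.3, 0.3],
         [0.9, -0.9], [0.01, 0.0], [0.2, 0.1], [1.1, 0.0], [0.4, -0.4]]
compressed = prune_filters_by_l1(layer, 0.9)
```

In this sketch the surviving filter is the one with the largest sum of absolute weights, and the original layer list is left unmodified.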
Regarding claim 2, Yu in view of Li and Coelho teach the method of claim 1.
Yu further teaches: wherein determining the pruning ratio for the group based on the graph representations of the layers in the group comprises: inputting the graph representations of the layers in the group into a pre-trained graph representation neural network, the pre-trained graph representation neural network outputting the pruning ratio.
(Yu, page: 5, “Specifically, we perform FLOPs-constrained model compression using structured channel pruning (filter pruning) on the DNN’s convolutional layers, which are the most computationally intensive. Thus, the GNN-RL agent’s action space $A \in \mathbb{R}^{N \times 1}$, where the N is the number of pruning layers, is the pruning ratios for hidden layers: $A = a_i$, where $i = \{1, 2, ..., N\}$, and $a_i \in [0, 1)$ is the pruning ratio for $i$th layer. The GNN-RL [into a pre-trained graph representation neural network] agent makes the actions directly from the topology states: $g = \mathrm{GraphEncoder}(\mathcal{G}^l)$ (6), $A = \mathrm{MLP}(g)$ (7), where the $\mathcal{G}^l$ is the environment states, g is the graph representation [inputting the graph representations of the layers in the group] (i.e.: the Graph Encoder is responsible for processing the environment state S’ to generate the graph representation g), The MLP is a multi-layer perception neural network. The graph encoder learns the topology embedding, and the MLP projects the embedding into hidden layers’ pruning ratios [the pre-trained graph representation neural network outputting the pruning ratio] (i.e.: The MLP takes the graph embedding g and outputs pruning ratios $A = (a_1, a_2, \ldots, a_N)$). The reward function is defined in Equation 8. $R_{err} = -\mathrm{Error}$ (8), where the Error is the compressed DNN’s Top-1 error on validation set.”)
Claim(s) 12 and 22 recite limitations analogous to those of claim 2 and are therefore rejected under a similar rationale.
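For illustration only, the two-stage mapping in Yu's Equations (6) and (7), a graph encoder producing an embedding g and an MLP projecting g to per-layer pruning ratios, can be sketched as below. This is a sketch under assumptions, not Yu's m-GNN: mean pooling stands in for the graph encoder, a single linear layer with a sigmoid stands in for the MLP, and the names `graph_encoder` and `mlp_pruning_ratios` and all numeric values are hypothetical.

```python
import math

def graph_encoder(node_features):
    """Mean-pool node feature vectors into one embedding g (stand-in for a GNN)."""
    dim = len(node_features[0])
    count = len(node_features)
    return [sum(v[d] for v in node_features) / count for d in range(dim)]

def mlp_pruning_ratios(g, weights, biases):
    """One linear layer plus a sigmoid; emits one pruning ratio per prunable layer."""
    ratios = []
    for w_row, b in zip(weights, biases):
        z = sum(w * x for w, x in zip(w_row, g)) + b
        # For moderate inputs the sigmoid keeps each ratio strictly between 0 and 1.
        ratios.append(1.0 / (1.0 + math.exp(-z)))
    return ratios

# Three hypothetical graph nodes with two features each; two prunable layers.
nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
g = graph_encoder(nodes)
ratios = mlp_pruning_ratios(g, weights=[[0.0, 0.0], [1.0, -1.0]], biases=[0.0, 0.0])
```

The output list has one entry per prunable layer, matching the action space A described in the quoted passage.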
Regarding claim 3, Yu in view of Li and Coelho teach the method of claim 2.
Yu further teaches: further training the graph representation neural network by using the pruning ratio and the graph representations of the layers in the group as a new training sample.
(Yu, page: 3, “To prune a given DNN, the user provides the model size constraint (i.e., FLOPs constraint). Although we perform FLOPs-constraint filter pruning, our method is not limited to FLOPs-constraint and can be easily extended to latency, MACs, or sparsity constraint compression. Figure 1 illustrates the DNN-Graph search episode, which is essentially a model compression iteration. Red arrows show that the process starts from the original DNN. The model size evaluator first evaluates the size of the DNN. If the size is not satisfied, the graph generator converts the DNN into a hierarchical computational graph. Then, the GNN-RL agent leverages m-GNN to learn pruning ratios from the graph. The pruner prunes the DNN with the pruning ratios [further training the graph representation neural network by using the pruning ratio] (i.e.: the learned pruning ratio is used to influence future training iterations) and begins the next iteration from the compressed DNN. Each step of the compression will change the network topology. Thus, the DNN-Graph environment reconstructs a new hierarchical computational graph [and the graph representations of the layers in the group as a new training sample] (i.e.: since pruning alters the network structure, a new graph representation is generated after each pruning step, meaning the GNN-RL agent continuously receives updated graph representations as training samples) for the GNN-RL agent corresponding to the current compression state. Once the compressed DNN satisfies the size constraint, the evaluator will end the episode, and the accuracy evaluator will assess the pruned DNN’s accuracy as an episode reward for the GNN-RL agent.”)
Claim(s) 13 and 23 recite limitations analogous to those of claim 3 and are therefore rejected under a similar rationale.
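For illustration only, the search episode quoted from Yu (evaluate the model size, rebuild the graph, obtain pruning ratios, prune, and repeat until the size constraint is met) can be sketched as a loop. This is a sketch under assumptions: the total filter count stands in for the FLOPs constraint, a fixed 25% policy stands in for the GNN-RL agent, and `build_graph`, `constant_policy`, and `compression_episode` are hypothetical names.

```python
def build_graph(layer_filters):
    """Hypothetical graph generator: one (layer index, filter count) node per layer."""
    return [(i, n) for i, n in enumerate(layer_filters)]

def constant_policy(graph):
    """Hypothetical stand-in for the agent: prune 25% of every layer."""
    return [0.25 for _ in graph]

def compression_episode(layer_filters, size_limit, choose_ratios, max_steps=100):
    """Prune iteratively until the total filter count satisfies size_limit."""
    history = []
    steps = 0
    while sum(layer_filters) > size_limit and steps < max_steps:
        # The graph is rebuilt from the current (already compressed) network.
        graph = build_graph(layer_filters)
        ratios = choose_ratios(graph)
        # Each layer keeps at least one filter.
        layer_filters = [max(1, n - int(n * r)) for n, r in zip(layer_filters, ratios)]
        history.append((graph, ratios))
        steps += 1
    return layer_filters, history

final_filters, history = compression_episode([64, 128, 256], 200, constant_policy)
```

Each entry of `history` pairs a graph with the ratios chosen for it, which is the kind of per-iteration record the quoted passage describes the agent as receiving.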
Regarding claim 6, Yu in view of Li and Coelho teach the method of claim 1.
Yu further teaches: wherein generating the sequence of graph representations comprises: for each respective layer of the plurality of layers: determining one or more attributes of the respective layer; and generating a graph representation in the sequence based on the one or more attributes.
(Yu, page: 3, “Formally, we model the DNN as an l-layer hierarchical computational graph, such that at the $l$th layer (the top layer) we would have the hierarchical computational graph set $\mathcal{G}^l = \{G^l\}$, where each item is a computational graph $G^l = (V^l, E^l, \mathcal{G}^{l-1})$. $V^l$ is the graph nodes corresponding to hidden states. $E^l$ is the set of directed edges with a specific edge type associated with the operations. Lastly, $\mathcal{G}^{l-1} = \{G^{l-1}_0, G^{l-1}_1, ...\}$ is the computational graph set at the $(l-1)$-layer and the operation set at layer l [for each respective layer of the plurality of layers]. Within the first layer, we manually choose commonly used machine learning operations as the primitive operations for $\mathcal{G}^0$. As an example, Figure 2 illustrates the idea behind generating hierarchical computational graphs using a sample graph G, where the edges are operations and the nodes are hidden states. In the input graph, we choose three primitive operations $\mathcal{G}^0$ = {1×1 conv, 3×3 conv, 3×3 max-pooling} corresponding to the three edge types [determining one or more attributes of the respective layer;] (i.e.: attribute, convolution kernel size of the graph). Then, we extract the repetitive subgraphs (i.e., $G^1_1$, $G^1_2$ and $G^1_3$) [generating a graph representation in the sequence based on the one or more attributes], each denoting a compound operation, and decompose the graph G into two hierarchical levels, as shown in Figure 2 (b) and (c). The level-1 computational graphs are motifs that correspond to the edges within the level-2 computational graph.”).
(Yu, fig. 2(a – b), annotated: [generating a sequence of graph representations based on attributes of the plurality of layers])
Claim 16 recites limitations analogous to those of claim 6 and is therefore rejected under a similar rationale.
Regarding claim 7, Yu in view of Li and Coelho teach the method of claim 6.
Yu further teaches: wherein the one or more attributes are selected from a group consisting of size of input data, size of output data, size of kernel, and some combination thereof.
(Yu, page: 3, “Formally, we model the DNN as an l-layer hierarchical computational graph, such that at the $l$th layer (the top layer) we would have the hierarchical computational graph set $\mathcal{G}^l = \{G^l\}$, where each item is a computational graph $G^l = (V^l, E^l, \mathcal{G}^{l-1})$. $V^l$ is the graph nodes corresponding to hidden states. $E^l$ is the set of directed edges with a specific edge type associated with the operations. Lastly, $\mathcal{G}^{l-1} = \{G^{l-1}_0, G^{l-1}_1, ...\}$ is the computational graph set at the $(l-1)$-layer [size of input data, size of output data] and the operation set at layer l. Within the first layer, we manually choose commonly used machine learning operations as the primitive operations for $\mathcal{G}^0$. As an example, Figure 2 illustrates the idea behind generating hierarchical computational graphs using a sample graph G, where the edges are operations and the nodes are hidden states. In the input graph, we choose three primitive operations $\mathcal{G}^0$ = {1×1 conv, 3×3 conv, 3×3 max-pooling} corresponding to the three edge types [wherein the one or more attributes are selected from a group consisting of size of input data]. Then, we extract the repetitive subgraphs (i.e., $G^1_1$, $G^1_2$ and $G^1_3$), each denoting a compound operation, and decompose the graph G into two hierarchical levels, as shown in Figure 2 (b) and (c). The level-1 computational graphs are motifs that correspond to the edges within the level-2 computational graph.”).
Claim 17 recites limitations analogous to those of claim 7 and is therefore rejected under a similar rationale.
Regarding claim 11, Yu teaches: One or more non-transitory computer-readable media storing instructions executable to perform operations for compressing a deep neural network (DNN), the operations comprising:
(Yu, page: 8, “4.3. Inference acceleration and memory saving: The inference and memory usage [One or more non-transitory computer-readable media storing instructions executable to perform operations] of compressed DNNs [for compressing a deep neural network (DNN)] are essential metrics to determine the possibility of DNN deployment on a given platform. Thus, we evaluated the pruned models’ inference latency using PyTorch 1.7.1 on an Nvidia GTX 1080Ti GPU and recorded the GPU memory usages. The ResNet-110/56/44/32/20 are measured on the CIFAR-10 test set with batch size 32. The VGG-16 is evaluated on the ImageNet test set with batch size 32. Lastly, MobileNet-v1/v2 and ShuffleNet-v1/v2 are measured on the CIFAR-100 with batch size 32.”)
The remaining limitations are analogous to those of claim 1 and are therefore rejected under a similar rationale.
Regarding claim 21, Yu teaches: An apparatus for compressing a deep neural network (DNN), the apparatus comprising: a computer processor for executing computer program instructions; and a non-transitory computer-readable memory storing computer program instructions executable by the computer processor to perform operations comprising:
(Yu, page: 8, “4.3. Inference acceleration and memory saving: The inference and memory usage [a non-transitory computer-readable memory storing computer program instructions executable by the computer processor to perform operations] of compressed DNNs [An apparatus for compressing a deep neural network (DNN)] are essential metrics to determine the possibility of DNN deployment on a given platform. Thus, we evaluated the pruned models’ inference latency using PyTorch 1.7.1 on an Nvidia GTX 1080Ti GPU [the apparatus comprising: a computer processor for executing computer program instructions] and recorded the GPU memory usages. The ResNet-110/56/44/32/20 are measured on the CIFAR-10 test set with batch size 32. The VGG-16 is evaluated on the ImageNet test set with batch size 32. Lastly, MobileNet-v1/v2 and ShuffleNet-v1/v2 are measured on the CIFAR-100 with batch size 32.”)
The remaining limitations are analogous to those of claim 1 and are therefore rejected under a similar rationale.
Claim(s) 4 – 5 and 14 – 15 are rejected under 35 U.S.C. 103 as being unpatentable over Yu in view of Li and Coelho and in further view of Asad et al., Pub. No.: US20220253709A1, (hereafter Asad).
Regarding claim 4, Yu in view of Li and Coelho teach the method of claim 3.
Yu further teaches: wherein further training the graph representation neural network comprises: determining whether an accuracy of the updated DNN is higher than a target accuracy;
(Yu, page: 3, “Each step of the compression will lead to DNN’s topology change. Thus, the DNN-Graph environment reconstructs a new hierarchical computational graph for the GNN-RL agent corresponding to the current compression state. Once the compressed DNN satisfies [is higher than a target accuracy] the size constraint, the evaluator will end the episode, and the accuracy [determining whether an accuracy of the updated DNN] evaluator will assess the pruned DNN’s accuracy as an episode reward for the GNN-RL agent.”)
Yu in view of Li and Coelho do not teach:
in response to determining that the accuracy of the updated DNN is higher than the target accuracy, using the pruning ratio and the graph representations of the layers in the group as a positive training sample.
Asad teaches:
in response to determining that the accuracy of the updated DNN is higher than the target accuracy, using the pruning ratio and the graph representations of the layers in the group as a positive training sample.
(Asad, “[0141] Pruner logic 402 a shown in FIG. 7a includes quantile logic 706, which is configured to determine a threshold in dependence on the sparsity parameter, sj σ, and the set of coefficients comprising absolute coefficient values. For example, the sparsity parameter may indicate a percentage of sparsity to be applied to a set of coefficients—e.g. 40% [using the pruning ratio] (i.e.: the quantile logic 706 determines a threshold value for pruning based on a sparsity parameter (e.g., 40%)). In this example, quantile logic 706 would determine a threshold value, below which 40% of the absolute coefficient values exist. In this example, the quantile logic can be described as using a non-differentiable quantile methodology. That is, the quantile logic 706 shown in FIG. 7a does not attempt to model the set of coefficients using a function, but rather empirically sorts the absolute coefficient values (e.g. in ascending or descending order) and sets the threshold at the appropriate value. For example, quantile logic 706 may determine a threshold τ in accordance with Equation (1).
τ=Quantile(abs(w j),s j σ) (1)
[0142] Pruner logic 402 a comprises subtraction logic 708, which is configured to subtract the threshold value determined by quantile logic 706 from each of the determined absolute coefficient values. In FIGS. 7a to d, the “minus” symbol on one of the inputs to subtraction logic (e.g. subtraction logic 708 in FIG. 7a ) is used to show that that input is being subtracted from the other input, labelled with a “plus” symbol. As a result, any of the absolute coefficient values having a value less than the threshold value will be represented by a negative number, whilst any of the absolute coefficient values having a value greater than the threshold value [and in response to determining that the accuracy of the updated DNN is higher than the target accuracy] will be represented by a positive number [and the graph representations of the layers in the group as a positive training sample] (i.e.: Subtraction logic 708 then compares each coefficient’s absolute value against the threshold, marking those coefficients as either positive (important) or negative (to be pruned). Positive values represent coefficients that should be retained). In this way, pruner logic 402 a has identified the least salient coefficients (e.g. the coefficients of least importance to the set of coefficients). In this example, the least salient coefficients are those having an absolute value below the threshold value. In other words, the pruner logic has identified the required percentage of coefficients in the input set of coefficients, wj, having a value closest to zero.”)
Asad, Yu, Li and Coelho are related to the same field of endeavor (i.e.: neural network compression). It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to combine the teachings of Asad with the teachings of Yu, Li and Coelho to enhance computational efficiency and memory usage while maintaining structured sparsity, which can lead to faster inference and reduced storage requirements without significantly impacting model performance (Asad, ¶[0006], [0009]).
Claim 14 recites limitations analogous to those of claim 4 and is therefore rejected under a similar rationale.
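For illustration only, the quantile and subtraction logic quoted from Asad ¶[0141] and ¶[0142] can be sketched as follows. This is a sketch under assumptions, not Asad's hardware logic: the function names and sample coefficients are hypothetical, the threshold is taken as the sorted absolute value at the index given by the sparsity fraction, and a coefficient exactly equal to the threshold is assumed to be retained (the quoted text addresses only values less than or greater than the threshold).

```python
def quantile_threshold(coeffs, sparsity):
    """Sort-based threshold below which the `sparsity` fraction of |coeffs| fall."""
    mags = sorted(abs(c) for c in coeffs)
    k = int(len(mags) * sparsity)  # number of coefficients to prune
    return mags[k] if k < len(mags) else float("inf")

def prune_by_threshold(coeffs, sparsity):
    """Zero the coefficients whose |value| minus the threshold is negative."""
    tau = quantile_threshold(coeffs, sparsity)
    # A negative difference marks a least-salient coefficient.
    return [c if abs(c) - tau >= 0 else 0.0 for c in coeffs]

# Five hypothetical coefficients with a 40% sparsity parameter.
coeffs = [0.5, -0.1, 0.3, -0.9, 0.05]
tau = quantile_threshold(coeffs, 0.4)
pruned = prune_by_threshold(coeffs, 0.4)
```

With five coefficients and a 40% sparsity parameter, the two coefficients closest to zero are the ones set to zero.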
Regarding claim 5, Yu in view of Li, Coelho and Asad teach the method of claim 4.
Asad further teaches: wherein further training the graph representation neural network further comprises: in response to determining that the accuracy of the updated DNN is lower than the target accuracy, using the pruning ratio and the graph representations of the layers in the group as a negative training sample.
(Asad, “[0141] Pruner logic 402 a shown in FIG. 7a includes quantile logic 706, which is configured to determine a threshold in dependence on the sparsity parameter, sj σ, and the set of coefficients comprising absolute coefficient values. For example, the sparsity parameter may indicate a percentage of sparsity to be applied to a set of coefficients—e.g. 40% [using the pruning ratio] (i.e.: the quantile logic 706 determines a threshold value for pruning based on a sparsity parameter (e.g., 40%)). In this example, quantile logic 706 would determine a threshold value, below which 40% of the absolute coefficient values exist. In this example, the quantile logic can be described as using a non-differentiable quantile methodology. That is, the quantile logic 706 shown in FIG. 7a does not attempt to model the set of coefficients using a function, but rather empirically sorts the absolute coefficient values (e.g. in ascending or descending order) and sets the threshold at the appropriate value. For example, quantile logic 706 may determine a threshold τ in accordance with Equation (1).
τ=Quantile(abs(w j),s j σ) (1)
[0142] Pruner logic 402 a comprises subtraction logic 708, which is configured to subtract the threshold value determined by quantile logic 706 from each of the determined absolute coefficient values. In FIGS. 7a to d, the “minus” symbol on one of the inputs to subtraction logic (e.g. subtraction logic 708 in FIG. 7a ) is used to show that that input is being subtracted from the other input, labelled with a “plus” symbol. As a result, any of the absolute coefficient values having a value less than the threshold value [in response to determining that the accuracy of the updated DNN is lower than the target accuracy] will be represented by a negative number [and the graph representations of the layers in the group as a negative training sample] (i.e.: Subtraction logic 708 then compares each coefficient’s absolute value against the threshold, marking those coefficients as either positive (important) or negative (to be pruned). negative values represent coefficients that should be pruned), whilst any of the absolute coefficient values having a value greater than the threshold value will be represented by a positive number. In this way, pruner logic 402 a has identified the least salient coefficients (e.g. the coefficients of least importance to the set of coefficients). In this example, the least salient coefficients are those having an absolute value below the threshold value. In other words, the pruner logic has identified the required percentage of coefficients in the input set of coefficients, wj, having a value closest to zero.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Asad with the teachings of Yu, Li and Coelho for the same reasons disclosed for claim 4.
Claim 15 recites limitations analogous to those of claim 5 and is therefore rejected under a similar rationale.
Claim(s) 8, 18 and 24 are rejected under 35 U.S.C. 103 as being unpatentable over Yu in view of Li and Coelho and in further view of Li Y. et al., Pub. No.: US20200279151A1, (hereafter Li Y).
Regarding claim 8, Yu in view of Li and Coelho teach the method of claim 6.
Yu in view of Li and Coelho do not teach:
wherein the DNN further comprises activations configured to apply activation functions on outputs of some of the plurality of layers; and generating the sequence of graph representations further comprises: for each respective activation of the activations: determining one or more attributes of the respective activation, and generating a graph representation in the sequence based on the one or more attributes.
Li Y. teaches:
wherein the DNN further comprises activations configured to apply activation functions on outputs of some of the plurality of layers;
(Li Y., “[0122] where $f_{ae}$ is an MLP having a linear activation function the output [on outputs of some of the plurality of layers] of which is fed into a layer having a sigmoid activation function [activations configured to apply activation functions] denoted by σ, $h_G$ is a graph state vector and $h_v^{(T)}$ is an updated node state vector of a candidate node v 107 after T rounds of information propagation. In this implementation, the output of the edge addition neural network 102 is a probability of adding an edge connected to the candidate node 107 with edge typing handled by the node selection neural network 103. However, it will be appreciated that edge typing may be handled by the edge addition neural network 102 in similar manner described above with respect to node types.”)
generating the sequence of graph representations further comprises:
(Li Y., “[0130] The generative graph model and in particular the probability distributions represented by the one or more neural networks defines a joint distribution p(G,π) over graphs G and node and edge orderings π. As will be appreciated from the above, generating a graph using the system 100 generates both a graph and a particular ordering of nodes and edges based upon the graph construction sequence [and generating the sequence of graph representations]. For training of the one or more neural networks, optimizing the logarithm of marginal likelihood $p(G) = \sum_{\pi \in P(G)} p(G, \pi)$ may be used. However, optimization of log p(G) may be intractable for large graphs.”)
for each respective activation of the activations: determining one or more attributes of the respective activation, and
(Li Y., “[0119] where $f_{an}$ is an MLP having a linear activation function [for each respective activation of the activations] the output of which is fed into a layer having a sigmoid activation function [determining one or more attributes of the respective activation] (i.e.: attribute of the activation function in this case is linear for the MLP layers and sigmoid or softmax for the output layer) denoted by σ, and $h_G$ is a graph state vector. Where nodes may be one of K types, the output of $f_{an}$ may be a K+1 dimensional vector representing a score for adding a node of each type and also a score for not adding a node. The sigmoid layer above may be replaced with a softmax layer to convert the scores to probabilities as shown below:”)
generating a graph representation in the sequence based on the one or more attributes.
(Li Y., “[0113] The aggregation to generate a graph [generating a graph representation in the sequence] state vector may be based upon a gated sum. For example, a gating neural network comprising a final sigmoid activation function layer [based on the one or more attributes] and one or more lower layers may be used to process a node state vector (irrespective of dimensionality) to obtain a set of gating weights associated with of the node state vector. The set of gating weights may comprise a weight for each element of the node state vector. The aggregation may be based upon a sum of the node state vectors having their corresponding gating weights applied.”)
Li Y., Yu, Li and Coelho are related to the same field of endeavor (i.e.: neural network compression). It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to combine the teachings of Li Y. with the teachings of Yu, Li and Coelho to allow for automated, data-driven solutions to diverse technical challenges, improving efficiency, innovation, and problem-solving across multiple domains (Li Y., Abstract).
Claim(s) 18 and 24 recite limitations analogous to those of claim 8 and are therefore rejected under a similar rationale.
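For illustration only, the gated-sum aggregation quoted from Li Y. ¶[0113], in which a sigmoid gating layer produces one weight per element of each node state vector and the gated vectors are summed into a graph state vector, can be sketched as below. This is a sketch under assumptions: a single linear layer stands in for the gating neural network, and the name `gated_sum`, the parameter layout, and the sample values are hypothetical.

```python
import math

def gated_sum(node_states, gate_weights, gate_bias):
    """Aggregate node state vectors into one graph state vector via a gated sum."""
    dim = len(node_states[0])
    h_graph = [0.0] * dim
    for h_v in node_states:
        for d in range(dim):
            # One gate per element: sigmoid of a linear function of the node state.
            z = sum(gate_weights[d][k] * h_v[k] for k in range(dim)) + gate_bias[d]
            gate = 1.0 / (1.0 + math.exp(-z))
            h_graph[d] += gate * h_v[d]
    return h_graph

# Two hypothetical node states; zero gating parameters make every gate equal 0.5.
states = [[1.0, 2.0], [3.0, 4.0]]
h_graph = gated_sum(states, gate_weights=[[0.0, 0.0], [0.0, 0.0]], gate_bias=[0.0, 0.0])
```

With all gating parameters at zero, every gate is 0.5 and the result is half of the element-wise sum of the node states.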
Claim(s) 9 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Yu in view of Li and Coelho and in further view of Ranjan, et al., "Asap: Adaptive structure aware pooling for learning hierarchical graph representations.", (hereafter Ranjan) and He, et al., "An effective classifier based on convolutional neural network and regularized extreme learning machine.", (hereafter He).
Regarding claim 9, Yu in view of Li and Coelho teach the method of claim 1.
Yu in view of Li and Coelho do not teach:
wherein clustering the plurality of layers into groups based on the sequence of graph representations comprises: inputting the graphs of the layers and an evaluation metric into a graph pooling model, the graph pooling model outputting the groups,
wherein the evaluation metric comprises a target accuracy of the updated DNN.
Ranjan teaches:
wherein clustering the plurality of layers into groups based on the sequence of graph representations comprises: inputting the graphs of the layers and an evaluation metric into a graph pooling model, the graph pooling model outputting the groups,
(Ranjan, fig. 1, annotated: [inputting the graphs of the layers] [the graph pooling model outputting the groups])
(Ranjan, page: 5476, “7.5 Effect of computing Soft edge weights We evaluate [an evaluation metric] the importance of calculating edge weights for the pooled graph as defined in Eq. 10. We use the best model configuration as found from above ablation analysis and then add the feature of computing soft edge weights for clusters. We observe a significant drop in performance when the edge weights are not computed. This proves the necessity of capturing the edge information while pooling graphs [into a graph pooling model].”)
Ranjan, Yu, Li and Coelho are related to the same field of endeavor (i.e.: neural network compression). It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to combine the teachings of Ranjan with the teachings of Yu, Li and Coelho by leveraging a self-attention network and a modified GNN formulation, which improves node importance estimation and enables sparse, differentiable pooling, leading to more efficient and expressive graph representations (Ranjan, Abstract).
Yu in view of Li, Coelho and Ranjan do not teach:
wherein the evaluation metric comprises a target accuracy of the updated DNN.
He teaches:
wherein the evaluation metric comprises a target accuracy of the updated DNN.
(He, “An effective classifier CNN-RELM is proposed in this paper. Firstly, the CNN-RELM trains the convolutional neural network using the gradient descent method until the learning target accuracy reaches [wherein the evaluation metric comprises a target accuracy of the updated DNN]. Then the fully connected layer of CNN is replaced by RELM optimized by genetic algorithm and the rest layers of the CNN remain unchanged. A series of experiments conducted on ORL and NUST databases show that the CNN-RELM outperforms CNN and RELM in classification and demonstrate the efficiency and accuracy of the proposed CNN-RELM model. Meanwhile, we also verify that the selection of different pooling methods has an impact on the performance of CNN-RELM. When the same number of training samples is selected, the pooling strategy has the highest recognition rate. In practical applications, the selection of appropriate pooling methods according to the actual situation of data is conducive to achieving better application results. Due to the uniting of CNN and RELM, CNN-RELM have the advantages of CNN and RELM and it is easier to learn and faster in testing. The future work includes improve the generalized ability and further reduce the training time.”)
He, Yu, Li, Coelho and Ranjan are related to the same field of endeavor (i.e.: neural network compression). It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to combine the teachings of He with the teachings of Yu, Li, Coelho and Ranjan to enhance efficiency in training and testing while maintaining strong feature extraction and generalization capabilities (He, Abstract).
Claim 19 recites limitations analogous to those of claim 9 and is therefore rejected under a similar rationale.
Claim(s) 10, 20 and 25 are rejected under 35 U.S.C. 103 as being unpatentable over Yu in view of Li and Coelho and in further view of GUAN, et al., Pub. No.: WO2021195643A1, (hereafter GUAN).
Regarding claim 10, Yu in view of Li and Coelho teach the method of claim 1.
Li further teaches: wherein pruning the filters of the layers in the group based on the pruning ratio comprises: ranking the filters based on magnitudes of weights in the filters;
(Li, fig. 2(a), annotated: [ranking the filters])
(Li, page: 4, “Figure 2: (a) Sorting filters by absolute weights sum for each layer of VGG-16 on CIFAR-10. The x-axis is the filter index divided by the total number of filters. The y-axis is the filter weight sum [based on magnitudes of weights in the filters;] divided by the max sum value among filters in that layer. (b) Pruning filters with the lowest absolute weights sum and their corresponding test accuracies on CIFAR-10. (c) Prune and retrain for each single layer of VGG-16 on CIFAR-10. Some layers are sensitive and it can be harder to recover accuracy after pruning them”)
changing magnitudes of weights in the one or more filters to zero.
(Li, page: 4, “Relationship to group-sparse regularization on filters Recent work (Zhou et al. (2016); Wen et al. (2016)) apply group-sparse regularization (Pni j=1 kFi,jk2 or `2,1-norm) on convolutional filters, which also favor to zero-out filters with small l2-norms, i.e. Fi,j = 0. In practice, we do not observe noticeable difference between the `2-norm and the `1-norm for filter selection, as the important filters tend to have large values for both measures (Appendix 6.1). Zeroing out weights of multiple filters [changing magnitudes of weights in the one or more filters to zero] during training has a similar effect to pruning filters with the strategy of iterative pruning and retraining as introduced in Section 3.4”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Li with the teachings of Yu and Coelho for the same reasons disclosed for claim 1.
Yu in view of Li and Coelho does not teach:
selecting one or more filters from the filters based on the ranking and the pruning ratio;
GUAN teaches:
selecting one or more filters from the filters based on the ranking and the pruning ratio;
(GUAN, “[0060] Specifically, for each pruned NN model 910, a respective distinct set of importance coefficients (e.g., those in the pruning setting 952) [and the pruning ratio] are assigned to each of the plurality of layers 906 in the NN model 902. For example, a first layer 906A is assigned with a first set of importance coefficients (e.g., a1 and b1), and a second layer 906B is assigned with a second set of importance coefficients (e.g., a2 and b2). A third layer 906C is assigned with a third set of importance coefficients (e.g., a3 and b3), and a fourth layer 906D is assigned with a fourth set of importance coefficients (e.g., a4 and b4). An importance score I is determined for each filter 908 based on the respective subset of importance coefficients of a respective layer 906 to which the respective filter 908 belongs. For example, the filter 908A is included in the third layer 906C, and an importance score I is determined based on the third subset of importance coefficients a3 and b3 assigned to the third layer 906C. The filters 908 of the entire NN model 902 are ranked based on the importance score I of each filter 908. In accordance with ranking of the filters 908, a respective subset of filters 908 are removed based on their importance score I, thereby allowing the NN model 902 to be pruned to the respective pruned NN model 910. Specifically, each of the plurality of pruned NN models 910 has a pruned number of filters 908 satisfying a predefined difference value, percentage, or FLOPS number, and the pruned number of top-ranked filters 908 [based on the ranking] are selected [selecting one or more filters from the filters] based on the importance score I of each filter 908 to generate the respective pruned NN model 910.”)
GUAN, Yu, Li and Coelho are related to the same field of endeavor (i.e., neural network compression). It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to combine the teaching of GUAN with the teachings of Yu, Li and Coelho in order to improve model stability, maintain accuracy, and optimize computational efficiency, making the combined method well-suited for deployment on resource-constrained systems (GUAN, Abstract).
Claims 20 and 25 recite limitations analogous to those of claim 10 and are therefore rejected under a similar rationale.
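For illustration only, the three steps addressed in the rejection of claim 10 (ranking filters by weight magnitude, selecting filters based on the ranking and the pruning ratio, and changing the selected weights to zero) can be sketched as follows. This sketch is not taken from Yu, Li, Coelho, GUAN, or the application; the function names, the flattened-list representation of a filter, and the choice to round the number of pruned filters down are all assumptions made for the example.

```python
from typing import List

Filter = List[float]  # hypothetical representation: one filter as a flat list of weights


def l1_norm(f: Filter) -> float:
    """Sum of absolute weights, used here as the magnitude-based ranking criterion."""
    return sum(abs(w) for w in f)


def prune_filters(filters: List[Filter], pruning_ratio: float) -> List[int]:
    """Zero out the lowest-magnitude filters in place and return their indices."""
    if not 0.0 <= pruning_ratio <= 1.0:
        raise ValueError("pruning_ratio must be in [0, 1]")
    # Step 1: rank filter indices by ascending L1 norm.
    ranking = sorted(range(len(filters)), key=lambda i: l1_norm(filters[i]))
    # Step 2: select filters from the bottom of the ranking; the count follows
    # from the pruning ratio (rounded down, an assumption of this sketch).
    num_to_prune = int(len(filters) * pruning_ratio)
    selected = ranking[:num_to_prune]
    # Step 3: change the magnitudes of the selected filters' weights to zero.
    for i in selected:
        filters[i] = [0.0] * len(filters[i])
    return sorted(selected)


# Four toy filters with L1 norms of roughly 1.0, 0.1, 3.0, and 0.3.
filters = [[0.5, -0.5], [0.1, 0.0], [-2.0, 1.0], [0.2, -0.1]]
pruned = prune_filters(filters, pruning_ratio=0.5)
print(pruned)   # indices of the zeroed filters
print(filters)  # the filter list after pruning
```

With a pruning ratio of 0.5, the two filters with the smallest weight sums (indices 1 and 3) are zeroed, while the two largest are left unchanged.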
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
He, et al. "Deep residual learning for image recognition", 2016.
This paper addresses the degradation problem by introducing a deep residual learning framework. Instead of hoping that each few stacked layers directly fit a desired underlying mapping, the authors explicitly let these layers fit a residual mapping.
He, Kaiming, and Jian Sun. "Convolutional neural networks at constrained time cost." (2015).
The paper describes the accuracy of CNNs under constrained time cost. Under this constraint, the design of the network architecture should exhibit trade-offs among factors such as depth, number of filters, and filter sizes.
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner
should be directed to MATIYAS T MARU whose telephone number is (571)270-0902. The examiner
can normally be reached Monday - Friday (8:00am - 4:00pm) EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a
USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to
use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor,
Michelle Bechtold can be reached on (571)431-0762. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from
Patent Center. Unpublished application information in Patent Center is available to registered users.
To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit
https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and
https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional
questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like
assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA)
or 571-272-1000.
/M.T.M./ Examiner, Art Unit 2148
/MICHELLE T BECHTOLD/ Supervisory Patent Examiner, Art Unit 2148