Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Examiner notes the entry of the following papers:
Amended claims filed 11/20/2025.
Applicant’s arguments/remarks filed 11/20/2025.
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 11/20/2025 has been entered.
Claims 1, 11, and 20 are amended. Claims 4 and 14 are canceled. Claims 1-3, 5-13, and 15-20 are presented for examination.
Response to Arguments
Applicant’s arguments have been fully considered. Each is addressed below.
Applicant remarks “Accordingly, Applicant requests withdrawal of the rejection under 35 U.S.C. § 101.” (Remarks, page 7, paragraph 7, line 1.) Examiner is persuaded by the argument that the claimed invention cannot be practically performed by a human mind. The rejections under 35 U.S.C. § 101 are withdrawn.
Applicant argues that Chen’s “g’” variable is not a “layer-specific input.” (Remarks, page 10, paragraph 2, line 1.) Examiner notes that the specification recites “In particular, the gating functionality component takes as input simple statistics of the input to a layer (or any of the layers before or after it, including the output function).” (Specification, paragraph [0062], line 4.) (Underline added by examiner.) In other words, the input that is “layer-specific” is not a dataset input to the overall network, but is derived from the input dataset. In Chen, g is derived from an input dataset. (Chen, page 3, column 2, paragraph 2, line 1 “From the above definition we can see that, the gater network learns a function which maps input x to a binary gating vector g.”) Similarly, g_l is the filter vector for layer l. (Chen, page 3, column 2, paragraph 2, line 5 “Here g_l^i is the entry in g corresponding to the i-th filter at layer l, and 0 is a 2-D feature map with all its elements being 0.”) And, (Chen, page 3, column 2, paragraph 2, line 1 “Instead, the gater network processes the input to generate an input-dependent gating mask – a binary vector.”) In other words, layer-specific input is used to generate vector g, and g_l is the layer-specific filter vector. Therefore, the rejection is proper and maintained.
Applicant argues “The above fails to mention any ‘filter selection process’ or being activated or deactivated based on the identified relevance as asserted by the Office Action.” (Remarks, page 11, paragraph 2, line 1.) Examiner notes that the limitation in question recites “wherein, each of the two or more filters are selectively activated or deactivated based on the identified relevance;” (Claim 1, line 11.) Therefore, the limitation does not require a description of a selection process, but instead requires only that each of the two or more filters be selectively activated or deactivated. Examiner notes that Chen is used to teach “a method of activating gating within a current layer of a neural network that includes two or more filters.” (Chen, Figure 1, and page 1, column 2, paragraph 2, line 1. See mapping of claim 1.) For clarity, Chen also teaches deactivating filters. (Chen, page 4, column 2, paragraph 2, line 2 “Firstly, binary gates can completely deactivate some filters for each input, and hence those filters will not be influenced by the irrelevant inputs.”) Under the broadest reasonable interpretation, any reason or method for activating and deactivating filters is construed as “selectively activating and deactivating filters.” Since the claim does not require a description of the process, the rejection is proper and maintained.
Applicant argues that “Csurka falls short of the claimed features ‘each of the two or more filters are selectively activated or deactivated based on the identified relevance.’” (Remarks, page 11, paragraph 2, line 2.) As described above, activating and deactivating filters is mapped to Chen. The argument that Csurka does not teach “based on the identified relevance” is moot in view of the new grounds of rejection necessitated by amendment.
Applicant argues “While differing in scope, independent claims 11 and 20 have been amended to recite features that are similar to distinguishing features of claim 1 as discussed above. Therefore, it is respectfully submitted that claims 1, 11, and 20 are also in condition for allowance.” (Remarks, page 11, paragraph 5, line 1.) However, as discussed above, claim 1 remains rejected; claims 11 and 20, which recite similar features, likewise remain rejected. The dependent claims remain rejected at least by virtue of their dependence from rejected base claims.
Subject Matter Eligibility
In determining whether the claims are subject matter eligible, the examiner has considered and applied the 2019 USPTO Patent Eligibility Guidance, as well as the guidance in MPEP § 2106. The examiner finds that the independent claims do not recite a judicial exception.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-3, 5-6, 8-13, 15-16, and 18-20 are rejected under 35 U.S.C. § 103 as being unpatentable over Chen et al. (You Look Twice: GaterNet for Dynamic Filter Selection in CNNs, herein Chen), Veit et al. (Convolutional Networks with Adaptive Inference Graphs, herein Veit), Zhang (From CDF to PDF: A Density Estimation Method for High Dimensional Data, herein Zhang), and Alippi et al. (Moving Convolutional Neural Networks to Embedded Systems: the AlexNet and VGG-16 case, herein Alippi).
Regarding claim 1,
Chen teaches a method of activating gating within a current layer of a neural network that includes two or more filters (Chen, Figure 1, and, page 1, column 2, paragraph 2, line 1 “In this paper, we propose a novel framework called GaterNet for input-dependent dynamic filter selection in convolutional neural networks (CNNs), as shown in Figure 1.” And, page 2, column 1, paragraph 1, line 3 “The gating vector is then used to select the filters in the backbone-1 network (the main model in our framework), and only the selected filters in the backbone network participate in the prediction and learning.”
[Chen, Figure 1, reproduced as a greyscale image: overview of the GaterNet framework.]
In other words, the gating vector being used to select filters is activating gating, filters is two or more filters, and the novel framework called GaterNet for input-dependent dynamic filter selection in convolutional neural networks is a method of activating gating within a layer that includes two or more filters.) the method comprising:
receiving, by a processor in a computing device, a layer-specific input data that is specific to the current layer of the neural network (Chen, page 1, column 1, paragraph 2, line 5 “In machine learning, conditional computation [3] has been proposed to have a similar mechanism in deep learning models.” And, page 3, column 1, paragraph 3, line 3 “Given an input, the gater network decides the set of filters in the backbone network for use while the backbone network does the actual prediction.” Examiner notes that the specification recites “In particular, the gating functionality component takes as input simple statistics of the input to a layer (or any of the layers before or after it, including the output function).” (Specification, paragraph [0062], line 4.) (Underline added by examiner.) In other words, the input that is “layer-specific” is not a dataset input to the overall network, but is derived from the input dataset. In Chen, g is derived from the input dataset. (Chen, page 3, column 2, paragraph 2, line 1 “From the above definition we can see that, the gater network learns a function which maps input x to a binary gating vector g.”) Similarly, g_l is the filter vector for layer l. (Chen, page 3, column 2, paragraph 2, line 5 “Here g_l^i is the entry in g corresponding to the i-th filter at layer l, and 0 is a 2-D feature map with all its elements being 0.”) The specification of the instant application recites “The output or activation of a first layer of nodes becomes an input to a second layer of nodes, the activation of a second layer of nodes becomes an input to a third layer of nodes, and so on.” (Specification, paragraph [0002], line 5.) Therefore, Examiner is interpreting “layer-specific input data” as the data that is input to a layer. Examiner further notes that one of ordinary skill in the art would understand the primary reference as teaching at least one processor with at least one non-transitory computer-readable memory. “[I]n considering the disclosure of a reference, it is proper to take into account not only specific teachings of the reference but also the inferences which one skilled in the art would reasonably be expected to draw therefrom.” MPEP § 2144.01. In other words, computation is a processor in a computing device, input is receiving…input, g_l is layer-specific input, and CNNs having layers is a current layer in a neural network.);
generating statistics based on the received layer-specific input data (Chen, page 3, column 2, paragraph 2, line 1 “From the above definition we can see that, the gater network learns a function which maps input x to a binary gating vector g. With the help of g, we reformulate the computation of feature map O_l(x) in Equation (1) as below:

    O_l^i(x) = φ(W_l^i * I_l(x)) if g_l^i = 1;  O_l^i(x) = 0 if g_l^i = 0

Here g_l^i is the entry in g corresponding to the i-th filter at layer l, and 0 is a 2-D feature map with all its elements being 0.” And, page 4, column 1, paragraph 2, line 1 “So far, one important question still remains unanswered: how to generate binary gates g from g’ such that we can back-propagate the error through the discrete gates to the gater? In this paper, we adopt a method called Improved SemHash [17, 18]. During training, we first draw noise ε from a c-dimensional Gaussian distribution with mean 0 and standard deviation 1. The noise ε is added to g’ to get a noisy version of the vector:

    g_ε = g’ + ε

Two vectors are then computed from g_ε:

    g_α = 1(g_ε > 0),    g_β = σ’(g_ε)

where σ’ is the saturating sigmoid function [19, 16]:

    σ’(x) = max(0, min(1, 1.2·σ(x) − 0.1))

with σ being the sigmoid function.” In other words, vector g is derived from the input, g_l is derived from layer-specific input, the c-dimensional Gaussian distribution is statistics, and the noise added to vector g’ is statistics based on input data to a specific layer.); using the generated statistics to
[assign a relevance score to each of the two or more filters, wherein each assigned relevance score indicates a relevance of a corresponding filter to the received layer-specific input data];
determining an activation status of each of the two or more filters in the current layer (Chen, page 4, column 2, paragraph 2, line 1 “We use binary gates other than attention [31] or other real-valued gates for two reasons. Firstly, binary gates can completely deactivate some filters for each input, and hence those filters will not be influenced by the irrelevant inputs.” And, page 1, column 1, paragraph 1, line 15 “…a global gater network is introduced to generate binary gates for selectively activating filters in the backbone network based on each input.” In other words, selectively activating filters is determining an activation status, and activating filters and deactivating some filters is determining the activation status of the two or more filters in the current layer.)
[based on the identified relevance] by
[matching a distribution of an output of a gating functionality within the neural network using a cumulative distribution function (CDF) loss function], wherein
each of the two or more filters are selectively activated or deactivated (Chen, page 4, column 2, paragraph 2, line 1 “We use binary gates other than attention [31] or other real-valued gates for two reasons. Firstly, binary gates can completely deactivate some filters for each input, and hence those filters will not be influenced by the irrelevant inputs.” And, page 1, column 1, paragraph 1, line 15 “…a global gater network is introduced to generate binary gates for selectively activating filters in the backbone network based on each input.” In other words, selectively is selectively, and activating some filters and deactivating some filters is each of the two or more filters being selectively activated or deactivated.)
[based on the identified relevance]; and
applying the received layer-specific input data to activated filters in the two or more filters to generate an output activation for the current layer of the neural network (Chen, page 3, column 1, paragraph 3, line 3 “Given an input, the gater network decides the set of filters in the backbone network for use while the backbone network does the actual prediction.” In other words, given an input is received input, set of filters is two or more filters, decides the set of filters for use in the backbone network is applying the received input to activated filters in the layer, and does the actual prediction is generate an activation.).
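For illustration only, the Improved SemHash gating quoted above may be sketched as follows. This is the examiner’s own sketch under stated assumptions, not code from Chen: the straight-through combination of g_α and g_β is a common simplification (Chen alternates between the two vectors during training), and all function and variable names are illustrative.

```python
# Illustrative sketch of Improved SemHash gating (examiner's construction).
import torch

def improved_semhash_gates(g_prime: torch.Tensor, training: bool = True) -> torch.Tensor:
    """Map real-valued gater outputs g' to binary gates g.

    During training, Gaussian noise is added to g'; a hard binarization (g_alpha)
    provides the forward value while a saturating sigmoid (g_beta) carries the
    gradient, so the error can back-propagate through the discrete gates.
    """
    g_eps = g_prime + torch.randn_like(g_prime) if training else g_prime
    g_alpha = (g_eps > 0).float()                                     # binary gates
    g_beta = torch.clamp(1.2 * torch.sigmoid(g_eps) - 0.1, 0.0, 1.0)  # saturating sigmoid
    # Straight-through: forward pass uses g_alpha, gradient flows via g_beta.
    return g_beta + (g_alpha - g_beta).detach()

# Applying the gates: deactivated filters contribute all-zero feature maps.
g_prime = torch.randn(8, 64)               # gater outputs, one entry per filter
feature_map = torch.randn(8, 64, 16, 16)   # N x C x H x W output of a layer
gated = feature_map * improved_semhash_gates(g_prime)[:, :, None, None]
```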
Thus far, Chen does not explicitly teach assign a relevance score to each of the two or more filters, wherein each assigned relevance score indicates a relevance of a corresponding filter to the received layer-specific input data.
Veit teaches assign a relevance score to each of the two or more filters, wherein each assigned relevance score indicates a relevance of a corresponding filter to the received layer-specific input data (Veit, Fig. 2, and page 4, paragraph 7, line 1 “For the gate to be effective, it needs to address a few key challenges. First, to estimate the relevance of its layer, the gate needs to understand its input features. To prevent mode collapse into trivial solutions that are independent of the input features, such as always or never executing a layer, we found it to be of key importance for the gate to be stochastic. We achieve this [by] adding noise to the estimated relevance. Second, the gate needs to make a discrete decision, while still providing gradients for the relevance estimation. We achieve this with the Gumbel-Max trick and its softmax relaxation. Third, the gate needs to operate with low computational cost. Figure 2 provides an overview of the two key components of the proposed gate. The first one efficiently estimates the relevance of the respective layer for the current image. The second component makes a discrete decision by sampling using Gumbel-Softmax [18, 24].”
[Veit, Figure 2, reproduced as a greyscale image: the gate’s relevance-estimation component and its Gumbel-Softmax decision component.]
In other words, from Fig. 2, estimate a relevance is assign a relevance score and deciding whether to execute the layer given the estimated relevance is each assigned relevance score indicates a relevance of a corresponding filter to the received layer-specific input data.)
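For illustration only, the structure of Veit’s gate (relevance estimation from channel statistics followed by a discrete Gumbel-Softmax decision) may be sketched as follows. This is the examiner’s own sketch, not Veit’s code; the hidden size and all names are illustrative.

```python
# Illustrative sketch of a relevance-estimating gate (examiner's construction).
import torch
import torch.nn.functional as F

class RelevanceGate(torch.nn.Module):
    def __init__(self, channels: int, hidden: int = 16):
        super().__init__()
        self.fc1 = torch.nn.Linear(channels, hidden)
        self.fc2 = torch.nn.Linear(hidden, 2)        # logits for [skip, execute]

    def forward(self, x: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
        stats = x.mean(dim=(2, 3))                   # channel-wise means (global average pooling)
        logits = self.fc2(F.relu(self.fc1(stats)))   # estimated relevance of the layer
        # Gumbel-Softmax: a stochastic yet differentiable discrete decision.
        decision = F.gumbel_softmax(logits, tau=tau, hard=True)
        return decision[:, 1]                        # 1.0 = execute, 0.0 = skip

gate = RelevanceGate(channels=64)
x = torch.randn(8, 64, 16, 16)
out = x * gate(x)[:, None, None, None]               # gated execution per input
```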
Both Chen and Veit are directed to speeding up inference in convolutional neural networks, among other things. Chen teaches using gates at the filter level to evaluate whether to activate filters in a layer, but does not explicitly teach determining whether filters are “relevant”. Veit teaches using gates at the layer level and Gumbel-Softmax to determine whether a layer is “relevant” and should be executed and to speed up backpropagation.
In view of the teaching of Chen, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Veit into Chen. This would result in using gates at the filter level to evaluate whether to activate filters in a layer and applying Gumbel-Softmax at the filter level to estimate whether filters are relevant, in order to further speed up inference.
One of ordinary skill in the art would be motivated to do this because the ever-increasing size of convolutional neural networks has caused a corresponding increase in inference time, thus creating a need for improving speed. One way of improving execution speed is through conditional execution. (Veit, page 1, paragraph 1, line 1 “Do convolutional networks really need a fixed feed-forward structure? What if, after identifying the high-level concept of an image, a network could move directly to a layer that can distinguish fine-grained differences?” And, paragraph 2, line 4 “To shed light on this, it is important to note that due to this success, ConvNets are used to classify increasingly large sets of visually diverse categories. Thus, most parameters model high-level features that, in contrast to low-level and many mid-level concepts, cannot be broadly shared across categories. As a result, the networks become larger and slower as the number of categories rises. Moreover, for any given input image the number of computed features focusing on unrelated concepts increases. What if, after identifying that an image contains a bird, a ConvNet could move directly to a layer that can distinguish different bird species, without executing intermediate layers that specialize in unrelated aspects?”)
Thus far, the combination of Chen and Veit does not explicitly teach matching a distribution of an output of a gating functionality within the neural network using a cumulative distribution function (CDF) loss function.
Zhang teaches matching a distribution of an output of a gating functionality within the neural network using a cumulative distribution function (CDF) loss function (Zhang, page 2, paragraph 5, line 1 “CDF2PDF is a method of PDF estimation by approximating CDF. The original idea of it was previously proposed in [1] called SIC (smooth interpolation of the cumulative). However, SIC requires additional hyper-parameter tuning, and no algorithms for computing higher order derivative from a trained NN are provided in [1]. CDF2PDF improves SIC by avoiding the time-consuming hyper-parameter tuning part and enabling higher order derivative computation to be done in polynomial time. Experiments of this method for one-dimensional data shows promising results.” And, page 3, paragraph 1, line 1 “SIC[1] uses a multilayer neural network which is trained to output the estimates of CDF. Given [equation image] be the true PDF, and [equation image] the corresponding CDF.” And, page 3, paragraph 4, line 1 “Once training is done and [equation image] well, the PDF can be directly inferred from [equation image].” In other words, estimating the PDF by approximating the CDF is matching the output distribution (PDF) using a cumulative distribution function (CDF).)
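For illustration only, the CDF2PDF idea quoted above (fit a network to the cumulative distribution, then differentiate to recover the density) may be sketched in one dimension as follows. This is the examiner’s own sketch, not Zhang’s code; the architecture and training schedule are illustrative.

```python
# Illustrative 1-D sketch of PDF estimation via CDF approximation (examiner's construction).
import torch

data = torch.randn(1000, 1)                    # samples from the unknown density
# Empirical ("actual") CDF at each sample: fraction of samples <= that sample.
ecdf = (data.T <= data).float().mean(dim=1, keepdim=True)

net = torch.nn.Sequential(torch.nn.Linear(1, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(2000):                          # regress the network onto the empirical CDF
    loss = ((net(data) - ecdf) ** 2).sum()     # square loss on the CDF
    opt.zero_grad()
    loss.backward()
    opt.step()

# The PDF estimate is the derivative of the learned CDF, obtained via autograd.
xs = torch.linspace(-3.0, 3.0, 101).unsqueeze(1).requires_grad_(True)
pdf = torch.autograd.grad(net(xs).sum(), xs)[0]
```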
Both Zhang and the combination of Chen and Veit are directed to speeding up inference in neural networks. The combination of Chen and Veit teaches a method of activating gating within a current layer of a neural network that includes two or more filters, but does not explicitly teach matching a distribution of an output of the gating functionality using a cumulative distribution function (CDF) loss function. Zhang teaches matching a distribution of an output of the gating functionality using a cumulative distribution function (CDF) loss function.
In view of the teaching of the combination of Chen and Veit, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Zhang into the combination of Chen and Veit. This would result in a method of activating gating within a current layer of a neural network that includes two or more filters, in which a distribution of an output of the gating functionality is matched using a cumulative distribution function (CDF) loss function.
One of ordinary skill in the art would be motivated to do this because higher dimensional data causes slower execution, and one way of improving execution speed is to use better methods of estimating distributions. (Zhang, page 2, paragraph 3, line 1 “Estimating probability density using neural networks (NN) was proposed a long time ago. These methods usually give satisfying performance for low dimensional data, and become ineffective or computationally-infeasible for very high dimensional data…CDF2PDF is a method of PDF estimation by approximating CDF. The original idea of it was previously proposed in [1] called SIC. However, SIC requires additional hyper-parameter tuning, and no algorithms for computing higher order derivative from a trained NN are provided in [1]. CDF2PDF improves SIC by avoiding the time-consuming hyper-parameter tuning part and enabling higher order derivative computation to be done in polynomial time.”)
Thus far, the combination of Chen, Veit, and Zhang does not explicitly teach based on the identified relevance.
Alippi teaches (each of two or more filters are selectively activated or deactivated – previously mapped to Chen, office action, page 9) based on the identified relevance (Alippi, page 218, column 1, paragraph 4 “Once a specific [symbol image] in Ψ has been selected by the embedded system designer, we can perform an additional approximation to further reduce computational load and memory occupation on the given embedded-system platform. More specifically, when the k-th layer of Φk,p,μΘ is a convolutional layer (but this can be easily extended to all the layers whose processing does not require inter-filtering operations, e.g., cross-filter normalization, since the last convolutional layer), we can apply a filter-selection procedure to identify those filters providing output features that are truly beneficial to support the classification by μΘ and discard the others (i.e., those with negligible or negative effects on the specific classification problem).” In other words, the filter-selection procedure is selecting filters, identifying those filters that are truly beneficial is identifying relevance, and selecting and discarding is activating and deactivating filters.)
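For illustration only, the filter-selection idea quoted above (keep filters whose output features are beneficial, discard the others) may be sketched as follows. This is the examiner’s own sketch, not Alippi’s procedure; the mean-absolute-activation score is an illustrative stand-in for Alippi’s benefit-to-classification criterion.

```python
# Illustrative sketch of relevance-based filter selection (examiner's construction).
import torch

def select_filters(feature_maps: torch.Tensor, scores: torch.Tensor, threshold):
    """feature_maps: N x C x H x W; scores: per-filter relevance, length C."""
    keep = (scores > threshold).float()              # filters identified as relevant
    return feature_maps * keep[None, :, None, None]  # discarded filters output zeros

fmap = torch.randn(4, 32, 8, 8)
scores = fmap.abs().mean(dim=(0, 2, 3))              # illustrative per-filter score
pruned = select_filters(fmap, scores, scores.median())
```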
Both Alippi and the combination of Chen, Veit, and Zhang are directed to conditional computation, among other things. The combination of Chen, Veit, and Zhang teaches a method of activating gating within a current layer of a neural network that includes two or more filters, that uses matching a distribution of an output of the gating functionality using a cumulative distribution function (CDF) loss function; but does not explicitly teach each of the two or more filters are selectively activated or deactivated based on the identified relevance. Alippi teaches each of the two or more filters are selectively activated or deactivated based on the identified relevance.
In view of the teaching of the combination of Chen, Veit, and Zhang, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Alippi into the combination of Chen, Veit, and Zhang. This would result in a method of activating gating within a current layer of a neural network that includes two or more filters, in which a distribution of an output of the gating functionality is matched using a cumulative distribution function (CDF) loss function and each of the two or more filters is selectively activated or deactivated based on the identified relevance.
One of ordinary skill in the art would be motivated to do this because large volumes of unlabeled data are costly to process, particularly for embedded systems, and alternative execution strategies are needed to speed up execution and reduce cost. (Alippi, page 212, column 1, paragraph 1, line 1 “Execution of deep learning solutions is mostly restricted to high performing computing platforms, e.g., those endowed with GPUs or FPGAs, due to the high demand on computation and memory such solutions require. Despite the fact that dedicated hardware is nowadays subject of research and effective solutions exist, we envision a future where deep learning solutions -here Convolutional Neural Networks (CNNs)- are mostly executed by low-cost off-the shelf embedded platforms already available in the market.”)
Regarding claim 2,
The combination of Chen, Veit, Zhang, and Alippi teaches the method of claim 1, wherein generating statistics based on the received layer-specific input data comprises
performing global average pooling on the received layer-specific input data to reduce dimensionality of the received layer-specific input data (Veit, page 5, paragraph 2, line 3 “Since operating on the full feature map is computationally expensive, we build upon recent studies [13, 17, 23] which show that much of the information in convolutional features is captured by the statistics of the different channels and their interdependencies. In particular, we only consider channel-wise means gathered by global average pooling. This compresses the input features into a 1 × 1 × C channel descriptor.” In other words, gathered by global average pooling is performing global average pooling, and compresses the features is to reduce the dimensionality.).
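For illustration only, the global average pooling step Veit describes may be sketched as follows (examiner’s sketch; tensor sizes are illustrative).

```python
# Illustrative: global average pooling compresses N x C x H x W features into
# an N x C channel descriptor (Veit's "1 x 1 x C" descriptor).
import torch
import torch.nn.functional as F

feature_map = torch.randn(8, 64, 16, 16)
descriptor = F.adaptive_avg_pool2d(feature_map, 1)  # N x C x 1 x 1
stats = descriptor.flatten(1)                       # N x C channel-wise means
```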
Regarding claim 3,
The combination of Chen, Veit, Zhang, and Alippi teaches the method of claim 1, wherein determining an activation status of each of the two or more filters in the current layer based on the identified relevance comprises
implementing the gating functionality in a fully-connected two-layer Multilayer Perceptron (MLP) within the neural network, wherein the MLP is smaller than the neural network (Chen, page 4, column 1, paragraph 1, line 1 “Fully-Connected Layers with Bottleneck As defined in Equation (2), the function D(f) needs to map the vector f of size h to a binary vector g of size c. We first consider using fully-connected layers to map f to a real-valued vector g’ of size c. If we use one single layer to project the vector, the projection matrix would be of size h x c. This can be very large when h is thousands and c is tens of thousands. To reduce the number of parameters in this projection, we use two fully-connected layers to fulfill the projection.” Examiner notes the specification recites “The neural network 100 illustrated in FIG. 1A includes fully-connected (FC) layers, which are also sometimes referred to as multi-layer perceptrons (MLPs).” (Specification, paragraph [0033], line 1.) This is the only mention of a multi-layer perceptron in the specification. In addition, the claim recites “…fully-connected two-layer Multilayer Perceptron (MLP) within the neural network…”. Therefore, Examiner is interpreting the fully-connected two-layer multilayer perceptron as two fully-connected layers within the neural network. In other words, two fully-connected layers is a fully-connected two-layer multilayer perceptron, and the MLP being within the network is the MLP being smaller than the neural network.)
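For illustration only, the two fully-connected layers with a bottleneck that Chen describes may be sketched as follows. This is the examiner’s sketch; the dimensions and the nonlinearity between the layers are illustrative assumptions.

```python
# Illustrative sketch of a bottleneck projection (examiner's construction).
import torch

h, c, bottleneck = 2048, 20000, 64           # illustrative sizes
gater_head = torch.nn.Sequential(
    torch.nn.Linear(h, bottleneck),          # roughly h * 64 parameters
    torch.nn.ReLU(),                         # nonlinearity placement is assumed
    torch.nn.Linear(bottleneck, c),          # roughly 64 * c parameters
)
f = torch.randn(1, h)
g_prime = gater_head(f)                      # real-valued vector g' of size c
```

Routing the projection through a small bottleneck needs on the order of (h + c) × 64 parameters rather than the h × c required by a single-layer projection, which is the parameter saving Chen describes.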
Regarding claim 5,
The combination of Chen, Veit, Zhang, and Alippi teaches the method of claim 1, wherein the CDF loss function
measures a sum of squared differences between expected and actual cumulative distributions of samples (Zhang, page 3, paragraph 3, line 1 “After generating training data by either one of these estimation methods, a multilayer neural network H(x; Θ), where Θ stands for model parameters, is trained with a loss function for regression. Since there are many choices for the loss function, we chose the square loss function in this work.” In other words, estimation is expected, a loss function for regression is measures expected and actual cumulative distributions, and the square loss function is measures a sum of squared differences.).
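For illustration only, a loss measuring the sum of squared differences between the expected (model) and actual (empirical) cumulative distributions may be sketched as follows (examiner’s sketch; names are illustrative).

```python
# Illustrative sketch of a squared-difference CDF loss (examiner's construction).
import torch

def cdf_loss(model_cdf: torch.Tensor, samples: torch.Tensor, grid: torch.Tensor):
    """model_cdf: model CDF evaluated on grid; samples: observed data points."""
    empirical = (samples[None, :] <= grid[:, None]).float().mean(dim=1)
    return ((model_cdf - empirical) ** 2).sum()   # sum of squared differences
```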
Regarding claim 6,
The combination of Chen, Veit, Zhang, and Alippi teaches the method of claim 3, wherein implementing the gating functionality in the fully-connected two-layer MLP within the neural network comprises
implementing the gating functionality without reliance on original input data (Chen, Figure 1. In other words, the model architecture is implementing gating functionality, and the architecture is created before any input data is received, which is implementing the gating functionality without relying on the original input data.)
Regarding claim 8,
The combination of Chen, Veit, Zhang, and Alippi teaches the method of claim 1, wherein
each of the two or more filters in the layer are associated with a respective one of two or more gating functionality components (Chen, Figure 1. In other words, Figure 1 shows that the two or more filters in the layer are associated with two or more gating functionality components.).
Regarding claim 9,
The combination of Chen, Veit, Zhang, and Alippi teaches the method of claim 8, further comprising enforcing conditionality on the gating functionality components by
back propagating a loss function to approximate a discrete decision of at least one of the gating functionality components with a continuous representation (Chen, page 2, column 1, paragraph 3, line 10 “We propose a new framework for dynamic filter selection in CNNs. The core of the idea is to introduce a dedicated gater network to take a glimpse of the input, and then generate input-dependent binary gates to select filters in the backbone network for processing the input. By using Improved SemHash, the gater network can be jointly trained with the backbone in an end-to-end fashion through back-propagation.” And, page 4, column 1, paragraph 3, line 1 “During training, we first draw noise from a c-dimensional Gaussian distribution…” In other words, jointly trained with the backbone in an end-to-end fashion through back-propagation is enforcing conditionality on the gating functionality by back propagating a loss function, binary gates to select filters is gating functionality, and the Gaussian distribution is a continuous representation.), wherein
back propagating the loss function comprises performing Batch-wise conditional regularization operations (Veit, page 8, paragraph 4, line 4 “Since ResNet 110 might be over-parameterized for CIFAR-10, the regularization induced by dropping layers could be a key factor to performance.” In other words, dropping layers is conditional in reference to gated execution and regularization is regularization operations.) to
match batch-wise statistics of one or more of the gating functionality components to a prior distribution (Veit, Fig. 2, page 7, paragraph 3, line 6 “We approximate the execution rates for each layer over each mini-batch and penalize deviations from the target rate.” And, page 2, paragraph 2, line 1 “In this work, we propose ConvNet-AIG, a convolutional network that adaptively defines its inference graph conditioned on the input image. Specifically, ConvNet-AIG learns a set of convolutional layers and decides for each input image which layers are needed.” And page 2, paragraph 3, line 2 “The key difference is that for each residual layer, a gate determines whether the layer is needed for the current input image… To incorporate the discrete decisions, we build upon recent work [4,18, 24] that introduces differentiable approximations for discrete stochastic nodes in neural networks. In particular, we model the gates as discrete random variable over two states: to execute the respective layer or to skip it. Further, we model the gates conditional on the output of the previous layer.” And, page 6, paragraph 2, line 2 “For this, we build upon recent work that propose approaches for propagating gradients through stochastic neurons [4,20]. In particular, we utilize the Gumbel-Max trick [9] and its recent continuous relaxation [18, 24].” Examiner notes that it is known in the art that the Gumbel-Max trick (Specification of the instant application paragraphs [0062], and [0072 -0075]) samples from a categorical distribution, which is a prior distribution. (See Dinh, Gumbel-Max Trick Inference, page 1, paragraph 2, for reference. “The Gumbel-Max Trick is a method to sample from a categorical distribution…”. The Gumbel-Max Trick is recited in Veit. Dinh is only used for general description of the method and as support for the assertion of “known in the art”. Dinh is included in the PTO-892.) In other words, mini-batch is batch, approximate execution rates is batch-wise statistics, gates is at least one of the two or more gating functionality components, and utilize the Gumbel-Max trick is match batch-wise statistics to a prior distribution.).
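For illustration only, the batch-wise regularization Veit describes (penalizing deviation of each gate’s mini-batch execution rate from a target rate) may be sketched as follows (examiner’s sketch; the target value is illustrative).

```python
# Illustrative sketch of batch-wise target-rate regularization (examiner's construction).
import torch

def target_rate_loss(gate_decisions: torch.Tensor, target: float = 0.7):
    """gate_decisions: batch x num_gates tensor of 0/1 execute decisions."""
    batch_rate = gate_decisions.float().mean(dim=0)   # batch-wise statistic per gate
    return ((batch_rate - target) ** 2).sum()         # penalize deviation from target
```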
Regarding claim 10,
The combination of Chen, Veit, Zhang, and Alippi teaches the method of claim 1, wherein receiving the layer-specific input data that is specific to the current layer of the neural network comprises
receiving a set of three-dimensional input feature maps that form a channel of input feature maps (Chen, page 3, column 1, paragraph 5, line 1 “Let us first consider a standalone backbone CNN without the gater network. Given an input image x, the output of the l-th convolutional layer is a 3-D feature map O_l(x). In a conventional CNN, O_l(x) is computed as:

    O_l^i(x) = φ(W_l^i * I_l(x))    (1)

where O_l^i(x) is the i-th channel of feature map O_l(x), W_l^i is the i-th 3-D filter, I_l(x) is the 3-D input feature map to the l-th layer, φ denotes the element-wise nonlinear activation function, and * denotes convolution.” And, page 3, column 1, paragraph 3, line 3 “Given an input, the gater network decides the set of filters in the backbone network for use while the backbone network does the actual prediction.” In other words, 3-D input feature map is three-dimensional input feature map and set of filters is two or more three-dimensional filters.).
Claims 11-13 and 15-16 are computing device claims corresponding to method claims 1-3 and 5-6, respectively. Otherwise, they are the same. Veit teaches a computing device (Veit, page 3, paragraph 3, line 1 “Our approach can be seen as an example of adaptive computation for neural networks. Cascaded classifiers [32] have a long tradition for computer vision by quickly rejecting “easy” negatives. Recently, similar approaches have been proposed for neural networks [22, 33]. In an alternative direction, [3, 26] propose to adjust the amount of computation in fully-connected neural networks.” In other words, computation for neural networks is performed by a computing device.) Therefore, claims 11-13 and 15-16 are rejected for the same reasons as claims 1-3 and 5-6, respectively.
Claim 18 is a computing device claim corresponding to the combination of method claims 8 and 9. Otherwise, it is the same. Therefore, claim 18 is rejected for the same reasons as the combination of method claims 8 and 9.
Claim 19 is a computing device claim corresponding to method claim 10. Otherwise, they are the same. Therefore, claim 19 is rejected for the same reasons as claim 10.
Claim 20 is directed to a non-transitory processor-readable storage medium having stored thereon processor-executable instructions, and corresponds to method claim 1. Otherwise, they are the same. One of ordinary skill in the art would understand the primary reference as teaching a non-transitory computer-readable storage medium having stored thereon processor-executable instructions in order to execute the method of claim 1. “[I]n considering the disclosure of a reference, it is proper to take into account not only specific teachings of the reference but also the inferences which one skilled in the art would reasonably be expected to draw therefrom.” MPEP § 2144.01. Therefore, claim 20 is rejected for the same reasons as claim 1.
Claims 7 and 17 are rejected under 35 U.S.C. § 103 as being unpatentable over Chen, Veit, Zhang, Alippi, and Louizos et al. (Learning Sparse Neural Networks Through L0 Regularization, herein Louizos).
Regarding claim 7,
The combination of Chen, Veit, Zhang, and Alippi teaches the method of claim 3, wherein the gating functionality is implemented in the fully-connected two-layer MLP within the neural network.
Thus far, the combination of Chen, Veit, Zhang, and Alippi does not explicitly teach using a complexity loss term based on L0 regularization to achieve network sparsity.
Louizos teaches using a complexity loss term based on L0 regularization to achieve network sparsity (Louizos, page 1, abstract, line 1 “We propose a practical method for L0 norm regularization for neural networks: pruning the network during training by encouraging weights to become exactly zero. Such regularization is interesting since (1) it can greatly speed up training and inference, and (2) it can improve generalization.” And, page 2, paragraph 2, line 1 “One way to sparsify parametric models, such as deep neural networks, with the least assumptions about the parameters is the following; let D be a dataset consisting of N i.i.d. input output pairs {(x1, y1), …, (xN, yN)} and consider a regularized empirical risk minimization procedure with an L0 regularization on the parameters θ of a hypothesis (e.g. a neural network) h(·; θ):… where |θ| is the dimensionality of the parameters, λ is a weighting factor for the regularization and L(·) corresponds to a loss function, e.g. cross-entropy loss for classification or mean-squared error for regression.” In other words, L0 regularization is L0 regularization, to sparsify parametric models is to achieve network sparsity, and L(·) is the loss term based on L0 regularization.)
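For illustration only, an expected-L0 complexity term in the spirit of Louizos may be sketched as follows. This is the examiner’s own simplified sketch, not the authors’ code; the constants follow the hard-concrete parameterization described in the paper, and the gate dimensionality is illustrative.

```python
# Illustrative sketch of an expected-L0 complexity loss (examiner's construction).
import math
import torch

log_alpha = torch.zeros(256, requires_grad=True)   # one location parameter per gate
beta, gamma, zeta = 2.0 / 3.0, -0.1, 1.1           # hard-concrete constants from the paper

def expected_l0(log_alpha: torch.Tensor) -> torch.Tensor:
    # Expected number of non-zero gates: sum of P(gate != 0) per gate.
    return torch.sigmoid(log_alpha - beta * math.log(-gamma / zeta)).sum()

complexity_loss = expected_l0(log_alpha)  # added to the task loss with a weight lambda
```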
Both Louizos and the combination of Chen, Veit, Zhang, and Alippi are directed to speeding up execution of neural networks, among other things. The combination of Chen, Veit, Zhang, and Alippi teaches a method of activating gating within a current layer of a neural network that includes two or more filters, that uses matching a distribution of an output of the gating functionality using a cumulative distribution function (CDF) loss function; but does not explicitly teach using a complexity loss term based on L0 regularization to achieve network sparsity. Louizos teaches using a complexity loss term based on L0 regularization to achieve network sparsity.
In view of the teaching of the combination of Chen, Veit, Zhang, and Alippi, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Louizos into the combination of Chen, Veit, Zhang, and Alippi. This would result in a method of activating gating within a current layer of a neural network that includes two or more filters, in which a distribution of an output of the gating functionality is matched using a cumulative distribution function (CDF) loss function and a complexity loss term based on L0 regularization is used to achieve network sparsity.
One of ordinary skill in the art would be motivated to do this because regularization can speed up execution. (Louizos, abstract line 1 “We propose a practical method for L0 norm regularization for neural networks: pruning the network during training by encouraging weights to become exactly zero. Such regularization is interesting since (1) it can greatly speed up training and inference, and (2) it can improve generalization.”)
Claim 17 is a computing device claim corresponding to method claim 7. Otherwise, they are the same. Therefore, claim 17 is rejected for the same reasons as claim 7.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to BART RYLANDER whose telephone number is (571)272-8359. The examiner can normally be reached Monday - Thursday 8:00 to 5:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang can be reached at 571-270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/B.I.R./Examiner, Art Unit 2124
/MIRANDA M HUANG/Supervisory Patent Examiner, Art Unit 2124