DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .
Status of Claims
Claims 1-20 are pending and examined herein.
Claims 1-20 are objected to.
Claims 1-20 are rejected under 35 U.S.C. 112(b).
Claims 1-20 are rejected under 35 U.S.C. 101.
Claims 1, 3-8, 10-15, and 17-20 are rejected under 35 U.S.C. 102.
Claims 2, 9, and 16 are rejected under 35 U.S.C. 103.
Claim Objections
Claims 5, 12, and 19 are objected to because of the following informalities: “a loss functions” should be “a loss function”. Appropriate correction is required.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 1-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Claims 1, 8, and 15 recite the limitation "the individual ones of the one or more neurons in the respective layers" in the third-to-last paragraph of each claim. There is insufficient antecedent basis for this limitation in the claims. Note that the claims recite “individual ones of the respective layers”, which is not the same as "the individual ones of the one or more neurons in the respective layers"
Claims 1, 8, and 15 recite the limitation "the respective sets of auxiliary parameters" in the third-to-last paragraph of each claim. There is insufficient antecedent basis for this limitation in the claims. Note that the claims recite “a set of auxiliary parameters” which does not correspond to "the respective sets of auxiliary parameters"
Claims 1, 8, and 15 recite recites the limitation "the respective sets of weighting factors” in the third-to-last paragraph of each claim. There is insufficient antecedent basis for this limitation in the claims. Note that the claims recite “a set of weighting factors” which does not correspond to "the respective sets of weighting factors".
Dependent claims 2-7, 9-14, and 16-20 fail to resolve the issues and are rejected with the same rationales.
Claims 6, 13, and 20 recite the limitation "the respective gating parameters". There is insufficient antecedent basis for this limitation in the claims.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
MPEP § 2106(III) sets out steps for evaluating whether a claim is drawn to patent-eligible subject matter. The analysis of claims 1-20 in accordance with these steps follows.
Step 1 Analysis:
Step 1 is to determine whether the claim is directed to a statutory category (process, machine, manufacture, or composition of matter). Claims 1-7 are directed to a process, claims 8-14 are directed to an article of manufacture, and claims 15-20 are directed to a machine. All claims are directed to statutory categories and the analysis proceeds.
Step 2A Prong One, Step 2A Prong Two, and Step 2B Analysis:
Step 2A Prong One asks if the claim recites a judicial exception (abstract idea, law of nature, or natural phenomenon). If the claim recites a judicial exception, analysis proceeds to Step 2A Prong Two, which asks if the claim recites additional elements that integrate the abstract idea into a practical application. If the claim does not integrate the judicial exception, analysis proceeds to Step 2B, which asks if the claim amounts to significantly more than the judicial exception. If the claim does not amount to significantly more than the judicial exception, the claim is not eligible subject matter under 35 U.S.C. 101.
None of the claims represent an improvement to technology.
Regarding claim 1, the following are abstract ideas:
identifying one or more neurons for deletion according to the respective sets of auxiliary parameters and a regularization penalty for the neural network; and (Identifying a neuron for deletion according to parameters and a regularization penalty can be practically performed in the human mind. This is a mental process.)
The following claim elements are additional elements which, taken alone or in combination with the other additional elements, do not integrate the judicial exception into a practical application nor amount to significantly more than the judicial exception:
A method comprising:
training a neural network comprising a plurality of neuron layers, the training comprising performing, for a training batch of a plurality of training batches: (This recites generic machine learning components and processes; this amounts to mere instructions to apply an exception.)
training respective layers of the plurality of layers according to the training batch, wherein individual ones of the respective layers respectively comprise one or more neurons individually comprising a set of auxiliary parameters and a set of weighting factors, and wherein training individual ones of the respective layers updates the respective sets of auxiliary parameters and the respective sets of weighting factors of the individual ones of the one or more neurons in the respective layers; (This recites generic machine learning components and processes; this amounts to mere instructions to apply an exception.)
deleting the identified one or more neurons from the neural network prior to completion of the training batch. (This describes the generic machine learning process of structured pruning; this amounts to mere instructions to apply an exception.)
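By way of illustration only, the identification and deletion steps as characterized above can be expressed in a minimal sketch. The values, the thresholding rule, and the array shapes below are the examiner's hypothetical assumptions and are not drawn from Applicant's disclosure:

```python
import numpy as np

aux = np.array([0.05, 0.90, 0.02, 0.60])        # hypothetical auxiliary parameters, one per neuron
penalty = 0.1                                    # hypothetical cutoff standing in for a regularization penalty
weights = np.eye(4)                              # placeholder weighting factors, one row per neuron

to_delete = np.where(aux < penalty)[0]           # "identifying one or more neurons for deletion"
weights = np.delete(weights, to_delete, axis=0)  # "deleting the identified ... neurons" mid-training
print(to_delete, weights.shape)                  # e.g. [0 2] (2, 4)
```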
Regarding claim 2, the rejection of claim 1 is incorporated herein. Further, the following is an abstract idea:
integrating, subsequent to completion of training of the plurality of training batches, the respective auxiliary parameters into the respective sets of weighting factors for individual ones of the one or more neurons in the respective layers; and (Combining the auxiliary parameters into the weighting factors can be practically performed in the human mind, i.e. performing multiplication of the weighting factors by the value of the auxiliary parameters. This is a mental process.)
The following claim elements are additional elements which, taken alone or in combination with the other additional elements, do not integrate the judicial exception into a practical application nor amount to significantly more than the judicial exception:
removing the respective auxiliary parameters from the neural network. (This describes the generic machine learning process of unstructured pruning; this amounts to mere instructions to apply an exception.)
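By way of illustration only, the "integrating" and "removing" steps as interpreted above, i.e., multiplying each neuron's weighting factors by its auxiliary parameter and then discarding the auxiliary parameters, can be sketched as follows (names, shapes, and values are hypothetical):

```python
import numpy as np

# Hypothetical post-training state: per-neuron auxiliary parameters and weighting factors.
aux = np.array([0.0, 1.0, 1.0])              # auxiliary parameter per neuron (0 = pruned)
weights = np.array([[0.2, -0.5, 0.1],
                    [0.7,  0.3, -0.9],
                    [0.4, -0.1, 0.6]])        # one row of weighting factors per neuron

# "Integrating": multiply each neuron's weighting factors by its auxiliary parameter.
weights = aux[:, None] * weights

# "Removing": only the folded weights remain; the auxiliary parameters are discarded.
del aux
print(weights)
```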
Regarding claim 3, the rejection of claim 1 is incorporated herein. Further, the following are abstract ideas:
deriving respective gating parameters for inputs to respective neurons of the layer according to the respective sets of auxiliary parameters; and (Deriving gating parameters can be practically performed in the human mind, i.e. performing a calculation to derive a gating parameter from the auxiliary parameters. This is a mental process.)
multiplying the respective gating parameters to the inputs to respective neurons to generate gated inputs to be applied to the respective sets of weighting factors of the neurons. (This is a mathematical calculation, which is a mathematical concept.)
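By way of illustration only, the derivation of gating parameters from auxiliary parameters and their multiplication onto the inputs, as characterized above, can be sketched as follows (the step-function gate and all values are hypothetical assumptions, not Applicant's method):

```python
import numpy as np

aux = np.array([-0.3, 0.8, 1.2])      # hypothetical auxiliary parameters, one per input
gate = (aux >= 0.0).astype(float)     # derive a 0/1 gating parameter from each auxiliary parameter
x = np.array([2.0, -1.0, 0.5])        # inputs to the neurons
gated_x = gate * x                    # multiply the gating parameters onto the inputs

W = np.array([[0.1, 0.2, 0.3],
              [0.4, 0.5, 0.6]])       # weighting factors of two neurons
out = W @ gated_x                     # gated inputs applied to the weighting factors
print(out)
```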
Regarding claim 4, the rejection of claim 1 is incorporated herein. The following claim elements are additional elements which, taken alone or in combination with the other additional elements, do not integrate the judicial exception into a practical application nor amount to significantly more than the judicial exception:
wherein training the respective layers of the plurality of layers is performed using a stochastic gradient descent technique. (This is a generic machine learning process; this amounts to mere instructions to apply an exception.)
Regarding claim 5, the rejection of claim 4 is incorporated herein. Further, the following is an abstract idea:
wherein the stochastic gradient descent technique employs a loss functions comprising a differentiable regularization term which favors a lesser total number of auxiliary parameters in the network. (A loss function is a mathematical equation/formula, which is a mathematical concept.)
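By way of illustration only, a loss of the kind characterized above, a task loss plus a differentiable penalty that favors fewer active auxiliary parameters, can be sketched as follows (the sigmoid surrogate and the weighting constant are the examiner's hypothetical choices, not Applicant's loss function):

```python
import numpy as np

def soft_gate(aux):
    # Differentiable surrogate for "how many auxiliary parameters are active".
    return 1.0 / (1.0 + np.exp(-aux))

def total_loss(task_loss, aux, lam=0.01):
    # Task loss plus a differentiable regularization term; minimizing it
    # favors a lesser total number of (active) auxiliary parameters.
    return task_loss + lam * np.sum(soft_gate(aux))

print(total_loss(0.42, np.array([-2.0, 0.5, 3.0])))
```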
Regarding claim 6, the rejection of claim 4 is incorporated herein. The following claim elements are additional elements which, taken alone or in combination with the other additional elements, do not integrate the judicial exception into a practical application nor amount to significantly more than the judicial exception:
wherein the respective gating parameters are non-stochastic. (This is the insignificant extra-solution activity of ‘Selecting a particular data source or type of data to be manipulated’. See MPEP 2106.05(g), ‘Selecting a particular data source or type of data to be manipulated’, ex. i-iv.)
Regarding claim 7, the rejection of claim 1 is incorporated herein. The following is an abstract idea:
identifying the one or more neurons for deletion (Identifying neurons for deletion can be practically performed in the human mind. This is a mental process.)
The following claim elements are additional elements which, taken alone or in combination with the other additional elements, do not integrate the judicial exception into a practical application nor amount to significantly more than the judicial exception:
wherein training the respective layers, … and deleting the identified one or more neurons is performed for more than one of the plurality of training batches (Training in batches is a generic machine learning concept. Iterative pruning during training, meaning that the neurons are deleted for more than one training batch, is also a generic machine learning concept. This amounts to mere instructions to apply an exception.)
Regarding claim 8, the following claim elements are additional elements which, taken alone or in combination with the other additional elements, do not integrate the judicial exception into a practical application nor amount to significantly more than the judicial exception:
One or more non-transitory computer-accessible storage media storing program instructions that when executed on or across one or more computing devices cause the one or more computing devices to implement: (This recites generic computer components and processes. This amounts to mere instructions to apply an exception.)
The remainder of claim 8 recites substantially similar subject matter to claim 1 and is rejected with the same rationale, mutatis mutandis.
Claims 9-14 recite substantially similar subject matter to claims 2-7 respectively and are rejected with the same rationale, mutatis mutandis.
Regarding claim 15, the following claim elements are additional elements which, taken alone or in combination with the other additional elements, do not integrate the judicial exception into a practical application nor amount to significantly more than the judicial exception:
A system, comprising: one or more processors; and a memory storing program instructions that when executed by the one or more processors cause the one or more processors to implement a machine learning system configured to (This recites generic computer components and processes. This amounts to mere instructions to apply an exception.)
The remainder of claim 15 recites substantially similar subject matter to claim 1 and is rejected with the same rationale, mutatis mutandis.
Claims 16-20 recite substantially similar subject matter to claims 2-6 respectively and are rejected with the same rationale, mutatis mutandis.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claim(s) 1, 3-8, 10-15, and 17-20 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Xiao (“AutoPrune: Automatic Network Pruning by Regularizing Auxiliary Parameters”, 2019).
Regarding claim 1, Xiao teaches
A method comprising: (The abstract states “Our method can automatically eliminate network redundancy with recoverability, relieving the complicated prior knowledge required to design thresholding functions, and reducing the time for trial and error.”)
training a neural network comprising a plurality of neuron layers, the training comprising performing, for a training batch of a plurality of training batches: (Page 7 states “We first use MNIST dataset to evaluate the performance. Layer structure of LeNet-300-100 is [784, 300, 100, 10] and of LeNet5 is two [20-50] convolution layers, followed by two FC layers. The total number of trainable parameters of LeNet-300-100 and LeNet5 are 267K and 431K, respectively. Similar to previous works, we train reference models with standard training method with SGD optimizer, achieving accuracy of 1.72% and 0.78% respectively.” Page 6, Algorithm 1 states “12: Sample a mini batch from X_train; 13: Compute grad_w with L1; 14: Update W with grad_w;”. These lines are surrounded by a while loop, meaning the training is performed for a plurality of training batches.)
training respective layers of the plurality of layers according to the training batch, wherein individual ones of the respective layers respectively comprise one or more neurons individually comprising a set of auxiliary parameters and a set of weighting factors, and wherein training individual ones of the respective layers updates the respective sets of auxiliary parameters and the respective sets of weighting factors of the individual ones of the one or more neurons in the respective layers; (Page 6, Algorithm 1 states “12: Sample a mini batch from X_train; 13: Compute grad_w with L1; 14: Update W with grad_w;”. These lines are surrounded by a while loop, meaning the training is performed for a plurality of training batches. Page 3 states "Without losing generality, our method is formulated on weight pruning, but it can be directly extended to neuron pruning." Note that the method is performed for neuron pruning on page 7. Page 3 states "Instead of designing an indicator function for each W_ij manually, we propose to parameterized a universal indicator function by a set of trainable auxiliary parameters M." Therefore, as one of ordinary skill in the art would understand, when performing this method for neuron pruning, the auxiliary parameters would be for the neurons, which are comprised in layers, instead of the weights. The weights of the neurons are interpreted as the weighting factors. Page 6, Algorithm 1 states “7: while iter!=0 do 8: Sample a minibatch from X_val; 9: Compute grad_w with L1; 10: Compute grad_m and grad_mr by Eq. 7; 11: Update M with grad_m and grad_mr; 12: Sample a mini batch from X_train; 13: Compute grad_w with L1; 14: Update W with grad_w; 15: Update iter: λ, μ (if scheduling); 16: end while”. Therefore, the weights W, interpreted as weighting factors, and the auxiliary parameters M are updated.)
identifying one or more neurons for deletion according to the respective sets of auxiliary parameters and a regularization penalty for the neural network; and (Page 5 states "In other words, the auxiliary parameter m_ij tracks the changing of the magnitude of w_ij. For the pruning task, when the absolute value of a weight/neuron keeps moving towards zero, we should accelerate the pruning process of the weight/neuron." Therefore, the auxiliary parameter identifies neurons for pruning, interpreted as deletion. Page 4 states “The update rule of m_ij is defined as m_ij := m_ij − η(∂L_acc/∂t_ij)·sgn(w_ij)·(∂h(m_ij)/∂m_ij) − μ(∂h(m_ij)/∂m_ij) (7), where L_acc denotes L(f(x_i, W ⨀ h(M)), y_i), η is the learning rate of m_ij, t_ij = w_ij ⨀ h(m_ij), the second term can be considered as the gradient of m_ij, ∂t_ij/∂m_ij, and the third term is related to the sparse regularizer.” Therefore, as the auxiliary parameters have a regularizer, the neurons are identified for deletion according to the regularization penalty. As the auxiliary parameters are for the neural network, the regularization penalty is for the neural network.)
deleting the identified one or more neurons from the neural network prior to completion of the training batch. (Page 6, Algorithm 1 states “7: while iter!=0 do 8: Sample a minibatch from X_val; 9: Compute grad_w with L1; 10: Compute grad_m and grad_mr by Eq. 7; 11: Update M with grad_m and grad_mr; 12: Sample a mini batch from X_train; 13: Compute grad_w with L1; 14: Update W with grad_w; 15: Update iter: λ, μ (if scheduling); 16: end while” Page 4 states “The update rule of m_ij is defined as m_ij := m_ij − η(∂L_acc/∂t_ij)·sgn(w_ij)·(∂h(m_ij)/∂m_ij) − μ(∂h(m_ij)/∂m_ij) (7), where L_acc denotes L(f(x_i, W ⨀ h(M)), y_i), η is the learning rate of m_ij, t_ij = w_ij ⨀ h(m_ij), the second term can be considered as the gradient of m_ij, ∂t_ij/∂m_ij, and the third term is related to the sparse regularizer.” Page 3 states “We relax this problem by introducing a indicator function defined as: h_ij = 0 if w_ij is pruned, 1 otherwise.” Page 3 also states “We also denote the element-wise product T = W ⨀ h(M) as the weight matrix after pruning.” Page 3 states "Without losing generality, our method is formulated on weight pruning, but it can be directly extended to neuron pruning." Note that the method is performed for neuron pruning on page 7. One of ordinary skill in the art would understand that w and W would represent neurons for neuron pruning. As the loss function includes W ⨀ h(M), which represents the pruned neurons, and the loss function is implemented before the completion of the training batch, the neurons are deleted from the neural network prior to completion of the training batch.)
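For clarity of the record, the examiner's reading of the cited portions of Xiao's Algorithm 1 is summarized in the following schematic sketch. This is not Xiao's code; the linear model, the step gate, the gradient expressions, and the sparsity push are simplified stand-ins chosen by the examiner for Eq. 7 and the losses L1/L2:

```python
import numpy as np

rng = np.random.default_rng(0)

def h(M):
    # Step-function gate derived from the auxiliary parameters: 1 = keep neuron, 0 = pruned.
    return (M >= 0.0).astype(float)

# Hypothetical one-layer linear model standing in for a network layer: 4 inputs -> 3 neurons.
W = rng.normal(size=(3, 4))      # weighting factors (one row per neuron)
M = rng.normal(size=(3, 1))      # auxiliary parameters (one per neuron)

def grads(X, Y, W, M):
    # Squared-error loss on the gated weights T = h(M) * W (a crude stand-in for L1/L2).
    T = h(M) * W
    err = X @ T.T - Y
    grad_w = h(M) * (err.T @ X) / len(X)                               # gradient w.r.t. W through the gate
    grad_m = np.sum((err.T @ X) * W, axis=1, keepdims=True) / len(X)   # straight-through surrogate for Eq. 7
    return grad_w, grad_m

for it in range(50):                                    # stands in for "while iter != 0"
    Xval, Yval = rng.normal(size=(8, 4)), rng.normal(size=(8, 3))
    _, grad_m = grads(Xval, Yval, W, M)                 # lines 8-11: update M on a validation mini-batch
    M -= 0.1 * grad_m + 0.01 * np.sign(M)               # sign term: crude push of auxiliary params toward zero
    Xtr, Ytr = rng.normal(size=(8, 4)), rng.normal(size=(8, 3))
    grad_w, _ = grads(Xtr, Ytr, W, M)                   # lines 12-14: update W on a training mini-batch
    W -= 0.1 * grad_w
    # A neuron whose gate h(m) is 0 contributes nothing from this point on,
    # i.e. it is effectively deleted before training on the batch completes.

print("open gates:", int(h(M).sum()))
```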
Regarding claim 3, the rejection of claim 1 is incorporated herein. Further, Xiao teaches
deriving respective gating parameters for inputs to respective neurons of the layer according to the respective sets of auxiliary parameters; and (Page 3 states “We relax this problem by introducing a indicator function defined as: h_ij = 0 if w_ij is pruned, 1 otherwise.” Page 3 also states “We also denote the element-wise product T = W ⨀ h(M) as the weight matrix after pruning.” Page 4 states "The indicator function h_ij contains only zero and one values and thus is non-smooth and nondifferentiable. Inspired by Hubara et al. [2016] where binary weights are represented using step functions and trained with hard sigmoid straight through estimator (STE), we use a simple step function for indicator function h_ij with trainable parameter m_ij." As the auxiliary parameter is m_ij, the indicator function h_ij, interpreted as the gating parameter, is derived from the auxiliary parameters.)
multiplying the respective gating parameters to the inputs to respective neurons to generate gated inputs to be applied to the respective sets of weighting factors of the neurons. (Page 3 states “We also denote the element-wise product T = W ⨀ h(M) as the weight matrix after pruning.” h(M) is the gating parameter and W is the weighting factor of the neuron. One of ordinary skill in the art would realize that the resulting weight matrix T is multiplied by the inputs, meaning that the inputs are also multiplied by the gating parameter.)
Regarding claim 4, the rejection of claim 1 is incorporated herein. Further, Xiao teaches
wherein training the respective layers of the plurality of layers is performed using a stochastic gradient descent technique. (Page 7 states “We first use MNIST dataset to evaluate the performance. Layer structure of LeNet-300-100 is [784, 300, 100, 10] and of LeNet5 is two [20-50] convolution layers, followed by two FC layers. The total number of trainable parameters of LeNet-300-100 and LeNet5 are 267K and 431K, respectively. Similar to previous works, we train reference models with standard training method with SGD optimizer, achieving accuracy of 1.72% and 0.78% respectively.” One of ordinary skill in the art would realize that SGD stands for stochastic gradient descent.)
Regarding claim 5, the rejection of claim 4 is incorporated herein. Further, Xiao teaches
wherein the stochastic gradient descent technique employs a loss functions comprising a differentiable regularization term which favors a lesser total number of auxiliary parameters in the network. (Page 7 states “We first use MNIST dataset to evaluate the performance. Layer structure of LeNet-300-100 is [784, 300, 100, 10] and of LeNet5 is two [20-50] convolution layers, followed by two FC layers. The total number of trainable parameters of LeNet-300-100 and LeNet5 are 267K and 431K, respectively. Similar to previous works, we train reference models with standard training method with SGD optimizer, achieving accuracy of 1.72% and 0.78% respectively.” One of ordinary skill in the art would realize that SGD stands for stochastic gradient descent. Pages 3-4 state “The training set will be split into X_train and X_val, and we can further re-formulate the problem from minimizing a single loss function to minimizing the following loss functions iteratively. min_w L1 = min_w ∑_{i=1..N} L(f(x_i, W ⨀ h(M)), y_i) + λR(W), x_i ∈ X_train (4); min_w L1 = min_m ∑_{i=1..N} L(f(x_i, W ⨀ h(M)), y_i) + μR(W), x_i ∈ X_val (5)” Page 6 states “In order to accelerate the pruning process, we bring in regularizers to force the mask values to approach zero. The sparse regularizer is defined as: R(h(M)) = ∑_{i,j} |h(m_ij)| = count(h(M)). (12)” and “Note that the L1 regularizer applied on h(M) directly counts the number of gates that are open, which is equivalent to applying L0 regularizer on h(M).” The number of open gates is equivalent to the number of auxiliary parameters, as when the gate is closed, the value is 0, which is multiplied by the respective auxiliary parameter, meaning that the result would also be 0. As the loss function is minimized and the regularization term R(h(M)) counts the number of open gates, which is also the count of auxiliary parameters, it favors a lesser total number of auxiliary parameters in the network. As the function will evaluate to a number of gates, the function is differentiable.)
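By way of illustration only, the cited sparse regularizer of Eq. 12, as read by the examiner, amounts to the following count of open gates (the step-function gate and the values are hypothetical):

```python
import numpy as np

def h(M):
    # Binary indicator/gate values derived from the auxiliary parameters.
    return (M >= 0.0).astype(float)

M = np.array([[-1.2, 0.4],
              [0.9, -0.3]])              # hypothetical auxiliary parameters
R = np.sum(np.abs(h(M)))                 # R(h(M)) = sum |h(m_ij)| = count of open gates
print(R)                                 # here: 2.0
```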
Regarding claim 6, the rejection of claim 4 is incorporated herein. Xiao teaches
wherein the respective gating parameters are non-stochastic. (Page 4 states, in reference to the gating parameters, "The output of each weight is the output of the hard sigmoid binary function." As the gating parameter is the output of a function, the parameters are non-stochastic.)
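By way of illustration only, the distinction relied upon above can be sketched as follows: a gate computed deterministically from a hard-sigmoid-based function returns the same value on every evaluation (non-stochastic), in contrast to a sampled gate. The specific hard-sigmoid form and the values are the examiner's assumptions:

```python
import numpy as np

def hard_sigmoid(m):
    return np.clip(0.2 * m + 0.5, 0.0, 1.0)

m = np.array([-4.0, -0.5, 0.5, 4.0])     # hypothetical auxiliary parameters

deterministic_gate = (hard_sigmoid(m) >= 0.5).astype(float)                   # same output every call: non-stochastic
stochastic_gate = (np.random.rand(m.size) < hard_sigmoid(m)).astype(float)    # sampled: stochastic, shown for contrast

print(deterministic_gate)                # [0. 0. 1. 1.]
```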
Regarding claim 7, the rejection of claim 1 is incorporated herein. Xiao teaches
wherein training the respective layers, identifying the one or more neurons for deletion and deleting the identified one or more neurons is performed for more than one of the plurality of training batches (Page 6, Algorithm 1 states “7: while iter!=0 do 8: Sample a minibatch from X_val; 9: Compute grad_w with L1; 10: Compute grad_m and grad_mr by Eq. 7; 11: Update M with grad_m and grad_mr; 12: Sample a mini batch from X_train; 13: Compute grad_w with L1; 14: Update W with grad_w; 15: Update iter: λ, μ (if scheduling); 16: end while” Page 4 states “The update rule of m_ij is defined as m_ij := m_ij − η(∂L_acc/∂t_ij)·sgn(w_ij)·(∂h(m_ij)/∂m_ij) − μ(∂h(m_ij)/∂m_ij) (7), where L_acc denotes L(f(x_i, W ⨀ h(M)), y_i), η is the learning rate of m_ij, t_ij = w_ij ⨀ h(m_ij), the second term can be considered as the gradient of m_ij, ∂t_ij/∂m_ij, and the third term is related to the sparse regularizer.” Page 3 states “We relax this problem by introducing a indicator function defined as: h_ij = 0 if w_ij is pruned, 1 otherwise.” Page 3 also states “We also denote the element-wise product T = W ⨀ h(M) as the weight matrix after pruning.” Page 3 states "Without losing generality, our method is formulated on weight pruning, but it can be directly extended to neuron pruning." Note that the method is performed for neuron pruning on page 7. One of ordinary skill in the art would understand that w and W would represent neurons for neuron pruning. As the loss function includes W ⨀ h(M), which represents the pruned neurons, and the loss function is implemented before the completion of the training batch, the neurons are deleted from the neural network prior to completion of the training batch. As the identification and deletion of the neurons occurs within the while loop, which has a new training batch each time, it is performed for more than one training batch.)
Regarding claim 8, Xiao teaches
One or more non-transitory computer-accessible storage media storing program instructions that when executed on or across one or more computing devices cause the one or more computing devices to implement: (Page 7 states "Our models are implemented by Tensorflow and run on Ubuntu Linux 16.04 with 32G memory and a single NVIDIA Titan Xp GPU." One of ordinary skill in the art would realize that, in order to perform the method, a non-transitory computer-accessible storage media storing program instructions would be required for the GPU, interpreted as the computing device, to implement the method.)
The remainder of claim 8 recites substantially similar subject matter to claim 1 and is rejected with the same rationale, mutatis mutandis.
Claims 10-14 recite substantially similar subject matter to claims 3-7 respectively and are rejected with the same rationale, mutatis mutandis.
Regarding claim 15, Xiao teaches
A system, comprising: one or more processors; and a memory storing program instructions that when executed by the one or more processors cause the one or more processors to implement a machine learning system configured to (Page 7 states "Our models are implemented by Tensorflow and run on Ubuntu Linux 16.04 with 32G memory and a single NVIDIA Titan Xp GPU." One of ordinary skill in the art would realize that, in order to perform the method, a memory storing program instructions would be required for the GPU, interpreted as the processor, to implement the machine learning method.)
The remainder of claim 15 recites substantially similar subject matter to claim 1 and is rejected with the same rationale, mutatis mutandis.
Claims 17-20 recite substantially similar subject matter to claims 3-6 respectively and are rejected with the same rationale, mutatis mutandis.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claim(s) 2, 9, and 16 is/are rejected under 35 U.S.C. 103 as being unpatentable over Xiao (“AutoPrune: Automatic Network Pruning by Regularizing Auxiliary Parameters”, 2019), as applied to claims 1, 8, and 15 above, and further in view of Guo (“GDP: Stabilized Neural Network Pruning via Gates with Differentiable Polarization”, 2021).
Regarding claim 2, the rejection of claim 1 is incorporated herein. Xiao does not appear to explicitly teach
integrating, subsequent to completion of training of the plurality of training batches, the respective auxiliary parameters into the respective sets of weighting factors for individual ones of the one or more neurons in the respective layers; and
removing the respective auxiliary parameters from the neural network.
However, Guo—directed to analogous art—teaches
integrating, subsequent to completion of training of the plurality of training batches, the respective auxiliary parameters into the respective sets of weighting factors for individual ones of the one or more neurons in the respective layers; and (The abstract states "During the training process, the polarization effect will drive a subset of gates to smoothly decrease to exact zero, while other gates gradually stay away from zero by a large margin. When training terminates, those zero-gated channels can be painlessly removed, while other non-zero gates can be absorbed into the succeeding convolution kernel, causing completely no interruption to training nor damage to the trained model." The gates are interpreted as the auxiliary parameters. One of ordinary skill in the art would realize that absorbing the gates into the kernels means that the weights of the kernel, interpreted as the neuron, will be integrated with the gates.)
removing the respective auxiliary parameters from the neural network. (The abstract states "During the training process, the polarization effect will drive a subset of gates to smoothly decrease to exact zero, while other gates gradually stay away from zero by a large margin. When training terminates, those zero-gated channels can be painlessly removed, while other non-zero gates can be absorbed into the succeeding convolution kernel, causing completely no interruption to training nor damage to the trained model." Absorbing the gates means that the gates, interpreted as the auxiliary parameters, are removed from the neural network.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Xiao with the teachings of Guo because, as Guo states on page 5241, "Then we can safely remove channels corresponding to the zero-gates, and absorb the others to the successive convolution kernel. By this way, the sub-net can be got with performance same to the super-net."
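By way of illustration only, the "absorbing" operation relied upon from Guo can be sketched as follows; the channel counts, kernel shape, and gate values are the examiner's hypothetical assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
gates = np.array([0.0, 0.7, 1.3])                 # hypothetical per-channel gate values
next_kernel = rng.normal(size=(4, 3, 3, 3))       # (out_ch, in_ch, kH, kW) of the succeeding convolution

# Channels with zero gates are simply removed; non-zero gates are absorbed by
# scaling the corresponding input-channel slices of the succeeding kernel.
keep = gates != 0.0
absorbed = next_kernel[:, keep, :, :] * gates[keep][None, :, None, None]
print(absorbed.shape)                             # (4, 2, 3, 3): pruned channel dropped, gates folded in
```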
Claims 9 and 16 recite substantially similar subject matter to claim 2 and are rejected with the same rationale, mutatis mutandis.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JESSICA THUY PHAM whose telephone number is (571)272-2605. The examiner can normally be reached Monday - Friday, 9:00 A.M. - 5:00 P.M.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li Zhen, can be reached at (571) 272-3768. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/JESSICA T. PHAM/Examiner, Art Unit 2121
/Li B. Zhen/Supervisory Patent Examiner, Art Unit 2121