Prosecution Insights
Last updated: April 19, 2026
Application No. 18/147,297

NEURAL NETWORK DISTILLATION METHOD AND APPARATUS

Status: Non-Final Office Action (§103)
Filed: Dec 28, 2022
Examiner: MAHARAJ, DEVIKA S
Art Unit: 2123
Tech Center: 2100 — Computer Architecture & Software
Assignee: Huawei Technologies Co., Ltd.
OA Round: 1 (Non-Final)
Grant Probability: 55% (Moderate)
Expected OA Rounds: 1-2
Time to Grant: 5y 0m
Grant Probability with Interview: 63%

Examiner Intelligence

Career Allow Rate: 55% (43 granted / 78 resolved; at TC average)
Interview Lift: +7.7% for resolved cases with an interview (a moderate lift of roughly +8% over cases without one)
Typical Timeline: 5y 0m average prosecution; 28 applications currently pending
Career History: 106 total applications across all art units
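The headline figures in this panel are simple ratios of the counts shown. A minimal sketch of the arithmetic, with the caveat that the dashboard's exact rounding and interview-lift methodology are assumptions:

```python
# Hypothetical recomputation of the headline examiner statistics from the
# counts shown above; the dashboard's exact methodology is not published.
granted, resolved = 43, 78

allow_rate = granted / resolved                    # 0.551... shown as 55%
with_interview = allow_rate + 0.077                # stated +7.7% interview lift

print(f"Career allow rate: {allow_rate:.0%}")      # Career allow rate: 55%
print(f"With interview:    {with_interview:.0%}")  # With interview:    63%
```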

Statute-Specific Performance

§101: 27.4% (-12.6% vs TC avg)
§103: 42.8% (+2.8% vs TC avg)
§102: 10.1% (-29.9% vs TC avg)
§112: 16.6% (-23.4% vs TC avg)
Baseline is the Tech Center average estimate. Based on career data from 78 resolved cases.
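The per-statute deltas all point back to the same baseline, which is what the chart's single Tech Center average line represents. A quick check of that arithmetic (derived here from the numbers above, not a figure reported on the page):

```python
# Implied Tech Center baseline per statute: the examiner's rate minus the
# reported delta vs. the TC average (all values copied from the table above).
rates  = {"§101": 27.4, "§103": 42.8, "§102": 10.1, "§112": 16.6}
deltas = {"§101": -12.6, "§103": 2.8, "§102": -29.9, "§112": -23.4}

for statute, rate in rates.items():
    print(f"{statute}: implied TC average = {rate - deltas[statute]:.1f}%")
# Every statute implies the same ~40.0% baseline, i.e. a single TC-average
# reference level behind all four bars.
```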

Office Action (§103)

DETAILED ACTION 1. This communication is in response to the Application No. 18/147,297 filed on December 28, 2022 in which Claims 1-19 are presented for examination. Notice of Pre-AIA or AIA Status 2. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA . Information Disclosure Statement 3. The information disclosure statement submitted on 08/27/2024 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner. Claim Rejections - 35 USC § 103 4. The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made. 5. Claims 1-6, 9-11, and 14-17 are rejected under 35 U.S.C. 103 as being unpatentable over Huang et al. (hereinafter Huang) (US PG-PUB 20180365564), in view of Zhu et al. (hereinafter Zhu) (US PG-PUB 20210374506). Regarding Claim 1, Huang teaches a neural network distillation method (Huang, Abstract, “The method comprises: selecting, by a training device, a teacher network performing the same functions of a student network; and iteratively training the student network and obtaining a target network, through aligning distributions of features between a first middle layer and a second middle layer corresponding to the same training sample data, so as to transfer knowledge of features of a middle layer of the teacher network to the student network.”, thus, a neural network distillation/knowledge transfer method is disclosed), comprising: obtaining to-be-processed data (Huang, Par. [0050], “As for step 102B in embodiments of the present application, the training sample data used in each iteration of training is different.”, thus, to-be-processed data (training sample data) is obtained), a first neural network having a first neural network layer, and a second neural network having a second neural network layer (Huang, Par. 
[0069], “In some embodiments of the application, the first specific network layer is a middle network layer or the last network layer of the teacher network; and/or, the second specific network layer is a middle network layer or the last network layer of the student network.”, thus, a first neural network having a first neural network layer and a second neural network having a second neural network layer are obtained); processing the to-be-processed data by using the first neural network and the second neural network to obtain a first target output and a second target output (Huang, Claim 1, “[…] wherein, the features of the first middle layer refer to feature maps output from a first specific network layer of the teacher network after the training sample data are provided to the teacher network, and the features of the second middle layer refer to feature maps output from a second specific network layer of the student network after the training sample data are provided to the student network.”, thus, the to-be-processed data (training sample data) is processed by using the first neural network and the second neural network, to obtain a first target output (student output/feature map) and second target output (teacher output/feature map)), wherein the first target output is obtained by performing kernel function-based transformation on an output of the first neural network layer (Huang, Par. [0060-0061], “In some embodiments, the PNG media_image1.png 38 29 media_image1.png Greyscale MMD 2 (FT, FS) in the objective function of the embodiments could be formulated in equation (6) below: […] In equation (6), k(⋅,⋅) refers to a preset kernel function, CT refers to the number of channels of FT, CS refers to the number of channels of FS, fT i⋅ represents the vectorized feature map of the i-th channel of FT, fT i′⋅ represents the vectorized feature map of the i′-th channel of FT, fS j⋅ refers to the vectorized feature map of the j-th channel of FS, and fS J′⋅ refers to the vectorized feature map of the j′-th channel of FS.”, thus, the first target output (student output/feature map) is obtained by performing kernel function-based transformations on an output of the first neural network layer (evaluation of the objective function by the student network – See Par. [0056])), and the second target output is obtained by performing kernel function-based transformation on an output of the second neural network layer (See introduction of Zhu reference below for teaching of obtaining the second target output by performing kernel function-based transformation on an output of the second neural network layer); obtaining a target loss based on the first target output and the second target output (Huang, Par. [0057], “In equation (5), PNG media_image2.png 38 38 media_image2.png Greyscale (ytrue,pS) refers to the cross-entropy loss function, PNG media_image1.png 38 29 media_image1.png Greyscale MMD 2 (FT,FS) refers to the distance loss function, λ refers to the weight of distance loss function, FT refers to the feature map (i.e. the features of the first middle layer) output from the first specific network layer of the teacher network given the training sample data, FS refers to the feature map (i.e. 
the features of the second middle layer) output from the second specific network layer of the student network given the training sample data, ytrue refers to the ground-truth labels of the training sample data, and pS refers to the output classification probability of the student network.”, therefore, a cross-entropy loss (in preceding equation (5) within Par. [0056-0057]) is obtained based on the first target output (student output/feature map) and the second target output (teacher output/feature map)); and performing knowledge distillation on the first neural network based on at least the target loss and by using the second neural network as a teacher model and the first neural network as a student model to obtain an updated first neural network (Huang, Claim 1, “iteratively training the student network and obtaining a target network, through aligning distributions of features between a first middle layer and a second middle layer corresponding to the same training sample data, so as to transfer knowledge of features of a middle layer of the teacher network to the student network”, therefore, knowledge distillation/knowledge transfer is performed on the first neural network (student network) based on the target loss (See Huang Claim 2) by using the second neural network as a teacher model and the first neural network as a student model to obtain an updated first neural network (target network)). While Huang discloses performing kernel function-based transformations on an output of the second neural network layer (of the teacher network – See Par. [0060-0061] which disclose the kernel function applied to the teacher output/feature map but this objective function is seemingly only applied to the student/first neural network), Huang does not explicitly disclose the second target output is obtained by performing kernel function-based transformation on an output of the second neural network layer Zhu teaches the second target output is obtained by performing kernel function-based transformation on an output of the second neural network layer (Zhu, Par. [0027], “In a preferred embodiment of the present invention, further, the proposed optimized target in step (4) includes a maximum average loss of source domains and target domains, and the maximum average loss of source domains and target domains is: […] where K(.,.) represents a Gaussian radial kernel function; h2 s and h2 t respectively represent outputs of source domains and target domains in the hidden layer HL2; and a batch sample quantity of source domains and target domains is m/2”, therefore, the second target output (target domain model output) is obtained by performing a kernel function-based transformation (gaussian radial kernel function) on an output of the second neural network layer). It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the neural network distillation method, as disclosed by Huang to include where the second target output is obtained by performing kernel function-based transformation on an output of the second neural network layer, as disclosed by Zhu. One of ordinary skill in the art would have been motivated to make this modification to enable the use of a kernel function on an output of the neural network, which may improve feature extraction, hence reducing overfitting and improving model accuracy (Zhu, Par. [0028-0029], “where K(.,.) 
represents a Gaussian radial kernel function; h2 s and h2 t respectively represent outputs of source domains and target domains in the hidden layer HL2; and a batch sample quantity of source domains and target domains is m/2. In a preferred embodiment of the present invention, further, in step (2), 13 time domain features, 16 time-frequency domain features, and 3 trigonometric function features are extracted.”) Regarding Claim 2, Huang in view of Zhu teaches the method according to claim 1, further comprising: obtaining target data; and obtaining a processing result by processing the target data based on the updated first neural network (Huang, Claim 2, “iteratively training the student network by using the training sample data; and, obtaining the target network, when the iteration number of training reaches a threshold or the objective function satisfies the preset convergence conditions.”, therefore, target data is obtained (training sample data) and the target data is processed by the updated first neural network (target network) through iterative training/repeatedly inputting the training sample data and adjusting/updating the student model accordingly). Regarding Claim 3, Huang in view of Zhu teaches the method according to claim 1, wherein the first neural network layer and the second neural network layer are intermediate layers (Huang, Claim 1, “iteratively training the student network and obtaining a target network, through aligning distributions of features between a first middle layer and a second middle layer corresponding to the same training sample data, so as to transfer knowledge of features of a middle layer of the teacher network to the student network;”, thus, the first neural network layer and the second neural network layer may include middle layers/intermediate layers, located between the input and output layers of the neural network). Regarding Claim 4, Huang in view of Zhu teaches the method according to claim 1, wherein the target loss is obtained based on a mean square error, relative entropy, a Jensen-Shannon (JS) divergence, or a wasserstein distance of the first target output and the second target output (Huang, Par. [0008], “In Equation (1), PNG media_image3.png 38 38 media_image3.png Greyscale (⋅) represents the cross-entropy loss function, ytrue represents the training label, pS=softmax(lS) refers the classification probability of the student network S, lT and lS refers to the output value of network T and S before softmax (known as logits) respectively, τ is a preset value (called temperature), and λ refers to the weight of transfer loss.”, therefore, the target loss is obtained based on a cross-entropy loss, which directly incorporates relative entropy loss – hence, the target loss is obtained “based on” relative entropy). Regarding Claim 5, Huang in view of Zhu teaches the method according to claim 1, wherein the first neural network layer comprises a first weight (Huang, Par. [0078], “Step B: calculating the value of the objective function according to the training sample data in the current iteration and the corresponding features of the first middle layer and the second middle layer thereto, and adjusting the weights of the student network according to the value of the objective function; and, […]”, thus, the first neural network layer of the first neural network (student network) comprises a plurality of weights), the second neural network layer comprises a second weight (Huang, Par. 
[0066], “The teacher network is characterized by high performance and high accuracy; but, compared to the student network, it has some obvious disadvantages such as complex structure, a large number of parameters and weights, and low computation speed.”, therefore, the second neural network layer of the second neural network (teacher network) comprises a plurality of weights), when the to-be-processed data is processed, an input of the first neural network layer includes a first input, and an input of the second neural network layer includes a second input (Huang, Claim 1, “iteratively training the student network and obtaining a target network, through aligning distributions of features between a first middle layer and a second middle layer corresponding to the same training sample data, so as to transfer knowledge of features of a middle layer of the teacher network to the student network […]”, thus, when the to-be-processed data is processed, an input of the first neural network layer includes a first input and the second neural network layer includes a second input, based on the inputted training sample data), the first target output indicates a distance measure between the first weight mapped to a multidimensional feature space and the first input mapped to the multidimensional feature space, and the second target output indicates a distance measure between the second weight mapped to the multidimensional feature space and the second input mapped to the multidimensional feature space (Huang, Claim 9, “a building module, configured to build an objective function of the student network, where the objective function includes a task specific loss function and a distance loss function, and the distance loss function is a function used to measure the distance between the distributions of the features of the first middle layer and the distributions of the features of the second middle layer corresponding to the same training sample data;”, therefore, the first target output (student output/feature map) indicates a distance measure between the first weight and first input within a feature space/distribution and the second target output (teacher output/feature map) indicates a distance measure between the second weight and second input within a feature space/distribution). Regarding Claim 6, Huang in view of Zhu teaches the method according to claim 1, wherein a weight distribution of the first neural network is different from a weight distribution of the second neural network (Huang, Par. [0068], “A training element 32, configured to iteratively train the student network and obtain a target network, through aligning distributions of features of a first middle layer and a second middle layer corresponding to the same training sample data, so as to transfer knowledge of features of a middle layer of the teacher network into the student network; wherein, the features of the first middle layer refer to feature maps output from a first specific network layer of the teacher network after the training sample data are provided to the teacher network, and the features of the second middle layer refer to feature maps output from a second specific network layer of the student network after the training sample data are provided to the student network.”, therefore, the weight distribution of the first neural network (student network) is different from a weight distribution of the second neural network (teacher network) – they are aligned through the knowledge transfer process). 
Regarding Claim 9, Huang in view of Zhu teaches the method according to claim 1, wherein the updated first neural network comprises an updated first neural network layer (Huang, Par. [0053], “Step B: calculating the value of the objective function according to the training sample data in the current iteration and the corresponding features of the first middle layer and the second middle layer thereto, and adjusting the weights of the student network according to the value of the objective function;”, therefore, the updated first neural network (student network) comprises an updated first neural network layer, with weights adjusted according to the value of an objective function), and when the second neural network and the updated first neural network process same data, a difference between an output of the updated first neural network layer and the output of the second neural network layer falls within a preset range (Huang, Claim 2, “building an objective function of the student network, where the objective function includes a task specific loss function and a distance loss function, and the distance loss function is a function used to measure the distance between the distributions of the features of the first middle layer and the distributions of the features of the second middle layer corresponding to the same training sample data; iteratively training the student network by using the training sample data; and, obtaining the target network, when the iteration number of training reaches a threshold or the objective function satisfies the preset convergence conditions.”, therefore, the second neural network and updated first neural network (student network) process the same data and a difference between layer output is calculated (by the objective function) to determine if the output falls within a preset range (objective function satisfying the preset convergence conditions). 
Regarding Claim 10, Huang in view of Zhu teaches the method according to claim 1, wherein obtaining the target loss based on the first target output and the second target output comprises: obtaining a linearly transformed first target output by performing linear transformation on the first target output; obtaining a linearly transformed second target output by performing linear transformation on the second target output (Huang, Claim 6, “wherein, k(⋅,⋅) refers to a preset kernel function, CT refers to the number of channels of FT, CS refers to the number of channels of FS, fT i⋅ represents the vectorized feature map of the i-th channel of FT, fT i′⋅ represents the vectorized feature map of the i′-th channel of FT, fS j⋅ refers to the vectorized feature map of the j-th channel of FS, and fS j′⋅ refers to the vectorized feature map of the j′-th channel of FS.”, thus, the kernel function may also comprise a linear kernel function (See Huang Claim 7) in which the linearly transformed first/second target output are obtained by applying the linear kernel function to the first/second target output); and obtaining the target loss based on the linearly transformed first target output and the linearly transformed second target output (Huang, Claims 6 & 7, “The method according to claim 5, wherein PNG media_image1.png 38 29 media_image1.png Greyscale MMD 2 (FT,FS) in the objective function comprises: […] wherein, k(⋅,⋅) refers to a preset kernel function, CT refers to the number of channels of FT, CS refers to the number of channels of FS, fT i⋅ represents the vectorized feature map of the i-th channel of FT, fT i′⋅ represents the vectorized feature map of the i′-th channel of FT, fS j⋅ refers to the vectorized feature map of the j-th channel of FS, and fS j′⋅ refers to the vectorized feature map of the j′-th channel of FS. […] The method according to claim 6, wherein k(⋅,⋅) is a preset linear kernel function, a preset polynomial kernel function, or a preset Gaussian kernel function.”, therefore, the target loss (shown by objective function in Claims 5 and 6) may be obtained based on the linearly transformed first/second target outputs). Regarding Claim 11, Huang in view of Zhu teaches the method according to claim 1, wherein a kernel function comprises at least one of: a radial basis kernel function, a Laplacian kernel function, a power index kernel function, an analysis of variance (ANOVA) kernel function, a rational quadratic kernel function, a multiquadric kernel function, an inverse multiquadric kernel function, a sigmoid kernel function, a polynomial kernel function, and a linear kernel function (Huang, Claim 7, “The method according to claim 6, wherein k(⋅,⋅) is a preset linear kernel function, a preset polynomial kernel function, or a preset Gaussian kernel function.”, thus, the kernel function may comprise at least one of a linear/polynomial kernel function. Further, Examiner notes that Zhu (disclosed for teaching part of the kernel function-based transformations in the rejection of Claim 1) also discloses the use of a radial basis kernel function in Zhu Claim 9). The reasons of obviousness have been noted in the rejection of Claim 1 above and applicable herein. 
Regarding Claim 14, Huang teaches a data processing method (Huang, Abstract, “The method comprises: selecting, by a training device, a teacher network performing the same functions of a student network; and iteratively training the student network and obtaining a target network, through aligning distributions of features between a first middle layer and a second middle layer corresponding to the same training sample data, so as to transfer knowledge of features of a middle layer of the teacher network to the student network.”, thus, a neural network distillation/knowledge transfer method is disclosed for data processing), comprising: obtaining to-be-processed data (Huang, Par. [0050], “As for step 102B in embodiments of the present application, the training sample data used in each iteration of training is different.”, thus, to-be-processed data (training sample data) is obtained), and a first neural network having a first neural network layer and a second neural network layer (Huang, Par. [0069], “In some embodiments of the application, the first specific network layer is a middle network layer or the last network layer of the teacher network; and/or, the second specific network layer is a middle network layer or the last network layer of the student network.”, thus, a first neural network having a first neural network layer and a second neural network having a second neural network layer are obtained), wherein the first neural network is obtained through knowledge distillation by using a second neural network as a teacher model (Huang, Par. [0040], “Step 102: iteratively training the student network and obtaining a target network, through aligning distributions of features between a first middle layer and a second middle layer corresponding to the same training sample data, so as to transfer knowledge of features of a middle layer of the teacher network to the student network.”, thus, the first neural network (student network) is obtained through knowledge distillation/transfer by using a second neural network as a teacher model) ; and obtaining a processing result by processing the to-be-processed data by using the first neural network output (Huang, Claim 1, “[…] wherein, the features of the first middle layer refer to feature maps output from a first specific network layer of the teacher network after the training sample data are provided to the teacher network, and the features of the second middle layer refer to feature maps output from a second specific network layer of the student network after the training sample data are provided to the student network.”, thus, the to-be-processed data (training sample data) is processed by using the first neural network and the second neural network, to obtain a first target output (student output/feature map) and second target output (teacher output/feature map)), wherein when the to-be-processed data is processed, a result of performing kernel function-based transformation on an output of the first neural network layer includes a first target output (Huang, Par. 
[0060-0061], “In some embodiments, the PNG media_image1.png 38 29 media_image1.png Greyscale MMD 2 (FT, FS) in the objective function of the embodiments could be formulated in equation (6) below: […] In equation (6), k(⋅,⋅) refers to a preset kernel function, CT refers to the number of channels of FT, CS refers to the number of channels of FS, fT i⋅ represents the vectorized feature map of the i-th channel of FT, fT i′⋅ represents the vectorized feature map of the i′-th channel of FT, fS j⋅ refers to the vectorized feature map of the j-th channel of FS, and fS J′⋅ refers to the vectorized feature map of the j′-th channel of FS.”, thus, the first target output (student output/feature map) is obtained by performing kernel function-based transformations on an output of the first neural network layer (evaluation of the objective function by the student network – See Par. [0056])), when the second neural network processes the to-be-processed data, a result of performing kernel function-based transformation on an output of the second neural network layer includes a second target output (See introduction of Zhu reference below for teaching of a second target output obtained by performing a kernel function-based transformation on an output of the second neural network layer), and a difference between the first target output and the second target output falls within a preset range (Huang, Par. [0057], “In equation (5), PNG media_image2.png 38 38 media_image2.png Greyscale (ytrue,pS) refers to the cross-entropy loss function, PNG media_image1.png 38 29 media_image1.png Greyscale MMD 2 (FT,FS) refers to the distance loss function, λ refers to the weight of distance loss function, FT refers to the feature map (i.e. the features of the first middle layer) output from the first specific network layer of the teacher network given the training sample data, FS refers to the feature map (i.e. the features of the second middle layer) output from the second specific network layer of the student network given the training sample data, ytrue refers to the ground-truth labels of the training sample data, and pS refers to the output classification probability of the student network.”, therefore, a cross-entropy loss (in preceding equation (5) within Par. [0056-0057]) is obtained based on the first target output (student output/feature map) and the second target output (teacher output/feature map). Further, it is mentioned in Par. [0049] that the objective function (including the difference) must fall within a preset range/satisfies the preset convergence conditions). While Huang discloses performing kernel function-based transformations on an output of the second neural network layer (of the teacher network – See Par. [0060-0061] which disclose the kernel function applied to the teacher output/feature map but this objective function is seemingly only applied to the student/first neural network), Huang does not explicitly disclose when the second neural network processes the to-be-processed data, a result of performing kernel function-based transformation on an output of the second neural network layer includes a second target output However, Zhu teaches when the second neural network processes the to-be-processed data, a result of performing kernel function-based transformation on an output of the second neural network layer includes a second target output (Zhu, Par. 
[0027], “In a preferred embodiment of the present invention, further, the proposed optimized target in step (4) includes a maximum average loss of source domains and target domains, and the maximum average loss of source domains and target domains is: […] where K(.,.) represents a Gaussian radial kernel function; h2 s and h2 t respectively represent outputs of source domains and target domains in the hidden layer HL2; and a batch sample quantity of source domains and target domains is m/2”, therefore, the second target output (target domain model output) is obtained by performing a kernel function-based transformation (gaussian radial kernel function) on an output of the second neural network layer based on processing data). It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the data processing method, as disclosed by Huang to include when the second neural network processes the to-be-processed data, a result of performing kernel function-based transformation on an output of the second neural network layer includes a second target output, as disclosed by Zhu. One of ordinary skill in the art would have been motivated to make this modification to enable the use of a kernel function on an output of the neural network, which may improve feature extraction, hence reducing overfitting and improving model accuracy (Zhu, Par. [0028-0029], “where K(.,.) represents a Gaussian radial kernel function; h2 s and h2 t respectively represent outputs of source domains and target domains in the hidden layer HL2; and a batch sample quantity of source domains and target domains is m/2. In a preferred embodiment of the present invention, further, in step (2), 13 time domain features, 16 time-frequency domain features, and 3 trigonometric function features are extracted.”) Claim 15 recites substantially the same limitations as Claim 3, therefore it is rejected under the same rationale. Claim 16 recites substantially the same limitations as Claim 5, therefore it is rejected under the same rationale. Claim 17 recites substantially the same limitations as Claim 6, therefore it is rejected under the same rationale. 6. Claims 7 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Huang et al. (hereinafter Huang) (US PG-PUB 20180365564), in view of Zhu et al. (hereinafter Zhu) (US PG-PUB 20210374506), further in view of Amizadeh et al. (hereinafter Amizadeh) (US PG-PUB 20190347548). Regarding Claim 7, Huang in view of Zhu teaches the method according to claim 6. Huang in view of Zhu does not explicitly disclose wherein the weight distribution of the first neural network is Laplacian distribution, and the weight distribution of the second neural network is Gaussian distribution. However, Amizadeh teaches wherein the weight distribution of the first neural network is Laplacian distribution, and the weight distribution of the second neural network is Gaussian distribution (Amizadeh, Par. [0042], “1. Propose a plurality of deep neural network (DNN) architectures. The DNN architectures can be sampled from a space of pre-defined architectures or built using general building-blocks. 2. Initialize the weights of the architectures using a Glorot Normal initialization. 
Independent Normal distributions and Laplace distributions may be used in alternative embodiments.”, thus, the weight distribution of the first/second neural network (of a plurality of neural network architectures) may comprise a Laplacian distribution and a normal/gaussian distribution). It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the neural network distillation method, as disclosed by Huang in view of Zhu to include wherein the weight distribution of the first neural network is Laplacian distribution, and the weight distribution of the second neural network is Gaussian distribution, as disclosed by Amizadeh. One of ordinary skill in the art would have been motivated to make this modification to enable learning weights of different distributions to improve model robustness and reduce error rate when handling unseen or noisy datasets (Amizadeh, Par. [0049-0050], “In a general version of some aspects, the procedure of Algorithm 1 is repeated until a predetermined desired error rate is reached or a predetermined total computation cost is reached. Algorithm 1 1. Propose a plurality of deep neural network (DNN) architectures. The DNN architectures can be sampled from a space of pre-defined architectures or built using general building-blocks. 2. Initialize the weights of the architectures using a Glorot Normal initialization. Independent Normal distributions and Laplace distributions may be used in alternative embodiments. Algorithm 1 is summarized in FIG. 3. FIG. 3 illustrates a flow chart 300 of an example method for reducing an error rate, in accordance with some embodiments.”). Claim 18 recites substantially the same limitations as Claim 7, therefore it is rejected under the same rationale. 7. Claims 12-13 are rejected under 35 U.S.C. 103 as being unpatentable over Huang et al. (hereinafter Huang) (US PG-PUB 20180365564), in view of Fukuda et al. (hereinafter Fukuda) (US PG-PUB 20200034702). Regarding Claim 12, Huang teaches a neural network distillation method applied to a terminal device (Huang, Abstract, “A method and device for training a neural network are disclosed. The method comprises: selecting, by a training device, a teacher network performing the same functions of a student network; and iteratively training the student network and obtaining a target network, through aligning distributions of features between a first middle layer and a second middle layer corresponding to the same training sample data, so as to transfer knowledge of features of a middle layer of the teacher network to the student network.”, therefore a neural network distillation method applied to a terminal device (which may comprise a personal computer, supported by Applicant’s specification Par. [00179]), where the device includes a processor and memory (supported by Huang Par. 
[0082]) is disclosed), the method comprising: obtaining a first neural network and a second neural network (Huang, Claim 1, “selecting, by a training device, a teacher network performing the same functions of a student network”, thus, a first neural network (student network) and a second neural network (teacher network) are obtained); performing knowledge distillation on the first neural network by using the second neural network as a teacher model and the first neural network as a student model to obtain an updated first neural network (Huang, Claim 1, “iteratively training the student network and obtaining a target network, through aligning distributions of features between a first middle layer and a second middle layer corresponding to the same training sample data, so as to transfer knowledge of features of a middle layer of the teacher network to the student network;”, therefore, knowledge distillation is performed on the first neural network by using the second neural network as a teacher model and the first neural network as a student model to obtain an updated first neural network (target network)); Huang does not explicitly disclose: training the second neural network to obtain an updated second neural network; and performing knowledge distillation on the updated first neural network by using the updated second neural network as the teacher model and the updated first neural network as the student model to obtain a third neural network. However, Fukuda teaches: training the second neural network to obtain an updated second neural network (Fukuda, Par. [0038], “At block 310, a teacher training section, such as the teacher training section 120, may train a plurality of teacher neural networks. The training section may train a plurality of types of teacher neural networks. For example, the training section may train teacher neural networks that have different structures, different layers, different nodes, etc. In an embodiment, the plurality of teacher neural networks may include two or more of: Convolutional Neural Networks (CNN), Visual Geometry Group (VGG) networks, and Long short-term memory (LSTM) networks.”, thus, the second neural network (teacher network) may be trained to obtain an updated second neural network (teacher network)); and performing knowledge distillation on the updated first neural network by using the updated second neural network as the teacher model and the updated first neural network as the student model to obtain a third neural network (Fukuda, Par. [0057], “In a specific embodiment, the student training section may repeat iterations at block 340, wherein each iteration includes: inputting the teacher input data into the student neural network, comparing an output data (e.g., soft label output) of the student neural network with the soft label output from the teacher neural network, and adjusting a plurality of weights in the student neural network based on the comparison. The student training section may perform the adjusting, at block 340, by known computer-implemented methods such as back propagation.”, thus, as shown by Figure 3, knowledge distillation is performed on the updated first neural network by using the updated second neural network as the teacher model and the updated first neural network as the student model to obtain a third neural network (updated student network)). 
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the neural network distillation method, as disclosed by Huang to include training the second neural network to obtain an updated second neural network; and performing knowledge distillation on the updated first neural network by using the updated second neural network as the teacher model and the updated first neural network as the student model to obtain a third neural network, as disclosed by Fukuda. One of ordinary skill in the art would have been motivated to make this modification to enable training of the second neural network/teacher network, which may improve knowledge distillation by optimizing the second neural network/teacher network which correspondingly also optimizes the first neural network/student network (Fukuda, Par. [0040], “In a specific embodiment, during the training of a teacher neural network at block 310, the teacher training section may repeat iterations, wherein each iteration includes: inputting the input data into the teacher neural network, comparing an output data (e.g., a soft label output) of the teacher neural network with the corresponding correct training data, and adjusting a plurality of weights between nodes in the teacher neural network based on the comparison. The teacher training section may perform the adjusting by known computer-implemented methods, such as back propagation.”). Regarding Claim 13, Huang in view of Fukuda teaches the method according to claim 12, wherein training the second neural network to obtain the updated second neural network comprises: iteratively training the second neural network a plurality of times to obtain the updated second neural network (Fukuda, Par. [0040], “In a specific embodiment, during the training of a teacher neural network at block 310, the teacher training section may repeat iterations, wherein each iteration includes: inputting the input data into the teacher neural network, comparing an output data (e.g., a soft label output) of the teacher neural network with the corresponding correct training data, and adjusting a plurality of weights between nodes in the teacher neural network based on the comparison. The teacher training section may perform the adjusting by known computer-implemented methods, such as back propagation.”, therefore, the second neural network (teacher network) is iteratively trained a plurality of times to obtain the updated second neural network). The reasons of obviousness have been noted in the rejection of claim 12 above and applicable herein. Allowable Subject Matter 8. No prior art rejection is made for Claims 8 and 19. Claims 8 and 19 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims. 9. Examiner notes that Applicant’s instant Claims 8 and 19 recite the use of an “adder neural network”, where the “adder neural network” is described in instant specification Par. [00130-00131] as “The adder neural network is a type of neural network that almost does not include multiplication. Different from a convolutional neural network, the adder neural network uses an L1 distance to measure a correlation between a feature and a filter in the neural network. 
Because the L1 distance includes only addition and subtraction, a large quantity of multiplication operations in the neural network may be replaced with addition and subtraction. This greatly reduces calculation costs of the neural network. In the ANN, a metric function with only addition, that is, the L1 distance, is usually used to replace convolution calculation in the convolutional neural network.” Further, the closest art that discloses such an “adder neural network” is Chen et al. (“AdderNet: Do We Really Need Multiplications in Deep Learning?”) disclosed by Applicant’s Information Disclosure Statement submitted on 08/27/2024 – however, this art is not eligible prior art, as this NPL shares inventors with the instant application. As such, the prior art of record does not explicitly disclose the specific limitations of Claims 8 and 19, including “The method according to claim 1, wherein the first neural network is an adder neural network (ANN), and the second neural network is a convolutional neural network (CNN)” in combination with the remaining limitations of the Independent claims. Conclusion 10. Any inquiry concerning this communication or earlier communications from the examiner should be directed to Devika S Maharaj whose telephone number is (571)272-0829. The examiner can normally be reached Monday - Thursday 8:30am - 5:30pm. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Alexey Shmatov can be reached at (571)270-3428. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /DEVIKA S MAHARAJ/Examiner, Art Unit 2123
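The §103 rejection of independent claims 1 and 14 rests on Huang's kernel-based distillation objective, equations (5) and (6) in the passages quoted above; the equations themselves did not survive extraction (only "media_image" placeholders remain). The following is a reconstruction from the quantities the examiner quotes: the cross-entropy task loss, the weight λ, the preset kernel k(·,·), the channel counts C_T and C_S, and the vectorized feature maps. It is the standard MMD² expansion consistent with that description; Huang's exact formulation may differ.

```latex
% Objective as described in the OA (Huang's eq. (5)): cross-entropy task loss
% plus a lambda-weighted distance loss between teacher and student features.
\mathcal{L} = \mathcal{H}(y_{\mathrm{true}}, p_S) + \lambda\,\mathrm{MMD}^2(F_T, F_S)

% Kernel-based distance loss (eq. (6)), written as the standard MMD^2 expansion
% over the C_T teacher channels and C_S student channels, with k(\cdot,\cdot)
% a preset linear, polynomial, or Gaussian kernel (Huang, claim 7).
\mathrm{MMD}^2(F_T, F_S)
  = \frac{1}{C_T^2}\sum_{i=1}^{C_T}\sum_{i'=1}^{C_T} k\!\left(f_T^{\,i}, f_T^{\,i'}\right)
  + \frac{1}{C_S^2}\sum_{j=1}^{C_S}\sum_{j'=1}^{C_S} k\!\left(f_S^{\,j}, f_S^{\,j'}\right)
  - \frac{2}{C_T C_S}\sum_{i=1}^{C_T}\sum_{j=1}^{C_S} k\!\left(f_T^{\,i}, f_S^{\,j}\right)
```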
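For claims 8 and 19, which the examiner indicates as allowable, the distinguishing feature is the "adder neural network" described in the quoted specification passage: the correlation between a feature and a filter is measured with an L1 distance, so multiplications are replaced by additions and subtractions. A minimal sketch of that contrast (illustrative only; the function names are hypothetical and this is not code from the application):

```python
import numpy as np

def adder_response(patch: np.ndarray, filt: np.ndarray) -> float:
    """Adder-network style correlation: negative L1 distance between the input
    patch and the filter, using only additions, subtractions, and absolute
    values (a perfect match gives 0; poorer matches are more negative)."""
    return -float(np.sum(np.abs(patch - filt)))

def conv_response(patch: np.ndarray, filt: np.ndarray) -> float:
    """Conventional convolution-style correlation: multiply-accumulate."""
    return float(np.sum(patch * filt))

patch = np.array([[1.0, 2.0], [3.0, 4.0]])
filt = np.array([[1.0, 2.0], [3.0, 4.0]])
print(adder_response(patch, filt))  # 0.0   (exact match under the L1 measure)
print(conv_response(patch, filt))   # 30.0  (multiply-accumulate score)
```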

Prosecution Timeline

Dec 28, 2022: Application Filed
Jan 29, 2026: Non-Final Rejection under §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12585948: NEURAL PROCESSING DEVICE AND METHOD FOR PRUNING THEREOF (granted Mar 24, 2026; 2y 5m to grant)
Patent 12579426: Training a Neural Network having Sparsely-Activated Sub-Networks using Regularization (granted Mar 17, 2026; 2y 5m to grant)
Patent 12572795: ANSWER SPAN CORRECTION (granted Mar 10, 2026; 2y 5m to grant)
Patent 12561577: AUTOMATIC FILTER SELECTION IN DECISION TREE FOR MACHINE LEARNING CORE (granted Feb 24, 2026; 2y 5m to grant)
Patent 12554969: METHOD AND SYSTEM FOR THE AUTOMATIC SEGMENTATION OF WHITE MATTER HYPERINTENSITIES IN MAGNETIC RESONANCE BRAIN IMAGES (granted Feb 17, 2026; 2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 55%
With Interview (+7.7%): 63%
Median Time to Grant: 5y 0m
PTA Risk: Low
Based on 78 resolved cases by this examiner. Grant probability derived from career allow rate.
