Prosecution Insights
Last updated: May 29, 2026
Application No. 17/801,455

Neighborhood Distillation of Deep Neural Networks

Non-Final OA §103
Filed: Aug 22, 2022
Priority: Jun 23, 2020 (nonprovisional of PCTUS2020039098 +1 more)
Examiner: BASOM, BLAINE T
Art Unit: 2141
Tech Center: 2100 (Computer Architecture & Software)
Assignee: Google LLC
OA Round: 2 (Non-Final)
Grant Probability: 43% (Moderate)
OA Rounds: 2-3
Est. Remaining: 9m
Grant Probability with Interview: 66%

Examiner Intelligence

Career Allowance Rate: 43% (140 granted / 326 resolved; -12.1% vs TC avg)
Interview Lift: +22.7% (strong; resolved cases with interview)
Avg Prosecution: 4y 6m (23 currently pending)
Total Applications: 364 (across all art units)

Statute-Specific Performance

§101: 1.1% (-38.9% vs TC avg)
§103: 85.8% (+45.8% vs TC avg)
§102: 1.0% (-39.0% vs TC avg)
§112: 2.6% (-37.4% vs TC avg)
Based on career data from 326 resolved cases.

Office Action

§103
DETAILED ACTION

This Office Action is responsive to the Applicant’s submission, filed on September 25, 2025, amending claims 1 and 11.

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 1, 4, 6, 11, 14 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over the article entitled, “Progressive Blockwise Knowledge Distillation for Neural Network Acceleration” by Wang et al. (“Wang”), over the article entitled “Blockwisely Supervised Neural Architecture Search with Knowledge Distillation” by Li et al. (“Li”), and also over Applicant’s Admitted Prior Art (“AAPA”).
Regarding claim 1, Wang describes “a progressive blockwise learning scheme for teacher-student model distillation at the subnetwork block level,” wherein “[t]he proposed scheme is able to distill the knowledge of the entire teacher network by locally extracting the knowledge of each block in terms of progressive blockwise function approximation.” (Abstract). Like claimed, Wang particularly teaches:

dividing a first neural network into a plurality of neighborhoods (Wang discloses that the progressive blockwise learning scheme “converts a sequence of teacher subnetwork blocks into a sequence of student subnetwork blocks after progressive blockwise optimization.” (Section I “Introduction”). In particular, Wang discloses that a teacher neural network can be divided into a number of subnetwork blocks:

A neural network is mainly comprised of the convolution layers, the pooling layers, and the fully connected layers. The subnetwork between two adjacent pooling layers is defined as a subnetwork block. Let a complicated network $T$ be the teacher network, which is composed of $N$ subnetwork blocks:

$$T = c \circ t_N \circ t_{N-1} \circ \cdots \circ t_1 \tag{1}$$

Where $t_i$ ($i \in \{1, 2, \ldots, N\}$) is the mapping function of the $i$-th block in the sequence and $c$ is the mapping function of the classifier. To simplify the representation of the network, we shorten it as:

$$\prod_{i=1}^{N} \circ\, t_i = t_N \circ t_{N-1} \circ \cdots \circ t_1 \tag{2}$$

Therefore, $T$ is rewritten as:

$$T = c \circ \prod_{i=1}^{N} \circ\, t_i \tag{3}$$

The parameters of the teacher network are denoted as:

$$W_T = \{W_c, W_{t_N}, W_{t_{N-1}}, \ldots, W_{t_1}\} \tag{4}$$

Where $W_c$ and $W_{t_i}$ ($i \in \{1, 2, \ldots, N\}$) are the parameters of $c$ and $t_i$. (Section 2.1 “Problem Definition”; emphasis added).

In this paper, we utilize the widely used network VGG-16 [Simonyan and Zisserman, 2015] as the teacher network sample. We divide the VGG-16 network into five blocks using the pooling layers as the boundaries.
For each block, we use our block design criterion described in Section 2.3 to obtain the student subnetwork block. (Section 3.2 “Implementation Details”; emphasis added).

The teacher neural network described by Wang is considered a first neural network like claimed, and each subnetwork block of the divided teacher neural network is considered a “neighborhood” like claimed.);

for each given neighborhood of the plurality of neighborhoods: generating, by one or more processors of a processing system, a candidate student model (Wang discloses that a student neural network is designed to include a number of subnetwork blocks, each corresponding to a respective subnetwork block of the teacher neural network:

Our objective is to design a student network with high computational efficiency and low memory usage, and to learn the corresponding optimal parameters. The student network composed of $N$ student subnetwork blocks can be written as:

$$S = c \circ \prod_{i=1}^{N} \circ\, s_i \tag{5}$$

where $s_i$ denotes a student subnetwork block. The corresponding optimal parameters of the student network are denoted as:

$$W_S = \{W_c, W_{s_N}, W_{s_{N-1}}, \ldots, W_{s_1}\} \tag{6}$$

In essence, the problem is to design $N$ student subnetwork blocks sequence $S$ and optimize the corresponding parameters $W_S$ using the prior knowledge of $N$ teacher subnetwork blocks sequence $T$:

$$\left(c \circ \prod_{i=1}^{N} \circ\, t_i\right)(x; W_T) \xrightarrow[\text{optimize } W_S]{\text{design } S} \left(c \circ \prod_{i=1}^{N} \circ\, s_i\right)(x; W_S) \tag{7}$$

(Section 2.1 “Problem Definition”).

We propose a blockwise design scheme. Based on our blockwise design scheme, the student subnetwork block we design should not change the input/output size of the next/previous block and maintain the receptive field at the same time. In addition, this student subnetwork block is based on the teacher subnetwork block but contains fewer parameters and FLOPs (floating point operations). (Section 2.3 “Student Subnetwork Block Design”).
The respective student subnetwork block generated for each teacher subnetwork block, i.e. neighborhood, is considered a “candidate student model” like claimed. Wang suggests that this task is implemented by one or more processors of a processing system, as Wang discloses, “[w]e implement our architecture using Caffe [Jia et al., 2014] and use an NVIDIA TITAN X GPU to train the network.” Section 3.2 “Implementation Details.”);

receiving, by the one or more processors, a first output from the given neighborhood, the first output having been produced by the given neighborhood based on an input (Wang discloses that the plurality of subnetwork blocks of the student model are learned in a plurality of stages, wherein each stage optimizes a respective student subnetwork block while keeping the other blocks fixed:

To reduce the optimization difficulty described in Eq. (7), we propose a progressive blockwise learning scheme. As shown in Fig. 1, our blockwise learning scheme learns the sequence of student subnetwork blocks by $N$ block learning stages, and only optimizes one block at each learning stage while keeping the other blocks fixed. To better introduce our blockwise scheme, we use the auxiliary function $A_k$ ($k \in \{0, 1, \ldots, N\}$) to represent our intermediate network at the k-th block learning stage:

$$A_k = c \circ \left(\prod_{i=k+1}^{N} \circ\, t_i\right) \circ \left(\prod_{j=1}^{k} \circ\, s_j\right) \tag{8}$$

Where $s_j$ is the optimized student network block and $t_i$ is the teacher network block. The parameters of $A_k$ are denoted as below:

$$W_{A_k} = \{W_c, W_{t_N}, \ldots, W_{t_{k+1}}, W_{s_k}, W_{s_{k-1}}, \ldots, W_{s_1}\} \tag{9}$$

As can be noted from the description of $A_k$, $A_0$ is the teacher network $T$ and $A_N$ is the optimized network $S$. Hence, the problem defined in Eq. (7) can be solved as below:

$$A_0 \xrightarrow{\text{stage } 1} A_1 \xrightarrow{\text{stage } 2} A_2 \rightarrow \cdots A_{k-1} \xrightarrow{\text{stage } k} A_k \rightarrow \cdots A_N \tag{10}$$

(Section 2.2 “Progressive Blockwise Learning”; emphasis added).
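The intermediate network $A_k$ of Eq. (8) can be illustrated with a minimal sketch. It assumes that each block is an ordinary Python callable on a scalar; the names `compose` and `intermediate_network` and the toy blocks are hypothetical and are not taken from Wang or from the Office Action.

```python
from functools import reduce
from typing import Callable, List

Block = Callable[[float], float]


def compose(blocks: List[Block]) -> Block:
    """Return a function that applies blocks[0] first and blocks[-1] last."""
    return lambda x: reduce(lambda acc, block: block(acc), blocks, x)


def intermediate_network(teacher: List[Block], student: List[Block],
                         classifier: Block, k: int) -> Block:
    """Build A_k as in Eq. (8): student blocks 1..k, teacher blocks k+1..N, then c.

    k = 0 yields the full teacher network and k = N the full student network.
    """
    if len(teacher) != len(student):
        raise ValueError("teacher and student need the same number of blocks")
    if not 0 <= k <= len(teacher):
        raise ValueError("k must lie in 0..N")
    return compose(student[:k] + teacher[k:] + [classifier])


# Toy scalar blocks (illustrative assumptions only).
teacher_blocks = [lambda x: 2 * x, lambda x: x + 3, lambda x: 5 * x]
student_blocks = [lambda x: 2 * x + 0.5, lambda x: x + 3.5, lambda x: 5 * x - 1]
identity_classifier = lambda x: x

a_1 = intermediate_network(teacher_blocks, student_blocks, identity_classifier, 1)
```

Here `a_1` runs the first student block followed by the two remaining teacher blocks, which is the network optimized at block learning stage 1 in the quoted scheme.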
Wang particularly discloses that, at each learning stage, the parameters of a student subnetwork block are updated by using a loss function that is based, in part, on the output of the student subnetwork block compared with the output of its corresponding block in the teacher network, when both the student and teacher networks are given the same input:

More specifically, we use the teacher-student learning strategy at each block learning stage. At the block learning stage 1, we consider $A_0$/$A_1$ as the teacher/student network, and then learn $A_1$ from $A_0$. Similarly, we learn $A_2$ from $A_1$ at the block learning stage 2. By analogy, using such a progressive way, we will eventually solve the problem in Eq. (10). In order to solve the teacher-student network optimization problem at each block learning stage, we use two terms to compose the objective loss function. Taking the k-th block learning stage for example, the first term of the loss compares the output of the block $s_k$ with its corresponding block in the teacher network $t_k$, and is formulated as:

$$L_{local}^{k}(I; W_{s_k}) = \frac{1}{2} \left\| \left(t_k \circ \prod_{i=1}^{k-1} \circ\, s_i\right)\left(I; W_{t_k} \cup \{W_{s_i}\}_{i=1}^{k-1}\right) - \left(\prod_{i=1}^{k} \circ\, s_i\right)\left(I; \{W_{s_i}\}_{i=1}^{k-1}\right) \right\|_F^2 \tag{11}$$

where $I$ is the input of the network (e.g., image) and $\|\cdot\|_F$ denotes the Frobenius norm. The second classification loss term $L_{cls}^{k}$ is to make the output of the student network approximate the ground truth, which is defined as:

$$L_{cls}^{k}(I, y; W_{s_k}) = \mathrm{softmax}(A_k(I; W_{A_k}), y) \tag{12}$$

where $y$ is the ground truth for $I$ and softmax(.) means the softmax loss between the network’s output and $y$ as described in [Simonyan and Zisserman, 2015]. Then the objective loss function for one training sample $(I, y)$ at the k-th block learning stage is as below:

$$L^{k}(I, y; W_{s_k}) = \lambda_{local} L_{local}^{k}(I; W_{s_k}) + L_{cls}^{k}(I, y; W_{s_k}) \tag{13}$$

where $\lambda_{local}$ is a hyper-parameter to balance these two terms of the loss function.
Therefore, our objective loss function for the training data $\{(I_1, y_1), \ldots, (I_M, y_M)\}$ is:

$$L^{k}(W_{s_k}) = \frac{1}{M} \sum_{m=1}^{M} L^{k}(I_m, y_m; W_{s_k}) \tag{14}$$

By optimizing this loss function, the student subnetwork block can be trained under the ground truths and knowledge of the teacher subnetwork block. The details of our progressive blockwise learning scheme are shown in Alg. 1. (Section 2.2 “Progressive Blockwise Learning”).

[Image media_image1.png: excerpt of Wang’s Algorithm 1] (Emphasis added).

Using such a loss function would thus entail receiving a first output from the corresponding teacher block, i.e. neighborhood, the first output having been produced by the block based on an input. Wang suggests that this task is implemented by one or more processors of a processing system, as Wang discloses, “[w]e implement our architecture using Caffe [Jia et al., 2014] and use an NVIDIA TITAN X GPU to train the network.” Section 3.2 “Implementation Details.”);

receiving, by the one or more processors, a second output, the second output having been produced by the candidate student model based on the input (As noted above, Wang discloses that the plurality of subnetwork blocks of the student model are learned in a plurality of stages; at each learning stage, the parameters of a student subnetwork block are updated by using a loss function that is based, in part, on the output of the student subnetwork block compared with the output of its corresponding block in the teacher network, when both the student and teacher networks are given the same input. Using such a loss function would thus entail receiving an output produced by the student subnetwork block, i.e. the candidate student model, based on the input.
Wang suggests that this task is implemented by one or more processors of a processing system, as Wang discloses, “[w]e implement our architecture using Caffe [Jia et al., 2014] and use an NVIDIA TITAN X GPU to train the network.” Section 3.2 “Implementation Details.”); comparing, by the one or more processors, the first output to the second output to generate a first training gradient corresponding to the candidate student model (As noted above, Wang discloses that at each learning stage, the parameters of a given student subnetwork block are updated by using a loss function that is based, in part, on the output of the student subnetwork block compared with the output of its corresponding block in the teacher network. A first training gradient, i.e. an amount to update the parameters of the student subnetwork block, is thus understandably generated based on this loss function, which is indicative of a comparison of the first output to the second output. Wang suggests that this task is implemented by one or more processors of a processing system, as Wang discloses, “[w]e implement our architecture using Caffe [Jia et al., 2014] and use an NVIDIA TITAN X GPU to train the network.” Section 3.2 “Implementation Details.”); modifying, by the one or more processors, one or more parameters of the candidate student model based at least in part on the first training gradient corresponding to the candidate student model (As noted above, Wang discloses that at each learning stage, the parameters of a given student subnetwork block are updated by using a loss function that is based, in part, on the output of the student subnetwork block compared with the output of its corresponding block in the teacher network. The parameters of the student subnetwork block are thus modified based at least in part on the first training gradient, i.e. an amount based on the loss function. 
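As an illustration of the comparison described above, the two-term stage loss of Eqs. (11) to (13) might be sketched as follows. This is a sketch under stated assumptions, not Wang's Caffe implementation: block outputs are plain nested lists, the "softmax loss" of Eq. (12) is assumed to be softmax cross-entropy, and every function name is hypothetical.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def _check_same_shape(a: Matrix, b: Matrix) -> None:
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("teacher and student outputs must have the same shape")


def local_loss(teacher_out: Matrix, student_out: Matrix) -> float:
    """Half the squared Frobenius norm of (teacher_out - student_out), cf. Eq. (11)."""
    _check_same_shape(teacher_out, student_out)
    return 0.5 * sum((t - s) ** 2
                     for row_t, row_s in zip(teacher_out, student_out)
                     for t, s in zip(row_t, row_s))


def local_loss_grad(teacher_out: Matrix, student_out: Matrix) -> Matrix:
    """Gradient of local_loss with respect to the student block's output."""
    _check_same_shape(teacher_out, student_out)
    return [[s - t for t, s in zip(row_t, row_s)]
            for row_t, row_s in zip(teacher_out, student_out)]


def softmax_loss(logits: Sequence[float], label: int) -> float:
    """Softmax cross-entropy (assumed reading of the 'softmax loss' in Eq. (12))."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[label]


def stage_loss(teacher_out: Matrix, student_out: Matrix,
               logits: Sequence[float], label: int, lam_local: float) -> float:
    """Weighted sum of the local term and the classification term, cf. Eq. (13)."""
    return (lam_local * local_loss(teacher_out, student_out)
            + softmax_loss(logits, label))
```

In this reading, `teacher_out` plays the role of the claimed first output, `student_out` the second output, and `local_loss_grad` the signal from which a training gradient for the student block's parameters would be back-propagated.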
Wang suggests that this task is implemented by one or more processors of a processing system, as Wang discloses, “[w]e implement our architecture using Caffe [Jia et al., 2014] and use an NVIDIA TITAN X GPU to train the network.” Section 3.2 “Implementation Details.”); and identifying, by the one or more processors, a selected model, the selected model being a copy of the candidate student model or a copy of the given neighborhood (As noted above, Wang discloses that the plurality of subnetwork blocks of the student model are learned in a plurality of stages, wherein each stage optimizes a respective student subnetwork block, i.e. a respective candidate student model. At the end of the training, a student model comprising all the learned subnetwork blocks of the student is returned. See e.g. “Algorithm 1,” which is excerpted above. Returning the student model would entail identifying a selected model, i.e. a learned student subnetwork block, for each block of the student model, the selected model being a copy of a candidate student model. Wang suggests that this task is implemented by one or more processors of a processing system, as Wang discloses, “[w]e implement our architecture using Caffe [Jia et al., 2014] and use an NVIDIA TITAN X GPU to train the network.” Section 3.2 “Implementation Details.”); and combining, by the one or more processors, the selected model corresponding to each given neighborhood of the plurality of neighborhoods to form a second neural network (As noted above, Wang discloses that the student neural network is designed to include a number of subnetwork blocks, each corresponding to a respective subnetwork block of the teacher neural network. As further noted above, Wang discloses that the plurality of subnetwork blocks of the student model are learned in a plurality of stages, wherein at the end of the training, a student model comprising all the learned subnetwork blocks of the student is returned. 
The student model is a combination of the selected models, i.e. learned student subnetwork blocks, each corresponding to a given neighborhood of the plurality of neighborhoods, i.e. corresponding to a subnetwork block of the plurality of subnetwork blocks of the teacher model. The returned student model is considered a second neural network like claimed. Wang suggests that this task is implemented by one or more processors of a processing system, as Wang discloses, “[w]e implement our architecture using Caffe [Jia et al., 2014] and use an NVIDIA TITAN X GPU to train the network.” Section 3.2 “Implementation Details.”). Accordingly, Wang teaches a method similar to that of claim 1, which is for using a first neural network to generate a second neural network. Wang, however, does not disclose that a plurality of candidate student models are generated for each given neighborhood, wherein each of the plurality of candidate student models is trained (i.e. by producing a second output based on the input, by comparing the second output to the first output to generate a first gradient, and by modifying parameters of the model based on the gradient), and wherein the selected model for each given neighborhood is a copy of one of the plurality of candidate student models or a copy of the given neighborhood, as is required by claim 1. Moreover, Wang does not explicitly disclose that the first neural network has a first architecture configured to run on a first hardware type, and that each candidate student model of the plurality of candidate student models has a second architecture configured to run on a second hardware type different from the first hardware type, as is further required by claim 1. Li nevertheless generally teaches training a plurality of candidate models (i.e. 
candidate architectures) for each block of a neural network architecture:

To address the above-mentioned issues, we propose a new solution to NAS [Neural Architecture Search] where the search space is large, while the potential candidate architectures can be fully and fairly trained. We consider a network architecture that has several blocks, conceptualized as analogous to the ventral visual blocks V1, V2, V4, and IT [25] (see Fig. 1). We then train each block of the candidate architectures separately. As guaranteed by the mathematical principle, the number of candidate architectures in a block reduces exponentially compared to the number of candidates in the whole search space. Hence, the architecture candidates can be fully and fairly trained, while the representation shift caused by the shared parameters is reduced, leading to the correct candidate ratings. The correct and visiting-all evaluation improves the effectiveness of NAS. Moreover, thanks to the modest amount of the candidates in a block, we can even search for the depth of a block, which further improves the performance of NAS. (Section 1 “Introduction”; emphasis added).

Specifically, a “supernet” of candidate architectures is divided into blocks, each block having a plurality of candidate architectures, and wherein each block of candidates corresponds to a block of a teacher network that is used to train the block of candidates:

Block-wise NAS. [10] and [17] have suggested that when the search space is small, and all the candidates are fully and fairly trained, the evaluation could be accurate. To improve the accuracy of the evaluation, we divide the supernet into blocks of smaller sub-space. Specifically, Let $\mathcal{N}$ denote the supernet. We divide $\mathcal{N}$ into $N$ blocks by the depth of the supernet and have:

$$\mathcal{N} = \mathcal{N}_N \cdots \mathcal{N}_{i+1} \circ \mathcal{N}_i \cdots \circ \mathcal{N}_1, \tag{2}$$

where $\mathcal{N}_{i+1} \circ \mathcal{N}_i$ denotes that the (i + 1)-th block is originally connected to the i-th block in the supernet.
Then we learn each block of the supernet separately using:

$$W_i^{*} = \min_{W_i} L_{train}(W_i, A_i, X, Y_i), \quad i = 1, 2, \cdots, N, \tag{3}$$

where $A_i$ denote the search space in the i-th block. (Section 3.1 “Challenge of NAS and our Block-wise Search”; emphasis added).

Although we motivate well in Section 3.1, a technical barrier in our block-wise NAS is that we lack of internal ground truth in Eqn. (3). Fortunately, we find that different blocks of an existing architecture have different knowledge in extracting different patterns of an image. We also find that the knowledge not only lies, as the literature suggests, in the network parameters, but also in the network architecture. Hence, we use the block-wise representation of existing models to supervise our architecture search. Let $Y_i$ be the output feature maps of the i-th block of the supervising model (i.e., teacher model) and $\hat{Y}_i(x)$ be the output feature maps of the i-th block of the supernet. We take L2 norm as the cost function. The loss function in Eqn. (3) can be written as:

$$L_{train}(W_i, A_i, X, Y_i) = \frac{1}{K} \left\| Y_i - \hat{Y}_i(x) \right\|_2^2, \tag{6}$$

where $K$ denotes the numbers of the neurons in $Y$. (Section 3.2 “Block-wise Supervision with Distilled Architecture Knowledge”; footnote omitted and emphasis added).

Li further suggests that a top candidate from each block is then selected to assemble a best student model:

Evaluation. In our method, we aim to imitate the behavior of the teacher in every block. Thus, we estimate the learning ability of a student sub-model by its evaluation loss in each block. Our block-wise search make it possible to evaluate all the partial models (about $10^4$ in each cell). To accelerate this process, we forward-propagate a batch of input node by node in a manner similar to deep first search, with intermediate output of each node saved and reused by subsequent nodes to avoid recalculating it from the beginning.
The feature sharing evaluation algorithm is outlined in Algorithm 1. By evaluating all cells in a block of the supernet, we can get the evaluation loss of all possible paths in one block. We can easily sort this list with about $10^4$ elements in a few seconds with a single CPU. After this, we can select the top-1 partial model from every block to assemble a best student. However, we still need to find efficient models under different constraints to meet the needs of real-life applications. (Section 3.4 “Search for the Best Student Under Constraint”; emphasis added).

Li thus generally teaches generating and training a plurality of candidate student models for each block (i.e. neighborhood) of a teacher neural network model, and selecting a top candidate model from the plurality of candidate models for each block, whereby the selected candidate models are combined to form a student neural network.

It would have been obvious to one of ordinary skill in the art, having the teachings of Wang and Li before the effective filing date of the claimed invention, to modify the method taught by Wang so as to generate a plurality of candidate student models for each block (i.e. neighborhood) of the teacher neural network (i.e. first neural network) like taught by Li, wherein each of the plurality of candidate student models is trained (i.e. by, in part, producing a second output based on the input, by comparing the second output to the first output of the corresponding teacher block to generate a first gradient, and by modifying parameters of the candidate student model based on the gradient like taught by Wang as noted above) and a top candidate model is selected from the plurality of candidate models for each block, whereby the top candidate models are combined to form the student neural network (i.e. second neural network).
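The combination described above (several candidates per block, each trained against the teacher block's output, with the top candidate per block assembled into a student) can be sketched as a toy. Everything here is an assumption made for illustration: the scalar `AffineCandidate` family, the fixed learning rate, and the `max_loss` threshold that falls back to a copy of the teacher block are hypothetical and are not drawn from Wang, Li, or the claims. Each block's input is the preceding teacher block's output, following the passage of Li quoted in the rejection of claim 6.

```python
from typing import Callable, List, Sequence

Block = Callable[[float], float]


class AffineCandidate:
    """Hypothetical candidate student block y = w * x + b (b frozen at 0 if no bias)."""

    def __init__(self, use_bias: bool) -> None:
        self.w = 1.0
        self.b = 0.0
        self.use_bias = use_bias

    def __call__(self, x: float) -> float:
        return self.w * x + self.b


def eval_loss(block: Block, teacher: Block, inputs: Sequence[float]) -> float:
    """Mean squared difference between the teacher block's and a block's outputs."""
    return sum((teacher(x) - block(x)) ** 2 for x in inputs) / len(inputs)


def train(cand: AffineCandidate, teacher: Block, inputs: Sequence[float],
          lr: float = 0.05, steps: int = 2000) -> None:
    """Plain gradient descent on eval_loss; lr is tuned for the toy inputs only."""
    n = len(inputs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x in inputs:
            err = cand(x) - teacher(x)  # second output compared with first output
            grad_w += 2.0 * err * x / n
            grad_b += 2.0 * err / n
        cand.w -= lr * grad_w
        if cand.use_bias:
            cand.b -= lr * grad_b


def distill(teacher_blocks: Sequence[Block], inputs: Sequence[float],
            max_loss: float = 1e-6) -> List[Block]:
    """Per block: train all candidates, keep the top-1, or the teacher block itself."""
    selected: List[Block] = []
    current = list(inputs)
    for teacher in teacher_blocks:
        candidates = [AffineCandidate(use_bias=False), AffineCandidate(use_bias=True)]
        for cand in candidates:
            train(cand, teacher, current)
        best = min(candidates, key=lambda c: eval_loss(c, teacher, current))
        good_enough = eval_loss(best, teacher, current) <= max_loss
        selected.append(best if good_enough else teacher)
        current = [teacher(x) for x in current]  # teacher output feeds the next block
    return selected


def run(blocks: Sequence[Block], x: float) -> float:
    """Apply the assembled blocks in sequence."""
    for block in blocks:
        x = block(x)
    return x
```

For a teacher made of the blocks `lambda x: 2 * x + 1` and `lambda x: 3 * x`, `distill` would be expected to select the biased candidate for the first block, since a scale-only block cannot reproduce the offset.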
It would have been advantageous to one of ordinary skill to utilize such a combination because it would provide for the identification of an optimal architecture (and not just an optimal set of parameters) for the student neural network, as is suggested by Li (see e.g. section 1 “Introduction,” which states “[w]e also find that the knowledge not only lies, as the literature suggests, in the network parameters, but also in the network architecture… We have searched a number of architectures that have fewer parameters but significantly outperforms the supervising model, demonstrating the practicability and scalability of our DNA method.”). The “background” section of the instant Application’s specification discloses AAPA, particularly that knowledge distillation is traditionally employed to transfer knowledge from a first neural network that has a first architecture to a plurality of second candidate neural networks having a second architecture, wherein the first architecture is configured to run on a first hardware type and the second architecture is configured to run on a second hardware type different from the first hardware type: [0001] Knowledge distillation is a general model compression/optimization method that transfers the knowledge of a large teacher network (or set thereof) to a smaller student network, or from a network whose architecture is suited to run on one type of hardware to a network whose architecture is suited to run on a different type of hardware. However, in traditional end-to-end distillation, candidate student networks (which are usually a variant of the teacher network with a smaller number of layers and/or parameters, or with an architecture that is suited to a different type of hardware) must be individually trained to mimic the output of the teacher network, and then compared to one another in order to choose which student network is best in terms of complexity and/or accuracy. 
Because some layers or groups of layers in a deep neural network will be harder to distill than others, finding the ideal architecture for the student network can require consideration of a large number of candidate student networks, and thus can be both computationally expensive and time-consuming. (Applicant’s Specification, paragraph 0001. Emphasis Added.) It would have been obvious to one of ordinary skill in the art, having the teachings of Wang, Li and AAPA before the effective filing date of the claimed invention, to modify the method taught by Wang and Li such that the first neural network has a first architecture configured to run on a first hardware type, and such that each candidate student model of the plurality of candidate student models has a second architecture configured to run on a second hardware type different from the first hardware type, as is disclosed by AAPA. It would have been advantageous to one of ordinary skill to utilize such a combination, because the resulting neural network would have the knowledge/functionality of the teacher neural network (i.e. the first neural network), but would be better able to execute on the second hardware type, as is evident from AAPA (see paragraph 0001 of Applicant’s specification). Accordingly, Wang, Li and AAPA are considered to teach, to one of ordinary skill in the art, a method like that of claim 1, which is for using a first neural network to generate a second neural network. As per claim 4, it would have been obvious, as is described above, to modify the method taught by Wang so as to generate a plurality of candidate student models for each block (i.e. neighborhood) of the teacher neural network (i.e. first neural network) like taught by Li, wherein each of the plurality of candidate student models is trained and a top candidate model is selected from the plurality of candidate models for each block. Li suggests that the selected model (i.e. 
the top candidate model) is identified based at least in part on a comparison of a measurement (i.e. an “evaluation loss”) of how closely each candidate student model of the plurality of candidate student models approximates the output of the given neighborhood:

Evaluation. In our method, we aim to imitate the behavior of the teacher in every block. Thus, we estimate the learning ability of a student sub-model by its evaluation loss in each block. Our block-wise search make it possible to evaluate all the partial models (about $10^4$ in each cell). To accelerate this process, we forward-propagate a batch of input node by node in a manner similar to deep first search, with intermediate output of each node saved and reused by subsequent nodes to avoid recalculating it from the beginning. The feature sharing evaluation algorithm is outlined in Algorithm 1. By evaluating all cells in a block of the supernet, we can get the evaluation loss of all possible paths in one block. We can easily sort this list with about $10^4$ elements in a few seconds with a single CPU. After this, we can select the top-1 partial model from every block to assemble a best student. However, we still need to find efficient models under different constraints to meet the needs of real-life applications. (Section 3.4 “Search for the Best Student Under Constraint”; emphasis added).

Accordingly, the above-described combination of Wang, Li and AAPA is further considered to teach a method like that of claim 4.

As per claim 6, Wang teaches that the input (i.e. the input to the teacher block/neighborhood and to the candidate student block/model) comprises an output received from a neighborhood (i.e. block) preceding the given neighborhood (see e.g. section 2.2 “Progressive Blockwise Learning;” equation (11) therein demonstrates that the input to teacher block $t_k$ and student block $s_k$ comprises the output of student block $s_{k-1}$). Li provides a similar teaching (see e.g.
section 3.2 “Block-wise Supervision with Distilled Architecture Knowledge,” which states “[s]pecifically, for each block, we use the output $Y_{i-1}$ of the (i − 1)-th block of the teacher model as the input of the i-th block of the supernet.”). Accordingly, the above-described combination of Wang, Li and AAPA is further considered to teach a method like that of claim 6.

Regarding claim 11, and like noted above, Wang describes “a progressive blockwise learning scheme for teacher-student model distillation at the subnetwork block level,” wherein “[t]he proposed scheme is able to distill the knowledge of the entire teacher network by locally extracting the knowledge of each block in terms of progressive blockwise function approximation.” (Abstract). Like claimed, Wang particularly teaches:

for each given neighborhood of a plurality of neighborhoods, each given neighborhood comprising a piece of a first neural network: (Wang discloses that the progressive blockwise learning scheme “converts a sequence of teacher subnetwork blocks into a sequence of student subnetwork blocks after progressive blockwise optimization.” (Section I “Introduction”). In particular, Wang discloses that a teacher neural network can be divided into a number of subnetwork blocks:

A neural network is mainly comprised of the convolution layers, the pooling layers, and the fully connected layers. The subnetwork between two adjacent pooling layers is defined as a subnetwork block. Let a complicated network $T$ be the teacher network, which is composed of $N$ subnetwork blocks:

$$T = c \circ t_N \circ t_{N-1} \circ \cdots \circ t_1 \tag{1}$$

Where $t_i$ ($i \in \{1, 2, \ldots, N\}$) is the mapping function of the $i$-th block in the sequence and $c$ is the mapping function of the classifier.
To simplify the representation of the network, we shorten it as:

$$\prod_{i=1}^{N} \circ\, t_i = t_N \circ t_{N-1} \circ \cdots \circ t_1 \tag{2}$$

Therefore, $T$ is rewritten as:

$$T = c \circ \prod_{i=1}^{N} \circ\, t_i \tag{3}$$

The parameters of the teacher network are denoted as:

$$W_T = \{W_c, W_{t_N}, W_{t_{N-1}}, \ldots, W_{t_1}\} \tag{4}$$

Where $W_c$ and $W_{t_i}$ ($i \in \{1, 2, \ldots, N\}$) are the parameters of $c$ and $t_i$. (Section 2.1 “Problem Definition”; emphasis added).

In this paper, we utilize the widely used network VGG-16 [Simonyan and Zisserman, 2015] as the teacher network sample. We divide the VGG-16 network into five blocks using the pooling layers as the boundaries. For each block, we use our block design criterion described in Section 2.3 to obtain the student subnetwork block. (Section 3.2 “Implementation Details”; emphasis added).

The teacher neural network described by Wang is considered a first neural network like claimed, and each subnetwork block of the divided teacher neural network is considered a “neighborhood” like claimed.):

generating a candidate student model (Wang discloses that a student neural network is designed to include a number of subnetwork blocks, each corresponding to a respective subnetwork block of the teacher neural network:

Our objective is to design a student network with high computational efficiency and low memory usage, and to learn the corresponding optimal parameters. The student network composed of $N$ student subnetwork blocks can be written as:

$$S = c \circ \prod_{i=1}^{N} \circ\, s_i \tag{5}$$

where $s_i$ denotes a student subnetwork block. The corresponding optimal parameters of the student network are denoted as:

$$W_S = \{W_c, W_{s_N}, W_{s_{N-1}}, \ldots, W_{s_1}\} \tag{6}$$

In essence, the problem is to design $N$ student subnetwork blocks sequence $S$ and optimize the corresponding parameters $W_S$ using the prior knowledge of $N$ teacher subnetwork blocks sequence $T$:

$$\left(c \circ \prod_{i=1}^{N} \circ\, t_i\right)(x; W_T) \xrightarrow[\text{optimize } W_S]{\text{design } S} \left(c \circ \prod_{i=1}^{N} \circ\, s_i\right)(x; W_S) \tag{7}$$

(Section 2.1 “Problem Definition”).
We propose a blockwise design scheme. Based on our blockwise design scheme, the student subnetwork block we design should not change the input/output size of the next/previous block and maintain the receptive field at the same time. In addition, this student subnetwork block is based on the teacher subnetwork block but contains fewer parameters and FLOPs (floating point operations). (Section 2.3 “Student Subnetwork Block Design”). The respective student subnetwork block generated for each teacher subnetwork block, i.e. neighborhood, is considered a “candidate student model” like claimed.); receiving a first output from the given neighborhood, the first output having been produced by the given neighborhood based on an input (Wang discloses that the plurality of subnetwork blocks of the student model are learned in a plurality of stages, wherein each stage optimizes a respective student subnetwork block while keeping the other blocks fixed: To reduce the optimization difficulty described in Eq. (7), we propose a progressive blockwise learning scheme. As shown in Fig. 1, our blockwise learning scheme learns the sequence of student subnetwork blocks by N block learning stages, and only optimizes one block at each learning stage while keeping the other blocks fixed. To better introduce our blockwise scheme, we use the auxiliary function $A_k$ ($k \in \{0, 1, \ldots, N\}$) to represent our intermediate network at the k-th block learning stage: $A_k = c \circ \left(\prod_{i=k+1}^{N} \circ\, t_i\right) \circ \left(\prod_{j=1}^{k} \circ\, s_j\right)$ (8) Where $s_j$ is the optimized student network block and $t_i$ is the teacher network block. The parameters of $A_k$ are denoted as below: $W_{A_k} = \{W_c, W_{t_N}, \ldots, W_{t_{k+1}}, W_{s_k}, W_{s_{k-1}}, \ldots, W_{s_1}\}$ (9) As can be noted from the description of $A_k$, $A_0$ is the teacher network T and $A_N$ is the optimized network S. Hence, the problem defined in Eq. 
(7) can be solved as below: $A_0 \xrightarrow{\text{stage } 1} A_1 \xrightarrow{\text{stage } 2} A_2 \rightarrow \cdots \rightarrow A_{k-1} \xrightarrow{\text{stage } k} A_k \rightarrow \cdots \rightarrow A_N$ (10) (Section 2.2 “Progressive Blockwise Learning”; emphasis added). Wang particularly discloses that, at each learning stage, the parameters of a student subnetwork block are updated by using a loss function that is based, in part, on the output of the student subnetwork block compared with the output of its corresponding block in the teacher network, when both the student and teacher networks are given the same input: More specifically, we use the teacher-student learning strategy at each block learning stage. At the block learning stage 1, we consider $A_0$/$A_1$ as the teacher/student network, and then learn $A_1$ from $A_0$. Similarly, we learn $A_2$ from $A_1$ at the block learning stage 2. By analogy, using such a progressive way, we will eventually solve the problem in Eq. (10). In order to solve the teacher-student network optimization problem at each block learning stage, we use two terms to compose the objective loss function. Taking the k-th block learning stage for example, the first term of the loss compares the output of the block $s_k$ with its corresponding block in the teacher network $t_k$, and is formulated as: $L_{local}^{k}(I; W_{s_k}) = \frac{1}{2} \left\| \left(t_k \circ \prod_{i=1}^{k-1} \circ\, s_i\right)\left(I; W_{t_k} \cup \{W_{s_i}\}_{i=1}^{k-1}\right) - \left(\prod_{i=1}^{k} \circ\, s_i\right)\left(I; \{W_{s_i}\}_{i=1}^{k-1}\right) \right\|_F^2$ (11) where I is the input of the network (e.g., image) and $\|\cdot\|_F$ denotes the Frobenius norm. The second classification loss term $L_{cls}^{k}$ is to make the output of the student network approximate the ground truth, which is defined as: $L_{cls}^{k}(I, y; W_{s_k}) = \mathrm{softmax}(A_k(I; W_{A_k}), y)$ (12) where y is the ground truth for I and softmax(.) 
means the softmax loss between the network’s output and y as described in [Simonyan and Zisserman, 2015]. Then the objective loss function for one training sample (I, y) at the k-th block learning stage is as below: $L^{k}(I, y; W_{s_k}) = \lambda_{local} L_{local}^{k}(I; W_{s_k}) + L_{cls}^{k}(I, y; W_{s_k})$ (13) where $\lambda_{local}$ is a hyper-parameter to balance these two terms of the loss function. Therefore, our objective loss function for the training data $\{(I_1, y_1), \ldots, (I_M, y_M)\}$ is: $L^{k}(W_{s_k}) = \frac{1}{M} \sum_{m=1}^{M} L^{k}(I_m, y_m; W_{s_k})$ (14) By optimizing this loss function, the student subnetwork block can be trained under the ground truths and knowledge of the teacher subnetwork block. The details of our progressive blockwise learning scheme are shown in Alg. 1. (Section 2.2 “Progressive Blockwise Learning”; emphasis added). [Image: Algorithm 1 of Wang] (Emphasis added). Using such a loss function would thus entail receiving a first output from the corresponding teacher block, i.e. neighborhood, the first output having been produced by the block based on an input.); receiving a second output, the second output having been produced by the candidate student model based on the input (As noted above, Wang discloses that the plurality of subnetwork blocks of the student model are learned in a plurality of stages; at each learning stage, the parameters of a student subnetwork block are updated by using a loss function that is based, in part, on the output of the student subnetwork block compared with the output of its corresponding block in the teacher network, when both the student and teacher networks are given the same input. Using such a loss function would thus entail receiving an output produced by the student subnetwork block, i.e. 
the candidate student model, based on the input.); comparing the first output to the second output to generate a first training gradient corresponding to the candidate student model (As noted above, Wang discloses that at each learning stage, the parameters of a given student subnetwork block are updated by using a loss function that is based, in part, on the output of the student subnetwork block compared with the output of its corresponding block in the teacher network. A first training gradient, i.e. an amount to update the parameters of the student subnetwork block, is thus understandably generated based on this loss function, which is indicative of a comparison of the first output to the second output.); modifying one or more parameters of the candidate student model based at least in part on the first training gradient corresponding to the candidate student model (As noted above, Wang discloses that at each learning stage, the parameters of a given student subnetwork block are updated by using a loss function that is based, in part, on the output of the student subnetwork block compared with the output of its corresponding block in the teacher network. The parameters of the student subnetwork block are thus modified based at least in part on the first training gradient, i.e. an amount based on the loss function.); and identifying a selected model, the selected model being a copy of the candidate student model or a copy of the given neighborhood (As noted above, Wang discloses that the plurality of subnetwork blocks of the student model are learned in a plurality of stages, wherein each stage optimizes a respective student subnetwork block, i.e. a respective candidate student model. At the end of the training, a student model comprising all the learned subnetwork blocks of the student is returned. See e.g. “Algorithm 1,” which is excerpted above. Returning the student model would entail identifying a selected model, i.e. 
a learned student subnetwork block, for each block of the student model, the selected model being a copy of a candidate student model.); and combining the selected model corresponding to each given neighborhood of the plurality of neighborhoods to form a second neural network (As noted above, Wang discloses that the student neural network is designed to include a number of subnetwork blocks, each corresponding to a respective subnetwork block of the teacher neural network. As further noted above, Wang discloses that the plurality of subnetwork blocks of the student model are learned in a plurality of stages, wherein at the end of the training, a student model comprising all the learned subnetwork blocks of the student is returned. The student model is a combination of the selected models, i.e. learned student subnetwork blocks, each corresponding to a given neighborhood of the plurality of neighborhoods, i.e. corresponding to a subnetwork block of the plurality of subnetwork blocks of the teacher model. The returned student model is considered a second neural network like claimed.). Wang suggests that such teachings can be implemented on a system comprising a memory (i.e. to store programming instructions) and one or more processors coupled to the memory, the one or more processors being configured (i.e. by the programming instructions) to carry-out the above-noted tasks (see e.g. section 3.2 “Implementation Details,” which recites “[w]e implement our architecture using Caffe [Jia et al., 2014] and use an NVIDIA TITAN X GPU to train the network.”). Such a system for implementing the above-described teachings of Wang is considered a processing system similar to that of claim 11. Wang, however, does not disclose that a plurality of candidate student models are generated for each given neighborhood, wherein each of the plurality of candidate student models is trained (i.e. 
by producing a second output based on the input, by comparing the second output to the first output to generate a first gradient, and by modifying parameters of the model based on the gradient), and wherein the selected model for each given neighborhood is a copy of one of the plurality of candidate student models or a copy of the given neighborhood, as is required by claim 11. Moreover, Wang does not explicitly disclose that the first neural network has a first architecture configured to run on a first hardware type, and that each candidate student model of the plurality of candidate student models has a second architecture configured to run on a second hardware type different from the first hardware type, as is further required by claim 11. Like noted above, Li nevertheless generally teaches training a plurality of candidate models (i.e. candidate architectures) for each block of a neural network architecture: To address the above-mentioned issues, we propose a new solution to NAS [Neural Architecture Search] where the search space is large, while the potential candidate architectures can be fully and fairly trained. We consider a network architecture that has several blocks, conceptualized as analogous to the ventral visual blocks V1, V2, V4, and IT [25] (see Fig. 1). We then train each block of the candidate architectures separately. As guaranteed by the mathematical principle, the number of candidate architectures in a block reduces exponentially compared to the number of candidates in the whole search space. Hence, the architecture candidates can be fully and fairly trained, while the representation shift caused by the shared parameters is reduced, leading to the correct candidate ratings. The correct and visiting-all evaluation improves the effectiveness of NAS. Moreover, thanks to the modest amount of the candidates in a block, we can even search for the depth of a block, which further improves the performance of NAS. 
(Section 1 “Introduction”; emphasis added). Specifically, a “supernet” of candidate architectures is divided into blocks, each block having a plurality of candidate architectures, and wherein each block of candidates corresponds to a block of a teacher network that is used to train the block of candidates: Block-wise NAS. [10] and [17] have suggested that when the search space is small, and all the candidates are fully and fairly trained, the evaluation could be accurate. To improve the accuracy of the evaluation, we divide the supernet into blocks of smaller sub-space. Specifically, Let $\mathcal{N}$ denote the supernet. We divide $\mathcal{N}$ into N blocks by the depth of the supernet and have: $\mathcal{N} = \mathcal{N}_N \cdots \mathcal{N}_{i+1} \circ \mathcal{N}_i \cdots \circ \mathcal{N}_1$, (2) where $\mathcal{N}_{i+1} \circ \mathcal{N}_i$ denotes that the (i + 1)-th block is originally connected to the i-th block in the supernet. Then we learn each block of the supernet separately using: $W_i^* = \min_{W_i} L_{train}(W_i, A_i, X, Y_i), \quad i = 1, 2, \cdots, N$, (3) where $A_i$ denote the search space in the i-th block. (Section 3.1 “Challenge of NAS and our Block-wise Search”; emphasis added). Although we motivate well in Section 3.1, a technical barrier in our block-wise NAS is that we lack of internal ground truth in Eqn. (3). Fortunately, we find that different blocks of an existing architecture have different knowledge in extracting different patterns of an image. We also find that the knowledge not only lies, as the literature suggests, in the network parameters, but also in the network architecture. Hence, we use the block-wise representation of existing models to supervise our architecture search. Let $Y_i$ be the output feature maps of the i-th block of the supervising model (i.e., teacher model) and $\hat{Y}_i(x)$ be the output feature maps of the i-th block of the supernet. We take L2 norm as the cost function. The loss function in Eqn. 
(3) can be written as: $L_{train}(W_i, A_i, X, Y_i) = \frac{1}{K} \left\| Y_i - \hat{Y}_i(x) \right\|_2^2$, (6) where K denotes the numbers of the neurons in Y. (Section 3.2 “Block-wise Supervision with Distilled Architecture Knowledge”; footnote omitted and emphasis added). Li further suggests that a top candidate from each block is then selected to assemble a best student model: Evaluation. In our method, we aim to imitate the behavior of the teacher in every block. Thus, we estimate the learning ability of a student sub-model by its evaluation loss in each block. Our block-wise search make it possible to evaluate all the partial models (about $10^4$ in each cell). To accelerate this process, we forward-propagate a batch of input node by node in a manner similar to deep first search, with intermediate output of each node saved and reused by subsequent nodes to avoid recalculating it from the beginning. The feature sharing evaluation algorithm is outlined in Algorithm 1. By evaluating all cells in a block of the supernet, we can get the evaluation loss of all possible paths in one block. We can easily sort this list with about $10^4$ elements in a few seconds with a single CPU. After this, we can select the top-1 partial model from every block to assemble a best student. However, we still need to find efficient models under different constraints to meet the needs of real-life applications. (Section 3.4 “Search for the Best Student Under Constraint”; emphasis added). Li thus generally teaches generating and training a plurality of candidate student models for each block (i.e. neighborhood) of a teacher neural network model, and selecting a top candidate model from the plurality of candidate models for each block, whereby the selected candidate models are combined to form a student neural network. 
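For illustration only, the block-wise training of several candidates per teacher block, followed by selection of the lowest-loss candidate for each block, might be sketched as below. The sketch is not taken from Wang or Li. It assumes that each block is a plain linear map, that a candidate "architecture" is a binary mask on a weight matrix, and that the evaluation loss is a mean squared error; the names `train_candidate` and `distill_blockwise` are hypothetical.

```python
import numpy as np

def train_candidate(x, y, mask, steps=300):
    """Fit a masked linear map x @ (w * mask) to the teacher block output y."""
    w = np.zeros(mask.shape)
    # Step size 1/L, where L bounds the curvature of the mean squared error.
    lr = y.size / (2.0 * np.linalg.norm(x, 2) ** 2)
    for _ in range(steps):
        err = x @ (w * mask) - y            # candidate output minus teacher output
        grad = 2.0 * x.T @ err / err.size   # gradient of the mean squared error
        w -= lr * grad * mask               # update only the weights the mask allows
    return w * mask

def distill_blockwise(teacher_blocks, candidate_masks, x):
    """Train every candidate for every block and keep the lowest-loss one per block."""
    selected, losses = [], []
    block_in = x
    for t in teacher_blocks:
        block_out = block_in @ t            # output of the teacher block
        best_w, best_loss = None, np.inf
        for mask in candidate_masks:
            w = train_candidate(block_in, block_out, mask)
            loss = float(np.mean((block_in @ w - block_out) ** 2))
            if loss < best_loss:
                best_w, best_loss = w, loss
        selected.append(best_w)
        losses.append(best_loss)
        block_in = block_out                # teacher output feeds the next block
    return selected, losses

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 4))
teacher_blocks = [rng.normal(size=(4, 4)) for _ in range(3)]
candidate_masks = [np.eye(4), np.ones((4, 4))]   # diagonal-only vs. dense candidate
student_blocks, block_losses = distill_blockwise(teacher_blocks, candidate_masks, x)
```

In this sketch the list `student_blocks` plays the role of the assembled student: one selected candidate per teacher block, in the teacher's block order.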
It would have been obvious to one of ordinary skill in the art, having the teachings of Wang and Li before the effective filing date of the claimed invention, to modify the processing system taught by Wang so as to generate a plurality of candidate student models for each block (i.e. neighborhood) of the teacher neural network (i.e. first neural network) like taught by Li, wherein each of the plurality of candidate student models is trained (i.e. by, in part, producing a second output based on the input, by comparing the second output to the first output of the corresponding teacher block to generate a first gradient, and by modifying parameters of the candidate student model based on the gradient like taught by Wang as noted above) and a top candidate model is selected from the plurality of candidate models for each block, whereby the top candidate models are combined to form the student neural network (i.e. second neural network). It would have been advantageous to one of ordinary skill to utilize such a combination because it would provide for the identification of an optimal architecture (and not just an optimal set of parameters) for the student neural network, as is suggested by Li (see e.g. section 1 “Introduction,” which states “[w]e also find that the knowledge not only lies, as the literature suggests, in the network parameters, but also in the network architecture… We have searched a number of architectures that have fewer parameters but significantly outperforms the supervising model, demonstrating the practicability and scalability of our DNA method.”). 
As noted above, the “background” section of the instant Application’s specification discloses AAPA, particularly, that knowledge distillation is traditionally employed to transfer knowledge from a first neural network that has a first architecture to a plurality of second candidate neural networks having a second architecture, wherein the first architecture is configured to run on a first hardware type and the second architecture is configured to run on a second hardware type different from the first hardware type: [0001] Knowledge distillation is a general model compression/optimization method that transfers the knowledge of a large teacher network (or set thereof) to a smaller student network, or from a network whose architecture is suited to run on one type of hardware to a network whose architecture is suited to run on a different type of hardware. However, in traditional end-to-end distillation, candidate student networks (which are usually a variant of the teacher network with a smaller number of layers and/or parameters, or with an architecture that is suited to a different type of hardware) must be individually trained to mimic the output of the teacher network, and then compared to one another in order to choose which student network is best in terms of complexity and/or accuracy. Because some layers or groups of layers in a deep neural network will be harder to distill than others, finding the ideal architecture for the student network can require consideration of a large number of candidate student networks, and thus can be both computationally expensive and time-consuming. (Applicant’s Specification, paragraph 0001. Emphasis Added.) 
It would have been obvious to one of ordinary skill in the art, having the teachings of Wang, Li and AAPA before the effective filing date of the claimed invention, to modify the processing system taught by Wang and Li such that the first neural network has a first architecture configured to run on a first hardware type, and such that each candidate student model of the plurality of candidate student models has a second architecture configured to run on a second hardware type different from the first hardware type, as is disclosed by AAPA. It would have been advantageous to one of ordinary skill to utilize such a combination, because the resulting neural network would have the knowledge/functionality of the teacher neural network (i.e. the first neural network), but would be better able to execute on the second hardware type, as is evident from AAPA (see paragraph 0001 of Applicant’s specification). Accordingly, Wang, Li and AAPA are considered to teach, to one of ordinary skill in the art, a processing system like that of claim 11. As per claim 14, it would have been obvious, as is described above, to modify the processing system taught by Wang so as to generate a plurality of candidate student models for each block (i.e. neighborhood) of the teacher neural network (i.e. first neural network) like taught by Li, wherein each of the plurality of candidate student models is trained and a top candidate model is selected from the plurality of candidate models for each block. Li suggests that the selected model (i.e. the top candidate model) is identified based at least in part on a comparison of a measurement (i.e. an “evaluation loss”) of how closely each candidate student model of the plurality of candidate student models approximates the output of the given neighborhood: Evaluation. In our method, we aim to imitate the behavior of the teacher in every block. Thus, we estimate the learning ability of a student sub-model by its evaluation loss in each block. 
Our block-wise search make it possible to evaluate all the partial models (about $10^4$ in each cell). To accelerate this process, we forward-propagate a batch of input node by node in a manner similar to deep first search, with intermediate output of each node saved and reused by subsequent nodes to avoid recalculating it from the beginning. The feature sharing evaluation algorithm is outlined in Algorithm 1. By evaluating all cells in a block of the supernet, we can get the evaluation loss of all possible paths in one block. We can easily sort this list with about $10^4$ elements in a few seconds with a single CPU. After this, we can select the top-1 partial model from every block to assemble a best student. However, we still need to find efficient models under different constraints to meet the needs of real-life applications. (Section 3.4 “Search for the Best Student Under Constraint”; emphasis added). Accordingly, the above-described combination of Wang, Li and AAPA is further considered to teach a system like that of claim 14. As per claim 16, Wang teaches that the input (i.e. the input to the teacher block/neighborhood and to the candidate student block/model) comprises an output received from a neighborhood (i.e. block) preceding the given neighborhood (see e.g. section 2.2 “Progressive Blockwise Learning:” equation (11) therein demonstrates that the input to teacher block $t_k$ and student block $s_k$ comprises the output of student block $s_{k-1}$). Li provides a similar teaching (see e.g. section 3.2 “Block-wise Supervision with Distilled Architecture Knowledge,” which states “[s]pecifically, for each block, we use the output $Y_{i-1}$ of the (i − 1)-th block of the teacher model as the input of the i-th block of the supernet.”). Accordingly, the above-described combination of Wang, Li and AAPA is further considered to teach a system like that of claim 16. Claims 2, 3, 12 and 13 are rejected under 35 U.S.C. 
103 as being unpatentable over the combination of Wang, Li and AAPA, which is described above, and also over the article entitled “A Novel Layerwise Pruning Method for Model Reduction of Fully Connected Deep Neural Networks” by Mauch et al. (“Mauch”). Regarding claims 2 and 12, Wang, Li and AAPA teach a method like that of claim 1 and a system like that of claim 11, as are described above, and which entail identifying a selected model, the selected model being a copy of one of a plurality of candidate student models or a copy of a given neighborhood. As noted above (see the rejections for claims 4 and 14), Li suggests that the selected model (i.e. a top candidate model) can be identified based at least in part on a comparison of a measurement (i.e. an accuracy) of how closely each candidate student model of the plurality of candidate student models approximates the output of the given neighborhood. Wang, Li and AAPA, however, do not explicitly disclose that the selected model is identified based at least in part on a comparison of a size of each candidate student model of the plurality of candidate student models, as is required by claims 2 and 12. Mauch nevertheless generally teaches that it is advantageous to reduce the size and number of layers of a deep neural network (DNN) as much as possible while maintaining its accuracy, because larger DNNs are computationally demanding and are therefore not suitable for mobile and embedded devices: Deep neural networks (DNN) are the state of the art method for many machine learning tasks such as image recognition, segmentation and natural language processing [1 , 2]. However, evaluating a trained DNN is computationally demanding if the network is very deep and consists of layers with many neurons. Therefore, DNNs are limited to applications with enough computational power and memory. They are not suitable for mobile and embedded devices. 
Both the ability of a DNN to learn an arbitrary nonlinear relationship and its computational complexity depend on the depth of the network and the width of its layers [3, 4]. A DNN with many parameters allows for learning an arbitrary complex transfer function, but is computationally expensive. Using a shallow DNN with a few narrow layers reduces the computational complexity, but also limits the achievable complexity of the transfer function [5]. There is no theory today how to choose the most efficient DNN structure, i.e. a DNN with a minimum number of layers and neurons while allowing for a transfer function that is complex enough to solve a given task. In practice, very deep networks with a huge number of parameters are favoured because they can solve a broad class of tasks with high accuracy [6, 7]. However, most likely the chosen DNN structure has much more parameters than needed and a model with less parameters could solve the task with the same accuracy. To obtain an efficient DNN, we could optimize the DNN structure during training, e.g. by varying the number of layers and neurons. But this leads to a difficult combinatorial optimization problem. Another possibility is to start with a trained model with many parameters and try to reduce the number of parameters after training without reducing the accuracy. This technique is called model reduction. (Section 1 “Introduction”; emphasis added). It therefore would have been obvious to one of ordinary skill in the art, having the teachings of Wang, Li, AAPA and Mauch before the effective filing date of the claimed invention, to modify the method and system taught by Wang, Li and AAPA so as to also consider model size (i.e. 
in addition to accuracy) like taught by Mauch when identifying the selected model from the plurality of candidate student models; that is it would have been obvious to identify the selected model based at least in part on a comparison of a size of each candidate student model of the plurality of candidate student models. It would have been advantageous to one of ordinary skill to utilize such a combination because it could result in a reduced model (i.e. the student neural network) that is more suitable for a mobile and embedded device, as is suggested by Mauch (see e.g. the portion of Section 1 “Introduction” excerpted above). Accordingly, Wang, Li, AAPA and Mauch are considered to teach, to one of ordinary skill in the art, a method like that of claim 2 and a system like that of claim 12. Regarding claims 3 and 13, Wang, Li and AAPA teach a method like that of claim 1 and a system like that of claim 11, as are described above, and which entail identifying a selected model, the selected model being a copy of one of a plurality of candidate student models or a copy of a given neighborhood. As noted above (see the rejections for claims 4 and 14), Li suggests that the selected model (i.e. a top candidate model) can be identified based at least in part on a comparison of a measurement (i.e. an accuracy) of how closely each candidate student model of the plurality of candidate student models approximates the output of the given neighborhood. Wang, Li and AAPA, however, do not explicitly disclose that the selected model is identified based at least in part on a comparison of a number of layers of each candidate student model of the plurality of candidate student models, as is required by claims 3 and 13. 
Mauch nevertheless generally teaches that it is advantageous to reduce the size and number of layers of a deep neural network (DNN) as much as possible while maintaining its accuracy, because larger DNNs are computationally demanding and are therefore not suitable for mobile and embedded devices: Deep neural networks (DNN) are the state of the art method for many machine learning tasks such as image recognition, segmentation and natural language processing [1 , 2]. However, evaluating a trained DNN is computationally demanding if the network is very deep and consists of layers with many neurons. Therefore, DNNs are limited to applications with enough computational power and memory. They are not suitable for mobile and embedded devices. Both the ability of a DNN to learn an arbitrary nonlinear relationship and its computational complexity depend on the depth of the network and the width of its layers [3, 4]. A DNN with many parameters allows for learning an arbitrary complex transfer function, but is computationally expensive. Using a shallow DNN with a few narrow layers reduces the computational complexity, but also limits the achievable complexity of the transfer function [5]. There is no theory today how to choose the most efficient DNN structure, i.e. a DNN with a minimum number of layers and neurons while allowing for a transfer function that is complex enough to solve a given task. In practice, very deep networks with a huge number of parameters are favoured because they can solve a broad class of tasks with high accuracy [6, 7]. However, most likely the chosen DNN structure has much more parameters than needed and a model with less parameters could solve the task with the same accuracy. To obtain an efficient DNN, we could optimize the DNN structure during training, e.g. by varying the number of layers and neurons. But this leads to a difficult combinatorial optimization problem. 
Another possibility is to start with a trained model with many parameters and try to reduce the number of parameters after training without reducing the accuracy. This technique is called model reduction. (Section 1 “Introduction”; emphasis added). It therefore would have been obvious to one of ordinary skill in the art, having the teachings of Wang, Li, AAPA and Mauch before the effective filing date of the claimed invention, to modify the method and system taught by Wang, Li and AAPA so as to also consider the number of layers (i.e. in addition to accuracy) like taught by Mauch when identifying the selected model from the plurality of candidate student models; that is it would have been obvious to identify the selected model based at least in part on a comparison of a number of layers of each candidate student model of the plurality of candidate student models. It would have been advantageous to one of ordinary skill to utilize such a combination because it could result in a reduced model (i.e. the student neural network) that is more suitable for a mobile and embedded device, as is suggested by Mauch (see e.g. the portion of Section 1 “Introduction” excerpted above). Accordingly, Wang, Li, AAPA and Mauch are considered to teach, to one of ordinary skill in the art, a method like that of claim 3 and a system like that of claim 13. Claims 5 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Wang, Li and AAPA, which is described above, and also over U.S. Patent Application Publication No. 2020/0184333 to Oh (“Oh”). Regarding claims 5 and 15, Wang, Li and AAPA teach a method like that of claim 4 and a system like that of claim 14, as are described above, and which entail identifying a selected model based at least in part on a comparison of a measurement of how closely each candidate student model (i.e. student neural network block) of a plurality of candidate student models approximates an output of a given neighborhood (i.e. 
a teacher neural network block). Wang teaches that a measurement of how closely a candidate student model approximates the output of a given neighborhood can be based at least in part on an error between an output of the given neighborhood based on a given input and an output of the candidate student model based on the given input (see e.g. section 2.2 “Progressive Blockwise Learning,” which describes a loss function that has a first term based on a difference, i.e. an error, between an output of the teacher neural network block and an output of the student neural network block when given the same input). Li provides a similar teaching (see e.g. section 3.2 “Block-wise Supervision with Distilled Architecture Knowledge,” which describes a loss function similar to that taught by Wang). Accordingly, Wang, Li and AAPA are further considered to teach a method similar to that of claim 5 and a system similar to that of claim 15. Wang, Li and AAPA, however, do not explicitly disclose that the error is a mean square error like required by claims 5 and 15. Oh nevertheless generally teaches using a mean squared error as a measurement of how closely a neural network approximates a correct output (see e.g. paragraphs 0078-0079). It would have been obvious to one of ordinary skill in the art, having the teachings of Wang, Li, AAPA and Oh before the effective filing date of the claimed invention, to modify the method and system taught by Wang, Li and AAPA such that the determined error (i.e. the error between the output of the given neighborhood and the output of each candidate student model) is a mean squared error like taught by Oh. It would have been advantageous to one of ordinary skill to utilize such a mean squared error because it can provide a good indication of accuracy, as is evident from Oh (see e.g. paragraphs 0078-0079). 
Accordingly, Wang, Li, AAPA and Oh are considered to teach, to one of ordinary skill in the art, a method like that of claim 5 and a system like that of claim 15. Claims 7, 8, 10, 17, 18 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Wang, Li and AAPA, which is described above, and also over the article entitled “FitNets: Hints for Thin Deep Nets” by Romero et al. (“Romero”). Regarding claims 7 and 17, Wang, Li and AAPA teach a method like that of claim 1 and a system like that of claim 11, as are described above, and which entail dividing a first neural network into a plurality of neighborhoods, and whereby for each given neighborhood: (i) a plurality of candidate student models is generated; (ii) a first output produced by the given neighborhood based on an input is received; (iii) a plurality of second outputs respectively produced by candidate student models of the plurality of candidate student models is received; and (iv) the first output is compared to each of the plurality of second outputs to generate a first training gradient corresponding to each candidate student model. Like in claims 7 and 17, Wang further teaches, for each neighborhood of the plurality of neighborhoods: (i) providing (i.e. by one or more processors) a second output (i.e. an output of a candidate student model/student subnetwork block) to a head model, the head model comprising a copy of a portion of the first neural network which directly follows the given neighborhood; (ii) receiving (i.e. by the one or more processors) an additional output (i.e. a “fourth output” like claimed), the additional output having been produced by the head model based on the second output; (iii) comparing (i.e. by the one or more processors) a ground truth to the additional output to generate a second training gradient corresponding to the candidate student model; and (iv) modifying (i.e. 
by the one or more processors) the one or more parameters of the candidate student model based at least in part on the second training gradient corresponding to the candidate student model (Wang discloses that the loss function comprises a second term that is based on a loss between the network’s output and a ground truth, and wherein the parameters of the student subnetwork block are also updated based on the value of this second term:

The second classification loss term $L_{cls}^{k}$ is to make the output of the student network approximate the ground truth, which is defined as:

$$L_{cls}^{k}(I, y; W_{s_k}) = \mathrm{softmax}\left(A_{k}(I; W_{A_k}),\, y\right) \tag{12}$$

where y is the ground truth for I and softmax(.) means the softmax loss between the network’s output and y as described in [Simonyan and Zisserman, 2015]. Then the objective loss function for one training sample (I, y) at the k-th block learning stage is as below:

$$L^{k}(I, y; W_{s_k}) = \lambda_{local}\, L_{local}^{k}(I; W_{s_k}) + L_{cls}^{k}(I, y; W_{s_k}) \tag{13}$$

where $\lambda_{local}$ is a hyper-parameter to balance these two terms of the loss function. Therefore, our objective loss function for the training data $\{(I_1, y_1), \ldots, (I_M, y_M)\}$ is:

$$L^{k}(W_{s_k}) = \frac{1}{M} \sum_{m=1}^{M} L^{k}(I_m, y_m; W_{s_k}) \tag{14}$$

By optimizing this loss function, the student subnetwork block can be trained under the ground truths and knowledge of the teacher subnetwork block. The details of our progressive blockwise learning scheme are shown in Alg. 1.

(Section 2.2 “Progressive Blockwise Learning”; emphasis added). Determining the value of this second term of the loss function for a given training input would entail providing the output of the student subnetwork block to the portion of the neural network that directly follows the student subnetwork block, i.e. to a head model, so that the neural network can produce an output, i.e. an additional output/“fourth output,” which is produced by the head model based on the output of the student subnetwork block.
The second term of the loss function is indicative of a comparison between this additional output and a ground truth, whereby a second training gradient, i.e. an amount to update the parameters of the student subnetwork block, is thus understandably generated based on the second term of the loss function.). As described above, it would have been obvious to modify the method and system taught by Wang so as to generate a plurality of candidate student models for each block (i.e. neighborhood) of the teacher neural network (i.e. first neural network) like taught by Li, wherein each of the plurality of candidate student models is trained and a top candidate model is selected from the plurality of candidate models for each block. It thus follows that the loss function taught by Wang would be applied for each of the plurality of candidate student models for each block, i.e. wherein each of the plurality of second outputs is provided to the head model, a plurality of additional/fourth outputs is received, and each additional/fourth output is compared to the ground truth to generate and apply a second training gradient corresponding to each candidate student model of the plurality of candidate student models. Accordingly, Wang, Li and AAPA are further considered to teach a method similar to that of claim 7 and a system similar to that of claim 17. Wang, Li and AAPA, however, do not explicitly teach: providing the first output (i.e. the output of the teacher block/neighborhood) to the head model; receiving a third output from the head model, wherein the third output is produced by the head model based on the first output; and comparing the third output to each of the fourth outputs of the plurality of fourth outputs to generate the second gradient corresponding to each candidate student model of the plurality of candidate student models, as is required by claims 7 and 17. 
Romero nevertheless teaches using a teacher neural network to train a student neural network, wherein the parameters of the student neural network are updated via a loss function that includes a term that compares an output of the student neural network with a ground truth, and also a term that compares the output of the student neural network with an output of the teacher neural network when given the same input (see e.g. section 2.1 “Review of Knowledge Distillation” and section 2.3 “FitNet Stage-Wise Training”). It would have been obvious to one of ordinary skill in the art, having the teachings of Wang, Li, AAPA and Romero before the effective filing date of the claimed invention, to modify the method and system taught by Wang, Li and AAPA such that the loss function also includes a term that compares the output of the student neural network with an output of the teacher neural network when given the same input, as is taught by Romero. Using such a loss function for each of the candidate student models would thus entail: providing the output of the teacher block/neighborhood (i.e. the first output) to the head model; receiving a third output from the head model (i.e. from the neural network), wherein the third output is produced by the head model based on the first output; and applying the loss function to compare the third output with each of the fourth outputs of the plurality of fourth outputs to generate the second gradient corresponding to each candidate student model of the plurality of candidate student models. It would have been advantageous to one of ordinary skill to utilize such a combination because it would enable the student neural network to further learn from the teacher neural network, as is evident from Romero (see e.g. section 2.1 “Review of Knowledge Distillation”). Accordingly, Wang, Li, AAPA and Romero are considered to teach, to one of ordinary skill in the art, a method like that of claim 7 and a system like that of claim 17.
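The data flow discussed for claims 7 and 17 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not code from Wang, Li, Romero or the application: the linear-plus-tanh blocks, the softmax head, the squared-error comparison, and every name (`block`, `head_model`, `head_loss`, `candidates`) are hypothetical and chosen only to make the first, second, third and fourth outputs concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

def block(weights, x):
    """Hypothetical stand-in for a network block: linear map followed by tanh."""
    return np.tanh(x @ weights)

def head_model(weights, h):
    """Hypothetical head: the layers that follow the neighborhood, ending in softmax."""
    logits = h @ weights
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

x = rng.normal(size=(8, 4))              # input to the given neighborhood
w_teacher = rng.normal(size=(4, 4))      # the given neighborhood (teacher block)
w_head = rng.normal(size=(4, 3))         # frozen copy of the layers after the neighborhood
candidates = [rng.normal(size=(4, 4)) for _ in range(3)]  # candidate student models

first_output = block(w_teacher, x)               # produced by the neighborhood
third_output = head_model(w_head, first_output)  # head applied to the first output

def head_loss(w_student):
    """Compare the third output with one candidate's fourth output.

    Squared error is an assumption made for this sketch. In training, the
    derivative of this value with respect to w_student would serve as the
    second training gradient for that candidate.
    """
    second_output = block(w_student, x)                # produced by the candidate
    fourth_output = head_model(w_head, second_output)  # head applied to the second output
    return float(np.mean((third_output - fourth_output) ** 2))

head_losses = [head_loss(w) for w in candidates]
print(head_losses)
```

A candidate whose weights equal the teacher block's weights reproduces the third output exactly, so its head loss is zero; any other candidate yields a non-negative value.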
As per claims 8 and 18, it would have been obvious, as is described above, to modify the method and system taught by Wang so as to generate a plurality of candidate student models for each block (i.e. neighborhood) of the teacher neural network (i.e. first neural network) like taught by Li, wherein each of the plurality of candidate student models is trained and a top candidate model is selected from the plurality of candidate models for each block. Li suggests that the selected model (i.e. the top candidate model) is identified based at least in part on a comparison of a measurement (i.e. an “evaluation loss”) of how closely each candidate student model of the plurality of candidate student models approximates the output of the given neighborhood: Evaluation. In our method, we aim to imitate the behavior of the teacher in every block. Thus, we estimate the learning ability of a student sub-model by its evaluation loss in each block. Our block-wise search make it possible to evaluate all the partial models (about $10^4$ in each cell). To accelerate this process, we forward-propagate a batch of input node by node in a manner similar to deep first search, with intermediate output of each node saved and reused by subsequent nodes to avoid recalculating it from the beginning. The feature sharing evaluation algorithm is outlined in Algorithm 1. By evaluating all cells in a block of the supernet, we can get the evaluation loss of all possible paths in one block. We can easily sort this list with about $10^4$ elements in a few seconds with a single CPU. After this, we can select the top-1 partial model from every block to assemble a best student. However, we still need to find efficient models under different constraints to meet the needs of real-life applications. (Section 3.4 “Search for the Best Student Under Constraint”; emphasis added).
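The selection step in the excerpt (score every candidate in a block by its evaluation loss, sort, keep the top-1) can be illustrated with a short sketch. The function name, candidate identifiers and loss values below are hypothetical and do not come from Li.

```python
def select_top_candidate(evaluation_losses):
    """Return the candidate with the lowest evaluation loss for one block.

    `evaluation_losses` maps a candidate identifier to its loss against the
    teacher block's output; a lower loss means a closer approximation.
    """
    ranked = sorted(evaluation_losses.items(), key=lambda item: item[1])
    return ranked[0][0]

# Hypothetical per-block losses; identifiers and values are illustrative only.
blocks = [
    {"cand_a": 0.31, "cand_b": 0.12, "cand_c": 0.27},
    {"cand_a": 0.08, "cand_b": 0.19, "cand_c": 0.11},
]

# Assemble a student by taking the top-1 candidate from every block.
best_student = [select_top_candidate(losses) for losses in blocks]
print(best_student)  # ['cand_b', 'cand_a']
```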
Accordingly, the above-described combination of Wang, Li, AAPA and Romero is further considered to teach a method like that of claim 8 and a system like that of claim 18. As per claims 10 and 20, Wang teaches that the input (i.e. the input to the teacher block/neighborhood and to the candidate student block/model) comprises an output received from a neighborhood (i.e. block) preceding the given neighborhood (see e.g. section 2.2 “Progressive Blockwise Learning;” equation (11) therein demonstrates that the input to teacher block $t_k$ and student block $s_k$ comprises the output of student block $s_{k-1}$). Li provides a similar teaching (see e.g. section 3.2 “Block-wise Supervision with Distilled Architecture Knowledge,” which states “[s]pecifically, for each block, we use the output $Y_{i-1}$ of the (i − 1)-th block of the teacher model as the input of the i-th block of the supernet.”). Accordingly, the above-described combination of Wang, Li, AAPA and Romero is further considered to teach a method like that of claim 10 and a system like that of claim 20. Claims 9 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Wang, Li, AAPA and Romero, which is described above, and also over U.S. Patent Application Publication No. 2020/0184333 to Oh (“Oh”). Regarding claims 9 and 19, Wang, Li, AAPA and Romero teach a method like that of claim 8 and a system like that of claim 18, as are described above, and which entail identifying a selected model based at least in part on a comparison of a measurement of how closely each candidate student model (i.e. student neural network block) of a plurality of candidate student models approximates an output of a given neighborhood (i.e. a teacher neural network block).
Wang teaches that a measurement of how closely a candidate student model approximates the output of a given neighborhood can be based at least in part on an error between an output of the given neighborhood based on a given input and an output of the candidate student model based on the given input (see e.g. section 2.2 “Progressive Blockwise Learning,” which describes a loss function that has a first term based on a difference, i.e. an error, between an output of the teacher neural network block and an output of the student neural network block when given the same input). Li provides a similar teaching (see e.g. section 3.2 “Block-wise Supervision with Distilled Architecture Knowledge,” which describes a loss function similar to that taught by Wang). Accordingly, Wang, Li, AAPA and Romero are further considered to teach a method similar to that of claim 9 and a system similar to that of claim 19. Wang, Li, AAPA and Romero, however, do not explicitly disclose that the error is a mean squared error like required by claims 9 and 19. Oh nevertheless generally teaches using a mean squared error as a measurement of how closely a neural network approximates a correct output (see e.g. paragraphs 0078-0079). It would have been obvious to one of ordinary skill in the art, having the teachings of Wang, Li, AAPA, Romero and Oh before the effective filing date of the claimed invention, to modify the method and system taught by Wang, Li, AAPA and Romero such that the determined error (i.e. the error between the output of the given neighborhood and the output of each candidate student model) is a mean squared error like taught by Oh. It would have been advantageous to one of ordinary skill to utilize such a mean squared error because it can provide a good indication of accuracy, as is evident from Oh (see e.g. paragraphs 0078-0079).
Accordingly, Wang, Li, AAPA, Romero and Oh are considered to teach, to one of ordinary skill in the art, a method like that of claim 9 and a system like that of claim 19.

Response to Arguments

The Examiner acknowledges the Applicant’s amendments to claims 1 and 11. The Applicant’s arguments concerning the 35 U.S.C. § 103 rejections presented in the previous Office Action have been considered, but are moot in view of the new grounds of rejection presented above, which are required in response to the Applicant’s amendments.

Conclusion

The prior art made of record on form PTO-892 and not relied upon is considered pertinent to applicant’s disclosure. The applicant is required under 37 C.F.R. §1.111(C) to consider these references fully when responding to this action. In particular, the article by Turner et al. cited therein (“HAKD: Hardware Aware Knowledge Distillation”) generally teaches using empirical observations of hardware behavior to design efficient student networks which are trained with knowledge distillation. The article by Zhang et al. cited therein (“Fast Hardware-aware Neural Architecture Search”) describes a Neural Architecture Search that efficiently generates tailored models for different types of hardware.

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.
In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. Any inquiry concerning this communication or earlier communications from the examiner should be directed to BLAINE T BASOM whose telephone number is (571)272-4044. The examiner can normally be reached Monday-Friday, 9:00 am - 5:30 pm, EST. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Matt Ell can be reached at (571)270-3264. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). 
If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /BTB/ 1/21/2026 /MATTHEW ELL/Supervisory Patent Examiner, Art Unit 2141

Prosecution Timeline

Sep 10, 2025: Examiner Interview Summary
Sep 25, 2025: Response Filed
Jan 26, 2026: Final Rejection mailed (§103)
Mar 19, 2026: Examiner Interview Summary
Mar 19, 2026: Applicant Interview (Telephonic)
Mar 25, 2026: Response after Non-Final Action
Apr 01, 2026: Request for Continued Examination
Apr 02, 2026: Response after Non-Final Action

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12632794
METHOD AND SYSTEM FOR CROSS-CHAIN CONSENSUS ORIENTED TO FEDERATED LEARNING
4y 5m to grant Granted May 19, 2026
Patent 12608647
MULTIMODAL DATA INFERENCE
3y 10m to grant Granted Apr 21, 2026
Patent 12566981
METHOD AND SYSTEM FOR EVENT PREDICTION BASED ON TIME-DOMAIN BOOTSTRAPPED MODELS
4y 9m to grant Granted Mar 03, 2026
Patent 12487727
Sensory Adjustment Mechanism
5y 8m to grant Granted Dec 02, 2025
Patent 12443420
Automatic Image Conversion
3y 8m to grant Granted Oct 14, 2025
Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

Expected OA Rounds: 2-3
Grant Probability: 43%
With Interview (+22.7%): 66%
Median Time to Grant: 4y 6m (~9m remaining)
PTA Risk: Moderate
Based on 326 resolved cases by this examiner. Grant probability derived from career allowance rate.
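
The headline figures can be reproduced from the counts shown on this page. The sketch below assumes, as the note above states, that grant probability is the career allowance rate (140 granted out of 326 resolved). It further assumes that the with-interview figure is that rate plus the interview lift; this second assumption is inferred from the numbers and is not stated by the page.

```python
granted, resolved = 140, 326   # examiner career counts shown on this page
interview_lift = 22.7          # percentage points, as shown on this page

allowance_rate = 100 * granted / resolved
with_interview = allowance_rate + interview_lift   # assumed to be additive

print(round(allowance_rate))   # 43
print(round(with_interview))   # 66
```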
