Prosecution Insights
Last updated: April 19, 2026
Application No. 18/170,888

DISTILLATION OF DEEP ENSEMBLES

Status: Non-Final OA (§103)
Filed: Feb 17, 2023
Examiner: HAN, BYUNGKWON
Art Unit: 2121
Tech Center: 2100 — Computer Architecture & Software
Assignee: GE Precision Healthcare LLC
OA Round: 1 (Non-Final)

Grant Probability: 0% (At Risk)
OA Rounds: 1-2
To Grant: 3y 3m
With Interview: 0%

Examiner Intelligence

Grants only 0% of cases.
Career Allow Rate: 0% (0 granted / 1 resolved; -55.0% vs TC avg)
Interview Lift: +0.0% (minimal; based on resolved cases with interview)
Typical Timeline: 3y 3m avg prosecution; 28 currently pending
Career History: 29 total applications across all art units

Statute-Specific Performance

§101: 34.7% (-5.3% vs TC avg)
§103: 44.0% (+4.0% vs TC avg)
§102: 2.0% (-38.0% vs TC avg)
§112: 19.3% (-20.7% vs TC avg)

Tech Center averages are estimates • Based on career data from 1 resolved case

Office Action

§103
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Status of Claims

Claims 1 – 20 are pending and examined herein. Claims 1 – 20 are rejected under 35 U.S.C. 103.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 1 – 2, 4, 8, 9 – 10, 12, 16 are rejected under 35 U.S.C. 103 as being unpatentable over Shamir et al. (U.S. Pub. 2021/0158156 A1) in view of Walawalkar et al. (NPL: “Online Ensemble Model Compression using Knowledge Distillation”).

Regarding Claim 1, Shamir teaches A system, comprising: a processor that executes computer-executable components stored in a non-transitory computer-readable memory, the computer-executable components comprising: an access component that accesses a deep learning ensemble configured to perform an inferencing task; ([0006] of Shamir states “The computing system includes one or more processors and one or more non-transitory computer-readable media that collectively store: an ensemble that comprises a plurality of neural networks;…. the operations include: accessing, by the computing system, one or more training examples; processing, by the computing system, each of the one or more training examples with the ensemble to obtain an ensemble output from the ensemble;” in which the ensemble is executed to produce inferencing output.)

Shamir does not explicitly teach and a distillation component that iteratively distills the deep learning ensemble into a smaller deep learning ensemble configured to perform the inferencing task, wherein a current distillation iteration involves training a new neural network of the smaller deep learning ensemble via a loss function that is based on one or more neural networks of the smaller deep learning ensemble which were trained during one or more previous distillation iterations.
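As an illustrative aside for readers of these insights: the iterative scheme recited in this limitation (and detailed further in claims 2 and 4) can be sketched in Python. All names, weightings, and the particular distance below are hypothetical illustrations, not taken from the application or the cited references.

```python
import numpy as np

# Hypothetical sketch of the recited loss: a new student is trained against
# (1) ground truth, (2) the teacher ensemble's output, and (3) a term based on
# students trained during previous distillation iterations -- here, reciprocal
# distances between hidden feature maps, which penalizes the new student for
# duplicating earlier students.
def mse(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.mean((a - b) ** 2))

def distillation_loss(student_out, ensemble_out, ground_truth,
                      student_feats, prev_students_feats,
                      alpha=1.0, beta=1.0, gamma=0.1, eps=1e-8):
    supervised = mse(student_out, ground_truth)      # first term
    distilled = mse(student_out, ensemble_out)       # second term
    diversity = sum(1.0 / (mse(student_feats, f) + eps)
                    for f in prev_students_feats)    # third term
    return alpha * supervised + beta * distilled + gamma * diversity
```

On this reading, minimizing the third term pushes the new student's hidden features away from those of previously trained students, which is one way a smaller ensemble could retain diversity across its members.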
However, Walawalkar teaches and a distillation component that iteratively distills the deep learning ensemble into a smaller deep learning ensemble configured to perform the inferencing task, wherein a current distillation iteration involves training a new neural network of the smaller deep learning ensemble via a loss function that is based on one or more neural networks of the smaller deep learning ensemble which were trained during one or more previous distillation iterations. (Pg. 3 of Walawalkar states “We present a framework which enables multiple student model training using knowledge transfer from an ensemble teacher. Each of these student models represents a version of the original model compressed to different degrees. Knowledge is distilled onto each of the compressed student through the ensemble teacher and also through intermediate representations from a pseudo teacher.” Pg. 6 of Walawalkar states “During inference stage, any of the individual student models can be selected from the ensemble depending on the computational hardware constraints. In case of lenient constraints, the entire ensemble can be used with the ensemble teacher providing inference based on the learnt ensemble knowledge… The loss in every student’s representational capacity due to compression is countered, by making each student block try and learn the intermediate feature map of its corresponding pseudo teacher block. The feature map pairs are compared using traditional Mean Squared Error loss, on which the network is trained to reduce any differences between them.” Pg. 5 of Walawalkar states “For every successive student branch the number of channels in each layer of its blocks is reduced by a certain ratio with respect to the pseudo teacher. This ratio becomes higher for every new student branch created. For example, for a four student ensemble and C being number of channels in pseudo teacher, the channels in other three students are assigned to be 0.75C, 0.5C and 0.25C. The students are compressed versions of the original model to varying degrees, which still manage to maintain the original network depth. The channels in the common base block are kept the same as original whose main purpose is to provide constant low level features to all the student branches.” Walawalkar teaches creating multiple student branches (an ensemble of students), progressively compressing them, and selecting students for deployment under hardware constraints, akin to distilling into a smaller ensemble.)

It would have been obvious to one with ordinary skill in the art before the effective filing date of the invention to combine the teachings of Shamir and Walawalkar. Shamir teaches distilling a deep learning ensemble into a smaller neural network for inference using a distillation loss based on ensemble outputs and training data. Walawalkar teaches training student networks from ensemble teachers using a loss that includes a supervised term based on ground truth labels and a distillation term based on teacher outputs. One with ordinary skill in the art would be motivated to incorporate the teachings of Walawalkar into Shamir to improve the accuracy of the distilled model while maintaining the benefit of reduced model size and deployment efficiency. The combination would have been a predictable use of known distillation techniques to achieve a smaller model while preserving the benefits of the ensemble.

Regarding claim 2, the rejection of claim 1 is incorporated herein. Furthermore, the combination of Shamir and Walawalkar teaches wherein the access component accesses a training dataset, and wherein, during the current distillation iteration, the distillation component: ([0048] of Shamir states “Specifically, as illustrated in FIG. 1A, a set of training data 162 can include a number of training pairs. Each training pair can include a training example 12 and a ground truth label 14 associated with the training example 12. In the illustrated scheme, the training example 12, is supplied as input to the single neural network 30 and also to each component model 22 a-c of the ensemble 20.”)

initializes trainable internal parameters of the new neural network; (Pg. 7 of Walawalkar states “This loss procedure makes the pseudo teacher learn alongside the compressed students as the framework doesn’t use pretrained weights of any sort. It also helps the ensemble teacher gain richer knowledge of the dataset as it incorporates combination of every student’s learnt knowledge.” Training without pretrained weights, under the broadest reasonable interpretation, implies that the new network’s trainable internal parameters are initialized for training.)

selects, from the training dataset, one or more training data candidates and one or more ground-truth annotations corresponding to the one or more training data candidates; ([0048] of Shamir states “Specifically, as illustrated in FIG. 1A, a set of training data 162 can include a number of training pairs. Each training pair can include a training example 12 and a ground truth label 14 associated with the training example 12. In the illustrated scheme, the training example 12, is supplied as input to the single neural network 30 and also to each component model 22 a-c of the ensemble 20.” Shamir pairs training examples with ground truth labels (i.e., annotations).)

executes the new neural network on the one or more training data candidates, thereby yielding one or more first inferencing task outputs; ([0050] of Shamir states “Similarly, the single neural network 30 can process the training example 12 to generate a network output 34 based on the training example 12. In some implementations, the network output 34 can be a single output for all “labels” (e.g., true, and distilled) or multiple outputs, where each output is matched to one or more of the available labels. Further discussion in this regard is provided with reference to FIG. 1B.”)

executes the deep learning ensemble on the one or more training data candidates, thereby yielding one or more second inferencing task outputs; ([0048] of Shamir states “Each training pair can include a training example 12 and a ground truth label 14 associated with the training example 12. In the illustrated scheme, the training example 12, is supplied as input to the single neural network 30 and also to each component model 22 a-c of the ensemble 20. In some implementations, an entirety of the features of the training example 12 are supplied to each component model 22 a-c. In other implementations, to increase diversity within the ensemble 20, different overlapping or non-overlapping subsets of the features of the training example 12 can be respectively supplied to the different component models 22 a-c of the ensemble 20.” [0050] of Shamir states “The ensemble 20 can process the training example 12 to generate an ensemble output 24 based on the training example 12. For example, the ensemble output 24 can be an aggregate (e.g., average, majority vote, etc.) of the respective outputs of the component models 22 a-c.”)

updates, via backpropagation, the trainable internal parameters of the new neural network based on the loss function, ([0055] of Shamir states “The distillation loss 42 (e.g., in combination with the supervised loss 44) can be used to train the single neural network 30. For example, the distillation loss 42 (e.g., in combination with the supervised loss 44) can be backpropagated through the single neural network 30 to update the values of the parameters (e.g., weights) of the single neural network 30.”)

wherein the loss function includes a first term that quantifies errors between the one or more first inferencing task outputs and the one or more ground-truth annotations, ([0051] of Shamir states “As is generally performed for supervised learning, a supervised loss 44 can be generated based on a comparison of the network output 34 and the ground truth label 14. Thus, the supervised loss 44 can penalize differences between the network output 34 and the ground truth label 14. Or, it can be defined as a difference based loss between values computed by the teacher and the respective ones computed by the student, including logit values in the top head.”)

wherein the loss function includes a second term that quantifies errors between the one or more first inferencing task outputs and the one or more second inferencing task outputs, ([0053] of Shamir states “According to an aspect of the present disclosure, in addition or alternatively to the supervised loss 44, a distillation loss 42 can be generated based on a comparison of the network output 34 and the ensemble output 24. The distillation loss 42 can be computed for the final predictions of the network 30 and the ensemble 20 or can be computed at an earlier stage (e.g., in the logit space). Thus, the distillation loss 42 can penalize differences between the network output 34 and the ensemble output 24.”)

and wherein the loss function includes a third term that quantifies similarities between the new neural network and the one or more neural networks of the smaller deep learning ensemble which were trained during the one or more previous distillation iterations; (Pg. 6 of Walawalkar states “The loss in every student’s representational capacity due to compression is countered, by making each student block try and learn the intermediate feature map of its corresponding pseudo teacher block. The feature map pairs are compared using traditional Mean Squared Error loss, on which the network is trained to reduce any differences between them.”)

and repeats respective above acts until a training termination criterion associated with the new neural network is satisfied. (Pg. 11 of Walawalkar states “For fair comparison the ensemble and each of the baseline students are trained for the same number epochs on both datasets.”)

Regarding claim 4, the rejection of claim 2 is incorporated herein. Furthermore, the combination of Shamir and Walawalkar teaches the third term of the loss function is based on reciprocals of distances between: hidden feature maps produced by the new neural network; and hidden feature maps produced by the one or more neural networks of the smaller deep learning ensemble which were trained during the one or more previous distillation iterations. (Pg. 6 of Walawalkar states “The loss in every student’s representational capacity due to compression is countered, by making each student block try and learn the intermediate feature map of its corresponding pseudo teacher block. The feature map pairs are compared using traditional Mean Squared Error loss, on which the network is trained to reduce any differences between them. Since the number of feature map channels varies across every corresponding student and pseudo teacher block, an adaptation layer consisting of pointwise convolution (1×1 kernel) is used to map compressed student block channels to its pseudo teacher counterpart” MSE is a distance between feature maps; using the reciprocal of a distance is a transform of the taught distance measure.)

Regarding claim 8, the rejection of claim 1 is incorporated herein.
Furthermore, the combination of Shamir and Walawalkar teaches the deep learning ensemble serves as a group of network heads for a common backbone network, and wherein the smaller deep learning ensemble replaces the deep learning ensemble as the group of network heads for the common backbone network. (Pg. 5 of Walawalkar states “First, the entire architecture of a given neural network is broken down into a series of layer blocks, ideally into four blocks. The first block is designated as a common base block and the rest of the blocks are replicated in parallel to create branches as shown in Figure 1. A single student model can be viewed as a series of base block and one of the branches on top of it.”)

Claims 9 – 10, 12, 16 recite substantially similar subject matter as claims 1 – 2, 4, 8 respectively, and are rejected with the same rationale, mutatis mutandis.

Claims 3, 11 are rejected under 35 U.S.C. 103 as being unpatentable over Shamir et al. (U.S. Pub. 2021/0158156 A1) in view of Walawalkar et al. (NPL: “Online Ensemble Model Compression using Knowledge Distillation”), further in view of Tang et al. (NPL: “Understanding and Improving Knowledge Distillation”).

Regarding claim 3, the rejection of claim 2 is incorporated herein. Furthermore, the combination of Shamir and Walawalkar does not explicitly teach wherein the third term of the loss function is based on cosine similarities between: the trainable internal parameters of the new neural network; and trainable internal parameters of the one or more neural networks of the smaller deep learning ensemble which were trained during the one or more previous distillation iterations.

However, Tang teaches that wherein the third term of the loss function is based on cosine similarities between: the trainable internal parameters of the new neural network; and trainable internal parameters of the one or more neural networks of the smaller deep learning ensemble which were trained during the one or more previous distillation iterations. (Pg. 4 of Tang states “In (d), we show cosine similarities computed from the weights of the final logits layer.” Pg. 5 of Tang states “Thus, we create a distribution ρ_sim as the softmax over cosine similarity of the weights: ρ_sim = softmax(ŵ_t Ŵ^⊤), where Ŵ ∈ R^(K×d) is the l2-normalized logit layer weight matrix, and ŵ_t = w_t/‖w_t‖ is the t-th row of Ŵ corresponding to the ground truth.”)

It would have been obvious to one with ordinary skill in the art before the effective filing date of the invention to combine the teachings of Shamir, Walawalkar, and Tang. Shamir teaches distilling a deep learning ensemble into a smaller neural network for inference using a distillation loss based on ensemble outputs and training data. Walawalkar teaches training student networks from ensemble teachers using a loss that includes a supervised term based on ground truth labels and a distillation term based on teacher outputs. Tang teaches using cosine similarity as a meaningful similarity measure between learned vectors. One with ordinary skill in the art would be motivated to incorporate the teachings of Tang into the combination of Shamir and Walawalkar because cosine similarity is a known similarity measure that is computationally efficient for high-dimensional representations. The combination would have been a predictable way to improve control over the similarity between newly and previously trained student networks.

Claim 11 recites substantially similar subject matter as claim 3, and is rejected with the same rationale, mutatis mutandis.

Claims 5, 6, 13, 17, 18, 20 are rejected under 35 U.S.C. 103 as being unpatentable over Shamir et al. (U.S. Pub. 2021/0158156 A1) in view of Walawalkar et al. (NPL: “Online Ensemble Model Compression using Knowledge Distillation”), further in view of Noothout et al. (NPL: “Knowledge distillation with ensembles of convolutional neural networks for medical image segmentation”).

Regarding claim 5, the rejection of claim 2 is incorporated herein. Furthermore, the combination of Shamir and Walawalkar teaches in response to the training termination criterion being satisfied by the new neural network, ([0029] of Shamir states “Training of neural networks is typically done by visiting all the labeled training examples, possibly iterating through the examples multiple times (epochs).”) commences a next distillation iteration, (Pg. 5 of Walawalkar states “For every successive student branch the number of channels in each layer of its blocks is reduced by a certain ratio with respect to the pseudo teacher. This ratio becomes higher for every new student branch created.”)

The combination of Shamir and Walawalkar does not explicitly teach computes a current performance metric of the smaller deep learning ensemble; compares the current performance metric to a previous performance metric of the smaller deep learning ensemble that was computed during a previous distillation iteration; , in response to the current performance metric differing from the previous performance metric by more than a threshold margin; and determines that the smaller deep learning ensemble is complete, in response to the current performance metric differing from the previous performance metric by less than the threshold margin.

However, Noothout teaches that computes a current performance metric of the smaller deep learning ensemble; (Pg. 6 of Noothout states “Evaluation of each separate network in an ensemble, the ensembles, and the distilled networks was performed by computing two different evaluation metrics. For each foreground class separately, the Dice coefficient was computed to evaluate the volume overlap between predicted and reference segmentations”)

compares the current performance metric to a previous performance metric of the smaller deep learning ensemble that was computed during a previous distillation iteration; (Pg. 11 of Noothout states “For each ensemble, the network architecture that obtained the highest Dice coefficients and lowest distance errors, averaged over all classes, was used as the preferred network architecture for the distilled network (indicated in bold letters in Fig. 4). This enables a direct comparison between the distilled network and the best performing network present in the ensemble. For each dataset, two distilled networks were trained: one from the diverse ensemble and one from the uniform ensemble. These networks were trained with the combined loss function that includes soft and hard labels [Eq. (1)] and the network parameters were initialized with those of the best performing network in the ensemble.”)

, in response to the current performance metric differing from the previous performance metric by more than a threshold margin; and determines that the smaller deep learning ensemble is complete, in response to the current performance metric differing from the previous performance metric by less than the threshold margin. (Pg. 13 of Noothout states “The results show that our ensembles and distilled networks obtain Dice coefficients and HDs which were close to the results of the best performing method. Overall, our diverse ensemble and distilled network obtain slightly better results compared to their uniform counterparts.
For segmentation of the aorta, heart, trachea, and esophagus differences in Dice coefficients between the distilled network from the diverse ensemble and the best performing method were <0.01, while differences in HDs were below 0.10 mm, which is less than a voxel in size.” Noothout gives an explicit metric difference compared against a numeric margin; when the difference is less than a threshold margin such as 0.01, the distillation can be regarded as complete because the distilled network matches the best performing method.)

It would have been obvious to one with ordinary skill in the art before the effective filing date of the invention to combine the teachings of Shamir, Walawalkar, and Noothout. Shamir teaches distilling a deep learning ensemble into a smaller neural network for inference using a distillation loss based on ensemble outputs and training data. Walawalkar teaches training student networks from ensemble teachers using a loss that includes a supervised term based on ground truth labels and a distillation term based on teacher outputs. Noothout teaches using knowledge distillation in medical imaging to produce smaller models that are faster and easier to deploy without sacrificing performance. One with ordinary skill in the art would be motivated to incorporate the teachings of Noothout into the combination of Shamir and Walawalkar to obtain a smaller-footprint distilled model suitable for practical deployment and to terminate further distillation when additional iterations no longer provide improvements. The combination would have predictably improved the performance of the deployable distilled model while reducing inference cost.

Regarding claim 6, the rejection of claim 5 is incorporated herein. Furthermore, the combination of Shamir, Walawalkar, and Noothout teaches an execution component that deploys the smaller deep learning ensemble, in response to the distillation component determining that the smaller deep learning ensemble is complete. ([0022] of Shamir states “After training, the single neural network can be deployed to generate inferences. In such fashion, the single neural model can provide a superior prediction accuracy while, during training, the ensemble can serve to influence the single neural network to be more reproducible. In another example, both accuracy and reproducibility can be improved by distilling a combination of the wide model and the ensemble into a limited resources narrow model which is then deployed.”)

Regarding claim 7, the rejection of claim 1 is incorporated herein. Furthermore, the combination of Shamir, Walawalkar, and Noothout teaches each neural network of the smaller deep learning ensemble exhibits a smaller footprint than each neural network of the deep learning ensemble. (Pg. 12 of Noothout states “Distilled networks contained 4 to 89 times fewer trainable parameters, needed 4 to 10 times less GPU memory at inference, and were 5 to 8 times faster compared to the ensembles.”)

Regarding claim 17, the combination of Shamir, Walawalkar, and Noothout teaches A computer program product for facilitating improved distillation of deep ensembles, the computer program product comprising a non-transitory computer-readable memory having program instructions embodied therewith, the program instructions executable by a processor to cause the processor to: access an ensemble of teacher networks hosted on a medical scanning device and a training dataset; ([0006] of Shamir states “The computing system includes one or more processors and one or more non-transitory computer-readable media that collectively store: an ensemble that comprises a plurality of neural networks;…. the operations include: accessing, by the computing system, one or more training examples; processing, by the computing system, each of the one or more training examples with the ensemble to obtain an ensemble output from the ensemble;” in which the ensemble is executed to produce inferencing output. Pg. 7 of Noothout states “The SegTHOR dataset consists of radiotherapy treatment CT scans of the chest, while the brain dataset consists of brain MRI, and the ACDC dataset consists of cardiac cine-MRI. We deliberately selected datasets that differ in image modality (CT with and without contrast enhancement, MRI, and cine-MRI) and delineated anatomy (chest, brain, and heart).” Pg. 2 of Noothout states “Knowledge distillation has also been used for a few applications in medical image analysis, such as classification,20,21 detection,22,23 and segmentation8,24–29 tasks. Knowledge distillation has been applied to obtain a smaller student network from a large teacher network such as a Google Inception V3 network for automatic detection of invasive cancer in whole slide images,22 a U-Net for automatic segmentation of neurons in microscope images24, or for liver26,27 and kidney26 segmentation in computed tomography images.”)

iteratively train a condensed ensemble of student networks based on the ensemble of teacher networks and based on the training dataset, wherein each new student network of the condensed ensemble is trained via a loss that is based on all previously-trained student networks in the condensed ensemble; (Pg. 3 of Walawalkar states “We present a framework which enables multiple student model training using knowledge transfer from an ensemble teacher. Each of these student models represents a version of the original model compressed to different degrees. Knowledge is distilled onto each of the compressed student through the ensemble teacher and also through intermediate representations from a pseudo teacher.” Pg. 6 of Walawalkar states “During inference stage, any of the individual student models can be selected from the ensemble depending on the computational hardware constraints. In case of lenient constraints, the entire ensemble can be used with the ensemble teacher providing inference based on the learnt ensemble knowledge… The loss in every student’s representational capacity due to compression is countered, by making each student block try and learn the intermediate feature map of its corresponding pseudo teacher block. The feature map pairs are compared using traditional Mean Squared Error loss, on which the network is trained to reduce any differences between them.” Pg. 5 of Walawalkar states “For every successive student branch the number of channels in each layer of its blocks is reduced by a certain ratio with respect to the pseudo teacher. This ratio becomes higher for every new student branch created. For example, for a four student ensemble and C being number of channels in pseudo teacher, the channels in other three students are assigned to be 0.75C, 0.5C and 0.25C. The students are compressed versions of the original model to varying degrees, which still manage to maintain the original network depth. The channels in the common base block are kept the same as original whose main purpose is to provide constant low level features to all the student branches.” Walawalkar teaches creating multiple student branches (an ensemble of students), progressively compressing them, and selecting students for deployment under hardware constraints, akin to distilling into a smaller ensemble.)

and replace the ensemble of teacher networks on the medical scanning device with the condensed ensemble of student networks. ([0042] of Shamir states “The present disclosure proposes techniques which obtain the benefits of both improved accuracy and improved reproducibility.
In particular, instead of training and deploying an ensemble (which, as described above, has reduced accuracy, increased computational requirements, and higher maintenance and technical complexities relative to a single model of comparable size), the present disclosure proposes to train and deploy a single neural network. However, according to an aspect of the present disclosure, the single neural network can also be trained to be more reproducible by distilling from an ensemble to the single neural network during training of the single neural network," i.e., the distilled single network is deployed rather than the original ensemble.)

Regarding claim 20, the rejection of claim 17 is incorporated herein. Furthermore, the combination of Shamir, Walawalkar, and Noothout teaches wherein, for each of the previously-trained student networks, the loss includes a distance computed between hidden activation maps of that previously-trained student network and hidden activation maps of the new student network. (Pg. 6 of Walawalkar states "The loss in every student's representational capacity due to compression is countered, by making each student block try and learn the intermediate feature map of its corresponding pseudo teacher block. The feature map pairs are compared using traditional Mean Squared Error loss, on which the network is trained to reduce any differences between them." [0052] of Shamir states "In addition to distilling at the top output level, distillation can also be performed at any level in the network, including single neuron activation values. For example, instead of distilling the top head, an implementation can apply the distillation on the values of all neurons in a hidden layer, which are distilled from an average or some computation for that layer in the ensemble 20 to a matching layer in the student single tower 30.")

Claims 13 – 15 and 18 recite substantially similar subject matter as claims 5 – 7 and 5, respectively, and are rejected with the same rationale, mutatis mutandis.

Claim 19 is rejected under 35 U.S.C. 103 as being unpatentable over Shamir et al. (U.S. Pub. 2021/0158156 A1) in view of Walawalkar et al. (NPL: "Online Ensemble Model Compression using Knowledge Distillation"), Noothout et al. (NPL: "Knowledge distillation with ensembles of convolutional neural networks for medical image segmentation"), further in view of Tang et al. (NPL: "Understanding and Improving Knowledge Distillation").

Regarding claim 19, the rejection of claim 17 is incorporated herein. Furthermore, the combination of Shamir and Walawalkar does not teach wherein, for each of the previously-trained student networks, the loss includes a cosine similarity computed between trainable internal parameters of that previously-trained student network and trainable internal parameters of the new student network. However, Tang teaches wherein, for each of the previously-trained student networks, the loss includes a cosine similarity computed between trainable internal parameters of that previously-trained student network and trainable internal parameters of the new student network. (Pg. 4 of Tang states "In (d), we show cosine similarities computed from the weights of the final logits layer." Pg. 5 of Tang states "Thus, we create a distribution ρ_sim as the softmax over cosine similarity of the weights: ρ_sim = softmax(ŵ_t Ŵᵀ), where Ŵ ∈ ℝ^(K×d) is the l2-normalized logit layer weights, and ŵ_t = w_t/‖w_t‖ is the t-th row of Ŵ corresponding to the ground truth.")

It would have been obvious to one with ordinary skill in the art before the effective filing date of the invention to combine the teachings of Shamir, Walawalkar, Noothout, and Tang. Shamir teaches distilling a deep learning ensemble into a smaller neural network for inference using a distillation loss based on ensemble outputs and training data. Walawalkar teaches training student networks from ensemble teachers using a loss that includes a supervised term based on ground truth labels and a distillation term based on teacher outputs. Noothout teaches using knowledge distillation for medical image segmentation to produce smaller models that are faster and easier to deploy without sacrificing performance. Tang teaches using cosine similarity as a meaningful similarity measure between learned vectors. One with ordinary skill in the art would have been motivated to incorporate the teachings of Tang into the combination of Shamir, Walawalkar, and Noothout, since cosine similarity is a known similarity measure that is computationally efficient for high-dimensional representations. The results of the combination would have been predictable, improving control over the similarity between newly and previously trained student networks.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to BYUNGKWON HAN whose telephone number is (571)272-5294. The examiner can normally be reached M-F: 9:00AM-6PM PST. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Li B. Zhen, can be reached at (571)272-3768. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/BYUNGKWON HAN/
Examiner, Art Unit 2121

/Li B. Zhen/
Supervisory Patent Examiner, Art Unit 2121
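For context on the technique at issue in the rejection: the output-level distillation the examiner attributes to Shamir and Walawalkar (a supervised term on ground-truth labels plus a distillation term toward averaged ensemble teacher outputs) is conventionally written as a weighted sum of cross-entropy and KL divergence. The sketch below is illustrative only, not code from any cited reference; the function names and the `alpha`/`temperature` hyperparameters are assumptions.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax; higher temperature softens the distribution.
    z = np.asarray(logits, dtype=float) / temperature
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(student_logits, ensemble_logits, label,
                      alpha=0.5, temperature=2.0):
    """Supervised cross-entropy on the ground-truth label plus KL divergence
    from the averaged ensemble ("teacher") distribution to the student."""
    p_student = softmax(student_logits, temperature)
    # Teacher signal: average of the ensemble members' softened distributions.
    p_teacher = np.mean([softmax(l, temperature) for l in ensemble_logits], axis=0)
    ce = -np.log(softmax(student_logits)[label] + 1e-12)  # supervised term
    kl = np.sum(p_teacher * (np.log(p_teacher + 1e-12)
                             - np.log(p_student + 1e-12)))  # distillation term
    return alpha * ce + (1.0 - alpha) * kl
```

When the student already matches the teacher, the distillation term vanishes and only the supervised cross-entropy remains, which is why `alpha` trades off label fitting against teacher imitation.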
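The Walawalkar passage quoted against claim 20 compares matched student/pseudo-teacher feature-map pairs with mean squared error. A minimal sketch of that hint loss, summed over matched blocks (an illustration under the quoted description, not the authors' code):

```python
import numpy as np

def feature_map_mse(student_fmap, teacher_fmap):
    # MSE between a student block's intermediate feature map and the
    # matching pseudo-teacher block's feature map.
    s = np.asarray(student_fmap, dtype=float)
    t = np.asarray(teacher_fmap, dtype=float)
    assert s.shape == t.shape, "feature maps must have matching shapes"
    return float(np.mean((s - t) ** 2))

def block_distillation_loss(student_fmaps, teacher_fmaps):
    # Total hint loss: sum the per-block MSE over all matched block pairs.
    return sum(feature_map_mse(s, t)
               for s, t in zip(student_fmaps, teacher_fmaps))
```

The same comparison supports the claimed "distance computed between hidden activation maps," since each term penalizes any difference between aligned hidden activations.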
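The Tang passage quoted against claim 19 defines ρ_sim = softmax(ŵ_t Ŵᵀ), a softmax over cosine similarities between the l2-normalized logit-layer weight rows and the ground-truth class's row. A direct transcription of that formula (a sketch from the quoted equation, not Tang's released code):

```python
import numpy as np

def rho_sim(W, t):
    # W: K x d logit-layer weight matrix; t: index of the ground-truth class.
    W = np.asarray(W, dtype=float)
    W_hat = W / np.linalg.norm(W, axis=1, keepdims=True)  # l2-normalize each row
    sims = W_hat[t] @ W_hat.T            # cosine similarity to every class row
    e = np.exp(sims - sims.max())        # numerically stable softmax
    return e / e.sum()
```

With orthonormal class rows the ground-truth class gets the largest probability and all other classes tie, since every off-diagonal cosine similarity is zero.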

Prosecution Timeline

Feb 17, 2023: Application Filed
Jan 24, 2026: Non-Final Rejection (§103)
Mar 08, 2026: Interview Requested
Mar 30, 2026: Examiner Interview Summary
Mar 30, 2026: Applicant Interview (Telephonic)


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 0%
With Interview: 0% (+0.0%)
Median Time to Grant: 3y 3m
PTA Risk: Low
Based on 1 resolved case by this examiner. Grant probability derived from career allow rate.
