Last updated: May 29, 2026

Application No. 17/554,656

OPTIMAL KNOWLEDGE DISTILLATION SCHEME

Non-Final OA §103

Filed: Dec 17, 2021

Examiner: YOUNG, KEVIN L.

Art Unit: 2194

Tech Center: 2100, Computer Architecture & Software

Assignee: Lemon Inc.

OA Round: 2 (Non-Final)

This examiner grants 46% of cases after interview

— +65.1% interview lift. A telephonic interview to clarify the technical implementation could significantly improve the outcome.

Based on 175 resolved cases, 2023–2026

Examiner Intelligence

YOUNG, KEVIN L.

Grants 46% of resolved cases

Career Allowance Rate

81 granted / 175 resolved

-8.7% vs TC avg

Strong +65% interview lift

[Chart: allowance rate without vs. with interview, for resolved cases with interview. Interview lift: +65.1%]

Typical timeline

Avg Prosecution: 3y 10m (1 currently pending)

Career history

Total Applications: 185 (across all art units)

Statute-Specific Performance

§101: 0.8% (-39.2% vs TC avg)
§103: 94.2% (+54.2% vs TC avg)
§102: 4.6% (-35.4% vs TC avg)
§112: 0.4% (-39.6% vs TC avg)

Based on career data from 175 resolved cases

Office Action

§103

DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 3-4, 6-9, and 11-20 are rejected under 35 U.S.C. 103 as being unpatentable over Qiang (CN Publication No. 112132278, submitted in IDS on 11/28/2023) in view of Chen et al. (hereinafter Chen, “Distilling Knowledge via Knowledge Review,” submitted in IDS on 03/14/2022), in further view of Xiang Li et al. (hereinafter Li, US Publication No. 20210124881A1), in further view of Keith Raniere (hereinafter Raniere, US Publication No. 20040152064A1). Claims 2 and 10 have been canceled by the applicant.

Regarding Claim 1,
	Qiang suggests: re-training the student network by performing KD from the teacher network to the student network based at least in part on an optimized importance factor. Qiang teaches: “the training images are respectively input into the first backbone network and the second backbone network for feature extraction, and a first feature map output by the first backbone network and a second feature map output by the second backbone network are obtained. Feature map; calculate model loss based on the first feature map, the second feature map and the channel weight vector; the second backbone network is updated and optimized according to the model loss to obtain a compressed image recognition model.” (Qiang, [0012]-[0014]). Qiang uses the terms “update” and “optimize” as equivalent to re-training the student (second backbone) network; the model loss on which this update is based is equivalent to an importance factor.
	Qiang fails to explicitly teach: configuring the search space comprises adding a transform block after each feature map of the student network to which knowledge is transferred from at least one feature map of the teacher network, and wherein the transform block comprises convolutional layers and an interpolation layer.

	However, Chen discloses:
Chen suggests: configuring the search space comprises adding a transform block after each feature map of the student network to which knowledge is transferred from at least one feature map of the teacher network, and wherein the transform block comprises convolutional layers and an interpolation layer. Chen teaches: “The transformation Mi, j is simply composed of convolution layers and nearest interpolation layers to transfer the ith feature of the student to match the size of the teacher’s jth feature.” (Chen, at least, page 5011, column 1, section 3.2, lines 2-5). 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Qiang with those of Chen because, as Chen states, “This residual learning process is more stable and effective than directly letting high-level features of the student learned form low-level features of the teacher” (Chen, page 5011, column 2, final paragraph).
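The transform described in the Chen passage quoted above (convolution layers followed by nearest interpolation, so that a student feature matches the size of a teacher feature) can be sketched as follows. This is a minimal illustration only: the 1x1 convolution, the function names, and the toy shapes are assumptions made for the example and are not taken from Chen, Qiang, or the application.

```python
from typing import List

# A feature map is stored as [channels][height][width] (illustrative layout).
FeatureMap = List[List[List[float]]]


def conv1x1(fmap: FeatureMap, weights: List[List[float]]) -> FeatureMap:
    """Pointwise (1x1) convolution: mixes channels at every pixel.

    weights[o][i] is the weight from input channel i to output channel o.
    """
    height, width = len(fmap[0]), len(fmap[0][0])
    return [
        [
            [sum(w * fmap[i][y][x] for i, w in enumerate(row)) for x in range(width)]
            for y in range(height)
        ]
        for row in weights
    ]


def nearest_resize(fmap: FeatureMap, out_h: int, out_w: int) -> FeatureMap:
    """Nearest-neighbour interpolation of every channel to (out_h, out_w)."""
    in_h, in_w = len(fmap[0]), len(fmap[0][0])
    return [
        [
            [channel[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
            for y in range(out_h)
        ]
        for channel in fmap
    ]


def transform_block(student_fmap: FeatureMap, weights: List[List[float]],
                    teacher_h: int, teacher_w: int) -> FeatureMap:
    """Hypothetical transform block: convolution, then nearest interpolation."""
    return nearest_resize(conv1x1(student_fmap, weights), teacher_h, teacher_w)


# Toy student feature: 1 channel, 2x2. Toy teacher feature size: 2 channels, 4x4.
student = [[[1.0, 2.0], [3.0, 4.0]]]
weights = [[1.0], [0.5]]  # maps 1 input channel to 2 output channels
out = transform_block(student, weights, 4, 4)
```

After the call, `out` has the teacher's channel count and spatial size, which is the precondition for comparing the two feature maps in a distillation loss.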

Qiang and Chen fail to explicitly teach: configuring a search space by establishing a plurality of pathways for transferring information from a teacher network [[and]] to a student network; assigning an importance factor to each of the plurality of pathways; and, training the student network to search for the optimal KD scheme, wherein the importance factor and parameters of the student network are updated during training of the network.

However, Li discloses:
	Li suggests: configuring a search space by establishing a plurality of pathways for transferring information from a teacher network [[and]] to a student network and assigning an importance factor to each of the plurality of pathways. Li teaches: “The technical solutions provided by the embodiments of the present disclosure can include the following beneficial effects. By additionally introducing a plurality of intermediate teacher models of less number of parameters, the training of the student models can be guided in multiple paths, and thus the knowledge of the original teacher model can be gradually transferred to the student models more effectively, and a student model with the best quality can be selected from the multiple student models generated based on the multiple paths as a final target student model, which improves the quality of the student models” (Li, [0011]). It should be noted that Li does not specifically teach bi-directional pathways; this is addressed below. Li is clearly teaching a plurality of pathways from a teacher to a student in a search space designed to select the student model with the “best quality”, wherein the term “best quality” is synonymous with the term “importance factor” used in the claim limitation.
	Li further suggests: training the student network to search for the optimal KD scheme, wherein the importance factor and parameters of the student network are updated during training of the network. Li teaches: “In an embodiment, the training sub-unit is further configured to train the corresponding candidate student model by using the source data as the input and the pseudo target data output by the original teacher model that has been trained as the verification data for the corresponding candidate student model when the training path starts from the original teacher model and directly arrives at the corresponding candidate student model, or to train the respective intermediate teacher models by using the source data as the input and the pseudo target data output by a preceding adjacent complex teacher model on the training path as the verification data and train the corresponding candidate student model by using the pseudo target data output by a preceding intermediate teacher model adjacent to the candidate student model on the training path as the verification data when the training path starts from the original teacher model, passes the at least one intermediate teacher model and arrives at the corresponding candidate student model. The complex teacher model is the original teacher model that has been trained, or another intermediate teacher model that has been trained and has the number of model parameters that is greater than that of the intermediate teacher model currently under training.” (Li, [0059]). Wherein the intermediate teacher models are student models themselves before becoming teacher models. 
	Li further teaches “In an embodiment, the target model selection unit 150 is further configured to test accuracy of output results of the multiple candidate student models through a set of verification data, and select the target student model according to the accuracy.” (Li, [0060]). Wherein the target student model selected as one among many is synonymous with the optimal KD scheme. 
	It would have been prima facie obvious to a person having ordinary skill in the art at the time of filing to modify the teachings of Qiang and Chen with those of Li with a reasonable expectation of success, because training multiple students along multiple paths is more efficient than training them one by one and comparing the results afterward.
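The selection step that the action attributes to Li ([0060]: test the accuracy of multiple candidate student models on verification data and select the target student accordingly) can be illustrated with a short sketch. The path names, the toy threshold "models", and the data below are hypothetical and serve only to show the selection logic.

```python
from typing import Callable, Dict, Sequence, Tuple

# A candidate student is modelled as a function from an input to a class label.
Model = Callable[[float], int]
Sample = Tuple[float, int]  # (input, ground-truth label)


def accuracy(model: Model, data: Sequence[Sample]) -> float:
    """Fraction of verification samples the model labels correctly."""
    correct = sum(1 for x, label in data if model(x) == label)
    return correct / len(data)


def select_target_student(candidates: Dict[str, Model],
                          verification_data: Sequence[Sample]) -> Tuple[str, float]:
    """Return the training path whose student scores highest, with its accuracy."""
    scores = {path: accuracy(model, verification_data)
              for path, model in candidates.items()}
    best_path = max(scores, key=scores.get)
    return best_path, scores[best_path]


# Hypothetical verification set and two students trained along different paths.
verification_data = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
candidates = {
    "teacher->student": lambda x: int(x > 0.3),
    "teacher->intermediate->student": lambda x: int(x > 0.5),
}
best_path, best_acc = select_target_student(candidates, verification_data)
```

Note that this selects among fully trained candidates after the fact; it does not by itself update a per-pathway factor during training, which is the distinction the claim language draws.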

	Qiang, Chen, and Li do not teach bi-directional pathways for transferring information from a teacher network [[and]] to a student network.
	However, Raniere suggests: bi-directional pathways for transferring information from a teacher network [[and]] to a student network. Raniere teaches “FIG. 1 shows a communication path 24 between the teacher 30 and student 31, and a communication path 26 between the teacher 30 and student 32. The communication path 24 is shown in FIG. 1 to point from the teacher 30 to the student 31 and also from the student 31 to the teacher 30, which indicates a bidirectional communication.” (Raniere, [0038]).
	It would have been prima facie obvious to a person having ordinary skill in the art to modify the teachings of Qiang, Chen, and Li with those of Raniere with a reasonable expectation of success. By definition, pathways allow information to flow in both directions.
	

Regarding Claim 3, Chen teaches:
	The method of claim 1, wherein the searching the optimal KD scheme further comprises: training the student model on a training dataset with a training loss encoding a supervision from ground truth label information and the teacher network; and evaluating the trained student on a validation dataset, wherein a validation loss only measures a difference between an output of the student network and the ground truth label information.
Training datasets are common knowledge within the art and are frequently used for training purposes, hence the term training dataset. Supervision is also done through what are called logits (Chen, page 5008, Introduction, paragraph 3, “Knowledge distillation is first proposed in [9]. The process is to train a small network (also known as the student) under the supervision of a larger network (a.k.a. the teacher). In [9], knowledge is distilled though the teacher’s logit, which means the student is supervised by both ground truth labels and teacher’s logits.”). Validation loss calculations are a method of evaluating the effectiveness of the training. Anyone with ordinary skill in the art would know to test the student model with a validation dataset and calculate the difference between the output of the student and the ground truth label information. 
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Qiang with those of Chen to obtain the benefit Chen reports: “we achieve better performance in many computer vision tasks.” (Chen, page 5009, final paragraph of the introduction). Earlier in the introduction, Chen covers the prior art and the improvements obtained from the changes made and tested.
Claims 11 and 17 are equivalent to claim 3 but for the inclusion of generic computer components. Thus, they are rejected under the same merits. 
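The distinction drawn in claim 3 between the training loss (supervised by both ground truth labels and the teacher) and the validation loss (student output versus ground truth only) can be sketched as below. The specific form used here, cross-entropy plus a temperature-softened KL term mixed by a weight `alpha`, is a common logit-distillation formulation chosen as an assumption for illustration; neither the action nor the claim specifies it.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: List[float], label: int) -> float:
    """Loss against the ground-truth label only."""
    return -math.log(softmax(logits)[label])


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def training_loss(student_logits: List[float], teacher_logits: List[float],
                  label: int, alpha: float = 0.5, temperature: float = 2.0) -> float:
    """Supervision from the ground truth AND the teacher (assumed mixing form)."""
    hard = cross_entropy(student_logits, label)
    soft = kl_divergence(softmax(teacher_logits, temperature),
                         softmax(student_logits, temperature))
    return (1 - alpha) * hard + alpha * soft


def validation_loss(student_logits: List[float], label: int) -> float:
    """Measures only student output versus ground truth; no teacher term."""
    return cross_entropy(student_logits, label)


# Toy logits for a two-class problem (hypothetical values).
train = training_loss([2.0, 0.5], [3.0, 0.0], label=0)
val = validation_loss([2.0, 0.5], label=0)
```

The point of the sketch is structural: `validation_loss` takes no teacher argument at all, so it cannot be influenced by the distillation signal.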



Regarding claim 4, the rejection of claim 3 is incorporated and furthermore Qiang teaches:
	The method of claim 3, further comprising: identifying the optimal KD scheme by optimizing the importance factor, the optimized importance factor minimizing the validation loss. 
The purpose of the importance factor is to point out the scheme that performs the best, that is, the one that provides the lowest validation loss, which Qiang calls model loss (Qiang, [0087]-[0089], “the model loss is calculated based on the first feature map, the second feature map and the channel weight vector, which specifically includes the following steps: S601: Use a predefined loss function to calculate the first feature map and the second feature map to obtain the feature map loss. S602: Based on the channel weight vector, weight the feature map loss to obtain the model loss.”). Here Qiang demonstrates the optimization of the importance factor by using it to calculate a model loss.
The model loss is weighted and used to integrate the influence of the feature channel importance parameters (Qiang, [0091], “when calculating the model loss, the feature loss and the channel weight vector are weighted so that the calculation of the model loss integrates the influence of the feature channel importance parameters, so that the model can be effectively reduced while compressing the model. Compression precision loss.”). Validation loss and precision loss are used synonymously in that validation is a calculation of precision.
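Steps S601 and S602 as quoted from Qiang (compute a feature map loss with a predefined loss function, then weight it by the channel weight vector to obtain the model loss) can be sketched as follows. Qiang's "predefined loss function" is not identified in the quoted passages, so per-channel mean squared error is used here purely as an assumption, and all names and values are hypothetical.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # [channels][height][width]


def feature_map_loss(first: FeatureMap, second: FeatureMap) -> List[float]:
    """S601 (sketch): one loss value per feature channel.

    Mean squared error is an assumed stand-in for the unspecified
    'predefined loss function'.
    """
    losses = []
    for ch_a, ch_b in zip(first, second):
        diffs = [(a - b) ** 2
                 for row_a, row_b in zip(ch_a, ch_b)
                 for a, b in zip(row_a, row_b)]
        losses.append(sum(diffs) / len(diffs))
    return losses


def model_loss(first: FeatureMap, second: FeatureMap,
               channel_weights: List[float]) -> float:
    """S602 (sketch): weight the per-channel loss by the channel weight vector."""
    per_channel = feature_map_loss(first, second)
    return sum(w * loss for w, loss in zip(channel_weights, per_channel))


# Toy feature maps: 2 channels, 1x2 each (first = teacher, second = student).
first_fmap = [[[1.0, 1.0]], [[2.0, 2.0]]]
second_fmap = [[[0.0, 1.0]], [[2.0, 0.0]]]
channel_weights = [0.8, 0.2]  # higher weight = more important channel

per_channel = feature_map_loss(first_fmap, second_fmap)
loss = model_loss(first_fmap, second_fmap, channel_weights)
```

In this toy case the second channel has the larger raw error, but its lower weight means both channels contribute equally to the model loss, which is the effect of integrating channel importance into the loss.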
Claims 12 and 18 are equivalent to claim 4 but for the inclusion of generic computer components. Thus, they are rejected under the same merits. 

Regarding claim 6, Qiang teaches:
	The method of claim 1, wherein the retraining the student network based at least in part on the optimized importance factor further comprises: retraining the student network using the optimized importance factor (and an entire set of data comprising a training dataset and a validation dataset used during the process of training the student network.)
It would be known to anyone of ordinary skill in the art that the purpose of retraining a student network is to improve its precision and accuracy relative to the teacher network via updating and optimization. In the claim 4 rejection, Qiang shows the model loss as a form of an optimized importance factor (Qiang, [0018], “A channel weight calculation module, configured to calculate a channel weight vector based on the model test results; wherein the channel weight vector is used to describe the importance of the feature channel corresponding to the feature map output by the first backbone network;”).
Qiang then uses another module to retrain the student network or second backbone (Qiang, [0021], “A model update module, configured to update and optimize the second backbone network according to the model loss to obtain a compressed image recognition model.”).
Qiang fails to explicitly teach:
	Using an entire set of data comprising a training dataset and a validation dataset used during the process of training the student network. 
However, Chen teaches:
	an entire set of data comprising a training dataset and a validation dataset used during the process of training the student network
A training dataset and a validation set are used together for training and re-training a model (Chen, page 5013, column 1, section 4.1, paragraph 1, “Datasets (1) CIFAR-100 contains 50K training images with 0.5K images per class and 10K test images. (2) ImageNet [3] is the most challenging dataset for classification, which provides 1.2 million images for training and 50K images for validation over 1,000 classes.”). 
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Qiang with those of Chen, as Chen does in practice: “The models are trained on the COCO2017 training set and are evaluated on the validation set.” (Chen, page 5015, column 1, section 4.3, line 8). A validation set is used to evaluate the accuracy and performance of the trained (student) network, ensuring higher accuracy and performance as the student is trained.

Claims 13 and 19 are equivalent to claim 6 but for the inclusion of generic computer components. Thus, they are rejected under the same merits.

Regarding claim 7, the rejection of claim 1 is incorporated and furthermore Qiang teaches:
	The method of claim 6, further comprising: retraining the student network using a same importance factor for each iteration of the retraining process, wherein the same importance factor is obtained at a last iteration during the process of training the student network.
	“Use a mask layer to perform channel masking process on the same feature channel in each test feature map to obtain a third feature map corresponding to each image to be tested.” (Qiang, [0060]).
This corresponds to using the same parameter across a series of retraining iterations of the student network. While masking is commonly used to “erase” or “zero out” data, it can also be used to apply an importance factor or weight to the feature channels.
Claims 14 and 20 are equivalent to claim 7 but for the inclusion of generic computer components. Thus, they are rejected under the same merits. Claim 20 has other additional elements that are found in the rejection of claim 8 and is therefore also rejected under the merits of claim 8.

Regarding claim 8, the rejection of claim 1 is incorporated and furthermore Qiang teaches:
	The method of claim 6 further comprising: retraining the student network using different importance factors for each iteration of the retraining process, the different importance factors computed using linear interpolation.
	“The test feature map refers to the feature map corresponding to the image to be tested that is output by feature extraction of the image to be tested through the first backbone network. 
-Specifically, feature extraction is performed on each image to be tested through the first backbone network, that is, through multi-layer convolution, activation, pooling and other nonlinear transformations, a test feature map corresponding to each image to be tested is output 
-The test feature map includes multiple feature channels, and different feature channels reflect different image features,” (Qiang, [0059]).
Here Qiang is teaching the use of multiple different feature channels to test different image features. 
Claims 15 and 20 are equivalent to claim 8 but for the inclusion of generic computer components. Thus, they are rejected under the same merits. Claim 20 has additional elements that are found in the rejection of claim 7 and is therefore also rejected under the merits of claim 7.  
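For reference, the claim 8 limitation (different importance factors for each retraining iteration, computed using linear interpolation) can be illustrated with a short sketch. The claim text quoted in this action does not say which two sets of factors are interpolated between, so the `start` and `end` vectors below are assumptions, as are the function name and the values.

```python
from typing import List


def interpolated_importance(start: List[float], end: List[float],
                            num_iterations: int) -> List[List[float]]:
    """Return one importance-factor vector per retraining iteration.

    Each vector is a linear interpolation between `start` (used at the first
    iteration) and `end` (used at the last). Both endpoints are assumed.
    """
    if num_iterations < 1:
        raise ValueError("num_iterations must be at least 1")
    if num_iterations == 1:
        return [list(end)]
    schedule = []
    for step in range(num_iterations):
        t = step / (num_iterations - 1)  # goes from 0.0 to 1.0
        schedule.append([(1 - t) * s + t * e for s, e in zip(start, end)])
    return schedule


# Hypothetical factors for two pathways over three retraining iterations.
schedule = interpolated_importance([0.5, 0.5], [1.0, 0.0], num_iterations=3)
```

By contrast, the claim 7 variant would correspond to repeating a single fixed vector at every iteration.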

Allowable Subject Matter
Claim 5 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Response to Arguments
Applicant’s arguments filed 9/11/2025 with regard to the rejections under 35 USC § 101 as directed to an abstract idea have been fully considered and are persuasive.
Applicant’s arguments with respect to rejection of claims under 35 USC § 103 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument. See
Prior art made of record:
Mermound et al., U.S. PG-Publication No. 2023/0164029 A1.
Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MEGHAN A RIEHL whose telephone number is (571)272-6412. The examiner can normally be reached Mon-Thurs 9:30-6:30 EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kevin Young can be reached at (571)270-3180. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/MEGHAN ANNE RIEHL/Examiner, Art Unit 2194
/KEVIN L YOUNG/Supervisory Patent Examiner, Art Unit 2194


Prosecution Timeline

Dec 17, 2021: Application Filed
Jun 11, 2025: Non-Final Rejection mailed (§103)
Sep 11, 2025: Response Filed
Oct 28, 2025: Final Rejection mailed (§103)
Dec 17, 2025: Response after Non-Final Action

Precedent Cases

Applications granted by this same examiner with similar technology

17/526,875 (Patent 12547442): AD-HOC PROXY FOR BATCH PROCESSING TASK. 4y 2m to grant; granted Feb 10, 2026.

17/686,262 (Patent 12511163): System for Predicting Memory Resources and Scheduling Jobs in Service-Oriented Architectures Using Database Processors and a Job Ingestion Processor. 3y 10m to grant; granted Dec 30, 2025.

12/839,131 (Patent 9626339): User Interface with Navigation Controls for the Display or Concealment of Adjacent Content. 6y 9m to grant; granted Apr 18, 2017.

13/750,365 (Patent 9613131): ADJUSTING SEARCH RESULTS BASED ON USER SKILL AND CATEGORY INFORMATION. 4y 2m to grant; granted Apr 04, 2017.

13/831,471 (Patent 9589062): DURABLE MEMENTO SYSTEM. 3y 11m to grant; granted Mar 07, 2017.

Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 2-3
Grant Probability: 46%
With Interview (+65.1%): 99%
Median Time to Grant: 3y 10m (~0m remaining)
PTA Risk: Moderate

Based on 175 resolved cases by this examiner. Grant probability derived from career allowance rate.