Prosecution Insights
Last updated: April 19, 2026
Application No. 18/090,993

SYSTEM AND METHOD FOR FEDERATED LEARNING FOR AUTOMOTIVE APPLICATION WITH KNOWLEDGE DISTILLATION BY TEACHER MODEL

Status: Final Rejection (§103)
Filed: Dec 29, 2022
Examiner: GALVIN-SIEBENALER, PAUL MICHAEL
Art Unit: 2147
Tech Center: 2100 — Computer Architecture & Software
Assignee: Woven By Toyota Inc.
OA Round: 2 (Final)
Grant Probability: 25% (At Risk)
Projected OA Rounds: 3-4
Projected Time to Grant: 3y 3m
Grant Probability With Interview: 0%

Examiner Intelligence

Career Allow Rate: 25% (grants only 25% of cases: 1 granted / 4 resolved; -30.0% vs TC avg)
Interview Lift: -25.0% (minimal lift, based on resolved cases with interview)
Avg Prosecution: 3y 3m (typical timeline)
Total Applications: 43 across all art units (39 currently pending)

Statute-Specific Performance

§101: 29.8% (-10.2% vs TC avg)
§103: 36.8% (-3.2% vs TC avg)
§102: 19.0% (-21.0% vs TC avg)
§112: 14.5% (-25.5% vs TC avg)
Baseline: Tech Center average estimate • Based on career data from 4 resolved cases

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first-inventor-to-file provisions of the AIA. This action is in response to the amendment filed on Dec. 15, 2025. The amendments are linked to the original application filed on Dec. 29, 2022.

Response to Amendment

The Examiner thanks the applicant for the remarks, edits, and arguments.

Regarding Claim Rejections – 35 U.S.C. 103

Applicant Remarks: The applicant states that the proposed art fails to teach the amendments made to the independent claims. Further, the applicant states that the examiner's proposed art, Wu, fails to properly disclose the elements of claims 6, 13, and 19. The applicant states that Jin and Nguyen fail to teach each and every element of the amended claims. Further, because of claim dependency, the art proposed for the dependent claims would fail to cure the deficiencies of the art proposed for the independent claims. Therefore, the applicant believes the rejection under 35 U.S.C. 103 should be withdrawn for all claims.

Examiner Response: The applicant argues that Jin and Nguyen fail to teach or disclose the amended claims. Because these limitations are new amendments, those references were not previously applied against them; further consideration and evaluation of the amended independent claims is required and is performed after each submitted amendment. Next, the applicant argues that Wu fails to teach elements of claims 6, 13, and 19. The applicant has also cancelled these claims, making any arguments for or against Wu moot. Further, the examiner has reviewed the amendments, and Wu no longer applies to the amended claims and is therefore no longer relied upon for this application. Finally, as stated above, the examiner must perform a complete search after each amendment. The claims, remarks, specification, and previous rejection have been reconsidered as well.
The examiner notes that, since the claims have been amended, some of the previously presented art no longer applies; the examiner no longer relies on Wu, Nguyen, or Zhao. These references do not apply to the currently amended claims, but they may be used again if the applicant wishes to amend the claims and those amendments align with what those references teach. After the search was conducted, the examiner reconsidered the submitted material and believes a combination of previously presented art and new art has been found which teaches the amended claims. Therefore, the claim rejection under 35 U.S.C. 103 is maintained; see the 103 rejection below.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-5, 8-12, and 15-18 are rejected under 35 U.S.C. 103 as being unpatentable over Jin et al. (“Personalized Edge Intelligence via Federated Self-Knowledge Distillation”, Nov. 28, 2022, hereinafter “Jin”) in view of Kothandaraman et al. (“Domain Adaptive Knowledge Distillation for Driving Scene Semantic Segmentation”, 2021, hereinafter “Kothandaraman”).

Regarding claim 1, Jin discloses, “A method, implemented by programmed one or more processors, comprising:” (Algorithm 1 and Algorithm 2, pp.
570; Both algorithms disclose the method, executed by a computer, of the proposed system. Both algorithms present pseudocode which implements this system on a generic computing device containing processors linked to memory and I/O devices.)

“receiving, from one or more server computers through a communication network, a first model;” (Figure 3, pp. 569; This figure discloses the workflow of the proposed system, which uses federated learning and knowledge distillation. As seen in the figure, client k receives a global model, which serves as the student model, while retaining a second teacher model.)

“training the first model based on the training dataset; and” (The Proposed pFedSD, pp. 570; "To transfer the knowledge from the past model to the current local model, there are many schemes, such as knowledge distillation, parameter regularization. Here, we employ knowledge distillation from the most recently updated personalized model vk during the client updates phase." Once an update is received at the client device, training begins. The local model uses knowledge of the previous model and the global model to train itself, and is also trained on the local data contained at the client. Once the model is trained, the updated model is stored and a copy is sent to the global server for further aggregation.)

“transmitting first data representing the trained first model to the one or more server computers though the communication network.” (Figure 3, pp. 569; As seen in step 4 of the figure, which states, "clients send back the updated local model w_k^{t+1} to the server;")

Jin fails to explicitly disclose the remaining limitations of this claim. However, Kothandaraman discloses, “collecting sensor data acquired by a sensor on a vehicle;” (Introduction, pp. 135; “We evaluate our proposed method on large-scale autonomous driving datasets.
In addition to the benchmark synthetic-to-real adaptation case scenario (GTA5 to Cityscapes), we assess our pipeline on a real-to-real scenario too (BDD to Cityscapes).” This model takes input from known autonomous driving datasets, which are used to train and test the models.) and (Experimental Setup, pp. 138; “We evaluate the proposed algorithm on three popular large-scale autonomous driving datasets. The Cityscapes (CS) [11] dataset and Berkeley Deep Drive (BDD) [44] dataset are real scenes captured in Europe and the USA respectively. In particular, BDD captures varying illumination conditions, seasonal changes. GTA5 [30] is a popular synthetic driving dataset, and majorly emerged out of computer games.” This model is able to use actual autonomous driving datasets, which include images and videos.)

“inputting a first data item from among the collected sensor data to both the first model and a trained second model and comparing outputs of a first model and a second model;” (Figure 1, pp. 136; This figure shows the input scenes on the left, which are input into both the student and the teacher model) and (Overview of the Network Architecture, pp. 136; “RGB images from both source and target domains and the segmentation maps for the corresponding source domain images are the inputs to the networks.” The image data is input into both models, the teacher and the student networks.)

“based on a difference between the outputs being greater than a predetermined value, identifying the first data item for training the first model;” (KL divergence loss L_KL, pp. 137; “The output of the networks are probability distributions. We use the KL divergence loss [19] to motivate the student to achieve distributions close to the teacher at the output level. This encourages the output of the student network to emulate the output of the teacher network.
[see equation (3)] Here, λ_KL represents the weight hyperparameter for KL divergence distillation, q_i^s denotes student domain probability maps and q_i^t denotes the corresponding counterpart for teacher domain.” This model uses KL divergence to measure the difference between the outputs of the student and teacher models; λ_KL is the weight placed on the KL distillation term of the loss.)

“deriving an inference signal by running the trained second model using the first data item as input to the second model to provide a training dataset that contains the identified first data item and the derived inference signal as a supervision signal corresponding to the identified first data item;” (Domain-adaptive distillation with pseudo teacher labels L_pseudoT, pp. 137; “This distillation happens at the output space, in addition to the distillation using KL-divergence loss. Based on the soft labels (probability maps) provided by the teacher, we determine the class that each pixel belongs to generate pseudo labels.” This model uses the knowledge of the teacher model to label unlabeled data; the labeled data is then used by the student model for further training.) and (Domain-adaptive distillation with pseudo teacher labels L_pseudoT, pp. 138; “The enhancement of these pseudo-labels is mainly exhibited in the case of the student target domain, wherein teacher domain pseudo-labels serve as a proxy for the ground truth. On this notion, we apply cross entropy loss between the student network target domain outputs and the corresponding pseudo-labels from teacher, that we term as L_pseudoT.” The model uses the teacher-generated pseudo-labels to further train the student model.)
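As an illustration of the selection step mapped above (compare student and teacher output distributions, and keep an item for training when the divergence exceeds a threshold, with the teacher's output as the supervision signal), a minimal sketch follows. The function names and the threshold value are illustrative assumptions, not taken from the claims or the cited art.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_div(p, q):
    """KL(p || q) between two discrete probability distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def select_for_training(student, teacher, data_items, threshold=0.1):
    """Keep each item whose student/teacher outputs diverge by more than
    `threshold`, paired with the teacher's output as its supervision signal."""
    dataset = []
    for x in data_items:
        q_s = softmax(student(x))   # student output distribution
        q_t = softmax(teacher(x))   # teacher output distribution
        if kl_div(q_t, q_s) > threshold:
            dataset.append((x, q_t))  # teacher output acts as pseudo-label
    return dataset
```

Items on which the two models already agree are filtered out, so training concentrates on the disagreements, mirroring the claimed "difference between the outputs being greater than a predetermined value" condition.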
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Jin and Kothandaraman. Jin teaches a federated machine learning system in which the edge devices perform knowledge distillation using the previous local model as the teacher model and the aggregated, updated model from the server as the student model. Kothandaraman teaches a machine learning system that uses adaptive learning and knowledge distillation to train a student model using multiple loss functions for object detection in autonomous vehicles. One of ordinary skill would have been motivated to take a federated learning system which uses knowledge distillation and substitute its knowledge distillation model with another knowledge distillation model which uses multiple loss functions to detect objects for autonomous driving: “As we can observe in Table 1, source distillation (case (a)) and target distillation (case (b)) outperform the corresponding undistilled student network by 10.43% and 14.08% (4 and 5.4 mIoU points in terms of absolute numbers) respectively. Target distillation performs better than source distillation. This can be ascribable to the fact that distilling only in the target domain reduces the bias of the model towards the source domain, and improves performance in the target domain. In addition, our network in case (b) (target distillation) outperforms even the teacher network by 3.30% (1.4 mIoU points in terms of absolute numbers). Further, in concurrence with our intuition, case (c) (where both source and target distillation are done simultaneously) performs better than both case (a) and case (b), achieving an mIoU of 43.97 (14.71% and 3.87% relative improvement over the student and teacher networks respectively).
Among all the distillation case studies, we observe that case (d) performs the best with an improvement of 15.18% and 4.29% (5.82 and 1.82 mIoU points respectively) over the student and teacher networks respectively.” (Kothandaraman, Real-to-real adaptation: Berkeley Deep Drive to Cityscapes, pp. 139).

Regarding claim 2, Jin discloses, “receiving, from the one or more server computers through the communication network, second data that represents a model that is trained with aggregated model information from other edge models; and” (Figure 5, pp. 569; "5 the server aggregates all received local models to obtain a new global model w^{t+1}" After the local models have been trained, each new model is sent to the global server, where the models from the different clients are aggregated to form a new global model.) “updating the first model based on the second data.” (Algorithm 2, pp. 570; After the global server aggregates the local model updates, a new global model is formed. This global model is then sent to all of the local devices to update their local models. This process is seen in Algorithm 2 at lines 1-7.)

Regarding claim 3, Jin discloses, “wherein the training the first model comprises training a copy of the received first model.” (Algorithm 2, pp. 570; After the global server aggregates the local model updates, a new global model is formed and sent to all of the local devices to update their local models. The local clients train this model and use it as their own. The global model is still stored at the global server, and the clients update only their local student models, which represent a copy of the global model. Once training is complete, the local model is stored locally, see line 10, which is separate from the global model.)
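The server-side aggregation mapped for claims 2 and 3 (clients send back local models; the server combines them into a new global model and broadcasts it) is commonly realized as a FedAvg-style weighted mean. A minimal sketch follows, with each model represented as a flat list of parameters; the representation and weighting scheme are illustrative assumptions, not the specific aggregation of Jin.

```python
def aggregate(local_models, weights=None):
    """FedAvg-style server step: combine the clients' parameter vectors
    into a new global model, optionally weighting each client (e.g. by
    local dataset size). Equal weights are used by default."""
    n = len(local_models)
    if weights is None:
        weights = [1.0 / n] * n
    dim = len(local_models[0])
    # coordinate-wise weighted average over all client parameter vectors
    return [sum(w * m[i] for w, m in zip(weights, local_models))
            for i in range(dim)]
```

The resulting global model is then sent back to every client, each of which trains its own local copy, corresponding to the "copy of the received first model" language of claim 3.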
Regarding claim 4, Jin discloses, “obtaining, as the first data, a gradient between the first model prior to the training and the first model subsequent to the training.” (The Proposed pFedSD, pp. 570; “Here the hyperparameter λ controls the contributions of knowledge distillation. F_k(·) denotes the cross entropy (CE) loss of client k. L_KL denotes the Kullback-Leibler (KL) divergence function between past personalized prediction q_{v_k} and current local prediction q_{w_k^t}.” This passage discloses the loss function and how and when the models are trained. The method compares the previous model and the current model to determine a distance between the two; if there is a large difference, then more training is needed.)

Regarding claim 5, Jin discloses, “obtaining, as the first data, a gradient between the received first model and the copy of the first model that is updated by the training.” (The Proposed pFedSD, pp. 570; “Here the hyperparameter λ controls the contributions of knowledge distillation. F_k(·) denotes the cross entropy (CE) loss of client k. L_KL denotes the Kullback-Leibler (KL) divergence function between past personalized prediction q_{v_k} and current local prediction q_{w_k^t}.” This passage discloses the loss function and how and when the models are trained. The method compares the previous model and the current model to determine a distance between the two; if there is a large difference, then more training is needed. This system uses the prior student model as the teacher model. Under the broadest reasonable interpretation, the teacher model is considered the "copy of the first model," since it was once a copy of the global model and was trained on the local dataset.)

Regarding claim 8, Jin discloses, “A computing device, comprising: a memory storing instructions; and a processor configured to execute the instructions to:” (Algorithm 1 and Algorithm 2, pp.
570; Both algorithms disclose the method, executed by a computer, of the proposed system. Both algorithms present pseudocode which implements this system on a generic computing device containing processors linked to memory and I/O devices.)

“receive, from one or more server computers through a communication network, a first model;” (Figure 3, pp. 569; This figure discloses the workflow of the proposed system, which uses federated learning and knowledge distillation. As seen in the figure, client k receives a global model, which serves as the student model, while retaining a second teacher model.)

“train the first model based on the training dataset; and” (The Proposed pFedSD, pp. 570; "To transfer the knowledge from the past model to the current local model, there are many schemes, such as knowledge distillation, parameter regularization. Here, we employ knowledge distillation from the most recently updated personalized model vk during the client updates phase." Once an update is received at the client device, training begins. The local model uses knowledge of the previous model and the global model to train itself, and is also trained on the local data contained at the client. Once the model is trained, the updated model is stored and a copy is sent to the global server for further aggregation.)

“transmit first data representing the trained first model to the one or more server computers though the communication network.” (Figure 3, pp. 569; As seen in step 4 of the figure, which states, "clients send back the updated local model w_k^{t+1} to the server;")

Jin fails to explicitly disclose the remaining limitations of this claim. However, Kothandaraman discloses, “collect sensor data acquired by a sensor on a vehicle;” (Introduction, pp. 135; “We evaluate our proposed method on large-scale autonomous driving datasets.
In addition to the benchmark synthetic-to-real adaptation case scenario (GTA5 to Cityscapes), we assess our pipeline on a real-to-real scenario too (BDD to Cityscapes).” This model takes input from known autonomous driving datasets, which are used to train and test the models.) and (Experimental Setup, pp. 138; “We evaluate the proposed algorithm on three popular large-scale autonomous driving datasets. The Cityscapes (CS) [11] dataset and Berkeley Deep Drive (BDD) [44] dataset are real scenes captured in Europe and the USA respectively. In particular, BDD captures varying illumination conditions, seasonal changes. GTA5 [30] is a popular synthetic driving dataset, and majorly emerged out of computer games.” This model is able to use actual autonomous driving datasets, which include images and videos.)

“input a first data item from among the collected sensor data to both a first model and a trained second model and compare outputs of the first model and the second model;” (Figure 1, pp. 136; This figure shows the input scenes on the left, which are input into both the student and the teacher model) and (Overview of the Network Architecture, pp. 136; “RGB images from both source and target domains and the segmentation maps for the corresponding source domain images are the inputs to the networks.” The image data is input into both models, the teacher and the student networks.)

“based on a difference between the outputs being greater than a predetermined value, identify the first data item for training the first model;” (KL divergence loss L_KL, pp. 137; “The output of the networks are probability distributions. We use the KL divergence loss [19] to motivate the student to achieve distributions close to the teacher at the output level. This encourages the output of the student network to emulate the output of the teacher network.
[see equation (3)] Here, λ_KL represents the weight hyperparameter for KL divergence distillation, q_i^s denotes student domain probability maps and q_i^t denotes the corresponding counterpart for teacher domain.” This model uses KL divergence to measure the difference between the outputs of the student and teacher models; λ_KL is the weight placed on the KL distillation term of the loss.)

“derive an inference signal by running the trained second model using the first data item as input to the second model to provide a training dataset that contains the identified first data item and the derived inference signal as a supervision signal corresponding to the identified first data item;” (Domain-adaptive distillation with pseudo teacher labels L_pseudoT, pp. 137; “This distillation happens at the output space, in addition to the distillation using KL-divergence loss. Based on the soft labels (probability maps) provided by the teacher, we determine the class that each pixel belongs to generate pseudo labels.” This model uses the knowledge of the teacher model to label unlabeled data; the labeled data is then used by the student model for further training.) and (Domain-adaptive distillation with pseudo teacher labels L_pseudoT, pp. 138; “The enhancement of these pseudo-labels is mainly exhibited in the case of the student target domain, wherein teacher domain pseudo-labels serve as a proxy for the ground truth. On this notion, we apply cross entropy loss between the student network target domain outputs and the corresponding pseudo-labels from teacher, that we term as L_pseudoT.” The model uses the teacher-generated pseudo-labels to further train the student model.)
Regarding claim 9, Jin discloses, “receive, from the one or more server computers through the communication network, second data that represents a model that is trained with aggregated model information from other edge models; and” (Figure 5, pp. 569; "5 the server aggregates all received local models to obtain a new global model w^{t+1}" After the local models have been trained, each new model is sent to the global server, where the models from the different clients are aggregated to form a new global model.) “update the first model based on the second data.” (Algorithm 2, pp. 570; After the global server aggregates the local model updates, a new global model is formed. This global model is then sent to all of the local devices to update their local models. This process is seen in Algorithm 2 at lines 1-7.)

Regarding claim 10, Jin discloses, “wherein the instructions to train the first model comprises instructions to train a copy of the received first model.” (Algorithm 2, pp. 570; After the global server aggregates the local model updates, a new global model is formed and sent to all of the local devices to update their local models. The local clients train this model and use it as their own. The global model is still stored at the global server, and the clients update only their local student models, which represent a copy of the global model. Once training is complete, the local model is stored locally, see line 10, which is separate from the global model.)

Regarding claim 11, Jin discloses, “wherein the processor is further configured to execute the instructions to obtain, as the first data, a gradient between the first model prior to the training and the first model subsequent to the training.” (The Proposed pFedSD, pp. 570; “Here the hyperparameter λ controls the contributions of knowledge distillation. F_k(·) denotes the cross entropy (CE) loss of client k.
L_KL denotes the Kullback-Leibler (KL) divergence function between past personalized prediction q_{v_k} and current local prediction q_{w_k^t}.” This passage discloses the loss function and how and when the models are trained. The method compares the previous model and the current model to determine a distance between the two; if there is a large difference, then more training is needed.)

Regarding claim 12, Jin discloses, “wherein the processor is further configured to execute the instructions to obtain, as the first data, a gradient between the received first model and the copy of the first model that is updated by the training.” (The Proposed pFedSD, pp. 570; “Here the hyperparameter λ controls the contributions of knowledge distillation. F_k(·) denotes the cross entropy (CE) loss of client k. L_KL denotes the Kullback-Leibler (KL) divergence function between past personalized prediction q_{v_k} and current local prediction q_{w_k^t}.” This passage discloses the loss function and how and when the models are trained. The method compares the previous model and the current model to determine a distance between the two; if there is a large difference, then more training is needed. This system uses the prior student model as the teacher model. Under the broadest reasonable interpretation, the teacher model is considered the "copy of the first model," since it was once a copy of the global model and was trained on the local dataset.)

Regarding claim 15, Jin discloses, “A non-transitory computer-readable medium storing instructions, the instructions comprising: one or more instructions that, when executed by one or more processors of a device, cause the one or more processors to:” (Experimental Setup, pp. 573; "We implement all baselines mentioned above in PyTorch. We simulate the server and a set of clients in a multiprocessing manner and adopt MPI as the communication backend.
All experiments are conducted on a deep learning server equipped with four V100 GPUs." PyTorch is open-source machine learning software designed for use on a generic computer. Further, this article states that the system was executed on servers equipped with GPUs. This system, therefore, was executed on hardware containing processing devices, memory systems storing the instructions to be executed, and transmission devices to transmit the data.)

“receive, from one or more server computers through a communication network, a first model;” (Figure 3, pp. 569; This figure discloses the workflow of the proposed system, which uses federated learning and knowledge distillation. As seen in the figure, client k receives a global model, which serves as the student model, while retaining a second teacher model.)

“train the first model based on the training dataset; and” (The Proposed pFedSD, pp. 570; "To transfer the knowledge from the past model to the current local model, there are many schemes, such as knowledge distillation, parameter regularization. Here, we employ knowledge distillation from the most recently updated personalized model vk during the client updates phase." Once an update is received at the client device, training begins. The local model uses knowledge of the previous model and the global model to train itself, and is also trained on the local data contained at the client. Once the model is trained, the updated model is stored and a copy is sent to the global server for further aggregation.)

“transmit first data representing the trained first model to the one or more server computers though the communication network.” (Figure 3, pp. 569; As seen in step 4 of the figure, which states, "clients send back the updated local model w_k^{t+1} to the server;")

Jin fails to explicitly disclose the remaining limitations of this claim.
However, Kothandaraman discloses, “collect sensor data acquired by a sensor on a vehicle;” (Introduction, pp. 135; “We evaluate our proposed method on large-scale autonomous driving datasets. In addition to the benchmark synthetic-to-real adaptation case scenario (GTA5 to Cityscapes), we assess our pipeline on a real-to-real scenario too (BDD to Cityscapes).” This model takes input from known autonomous driving datasets, which are used to train and test the models.) and (Experimental Setup, pp. 138; “We evaluate the proposed algorithm on three popular large-scale autonomous driving datasets. The Cityscapes (CS) [11] dataset and Berkeley Deep Drive (BDD) [44] dataset are real scenes captured in Europe and the USA respectively. In particular, BDD captures varying illumination conditions, seasonal changes. GTA5 [30] is a popular synthetic driving dataset, and majorly emerged out of computer games.” This model is able to use actual autonomous driving datasets, which include images and videos.)

“input a first data item from among the collected sensor data to both a first model and a trained second model and compare outputs of the first model and the second model;” (KL divergence loss L_KL, pp. 137; “The output of the networks are probability distributions. We use the KL divergence loss [19] to motivate the student to achieve distributions close to the teacher at the output level. This encourages the output of the student network to emulate the output of the teacher network. [see equation (3)] Here, λ_KL represents the weight hyperparameter for KL divergence distillation, q_i^s denotes student domain probability maps and q_i^t denotes the corresponding counterpart for teacher domain.” This model uses KL divergence to measure the difference between the outputs of the student and teacher models; λ_KL is the weight placed on the KL distillation term of the loss.)
“based on a difference between the outputs being greater than a predetermined value, identify the first data item for training the first model;” (KL divergence loss L_KL, pp. 137; “The output of the networks are probability distributions. We use the KL divergence loss [19] to motivate the student to achieve distributions close to the teacher at the output level. This encourages the output of the student network to emulate the output of the teacher network. [see equation (3)] Here, λ_KL represents the weight hyperparameter for KL divergence distillation, q_i^s denotes student domain probability maps and q_i^t denotes the corresponding counterpart for teacher domain.” This model uses KL divergence to measure the difference between the outputs of the student and teacher models; λ_KL is the weight placed on the KL distillation term of the loss.)

“derive an inference signal by running the trained second model using the first data item as input to the second model to provide a training dataset that contains the identified first data item and the derived inference signal as a supervision signal corresponding to the identified first data item;” (Domain-adaptive distillation with pseudo teacher labels L_pseudoT, pp. 137; “This distillation happens at the output space, in addition to the distillation using KL-divergence loss. Based on the soft labels (probability maps) provided by the teacher, we determine the class that each pixel belongs to generate pseudo labels.” This model uses the knowledge of the teacher model to label unlabeled data; the labeled data is then used by the student model for further training.) and (Domain-adaptive distillation with pseudo teacher labels L_pseudoT, pp. 138; “The enhancement of these pseudo-labels is mainly exhibited in the case of the student target domain, wherein teacher domain pseudo-labels serve as a proxy for the ground truth.
On this notion, we apply cross entropy loss between the student network target domain outputs and the corresponding pseudo-labels from teacher, that we term as L_pseudoT.” The model uses the teacher-generated pseudo-labels to further train the student model.)

Regarding claim 16, Jin discloses, “wherein the instructions further comprise: one or more instructions that, when executed by one or more processors of the device, cause the one or more processors to: receive, from the one or more server computers through the communication network, second data that represents a model that is trained with aggregated model information from other edge models; and” (Figure 5, pp. 569; "5 the server aggregates all received local models to obtain a new global model w^{t+1}" After the local models have been trained, each new model is sent to the global server, where the models from the different clients are aggregated to form a new global model.) “update the first model based on the second data.” (Algorithm 2, pp. 570; After the global server aggregates the local model updates, a new global model is formed. This global model is then sent to all of the local devices to update their local models. This process is seen in Algorithm 2 at lines 1-7.)

Regarding claim 17, Jin discloses, “wherein causing the one or more processors to train the first model comprises causing the one or more processors to train a copy of the received first model.” (Algorithm 2, pp. 570; After the global server aggregates the local model updates, a new global model is formed and sent to all of the local devices to update their local models. The local clients train this model and use it as their own. The global model is still stored at the global server, and the clients update only their local student models, which represent a copy of the global model.
Once training is complete the local model is stored locally, see line 10, which is separate to the global model.) Regarding claim 18, Jin discloses, “wherein the instructions further comprise: one or more instructions that, when executed by one or more processors of the device, cause the one or more processors to obtain, as the first data, a gradient between the first model prior to the training and the first model subsequent to the training.” (The Proposed pFedSD, pp. 570; “Here the hyperparameter λ controls the contributions of knowledge distillation. F k ⋅ denotes the cross entropy (CE) loss of client k. L K L denotes the Kullback-Leibler (KL) divergence function between past personalized prediction q v k and current local prediction q w k t .” This process discloses the loss function and how and when the models are trained. This method will compare the previous model and the current model to determine a distance between the two models. If there is a large difference than more training is needed.) Claims 7, 14, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Jin and Kothandaraman in view of Mishra et al., (Mishra et al., “Confidence Conditioned Knowledge Distillation”, 2021, hereinafter “Mishra”). Regarding claim 7, Mishra discloses, “wherein the identifying of the first data item for training the first model is further based on an confidence score for an output generated by inputting the first data item to the first model in real time being less than the predetermined value.” (Figure 2, pp. 13; This figure shows the method process for the self-regulation algorithm. This algorithm discloses that the student model produces an output and that output is tested to generate a difference between the predicted output and ground truth output. This score is represented as δ . δ is then evaluated to see if it meets a threshold, δ <   η .) 
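The mechanisms the examiner cites above from Kothandaraman (KL-divergence distillation and pseudo-teacher labels), Jin (server-side aggregation and the pFedSD-style regularized local objective), and Mishra (the δ < η confidence gate) can be summarized in a short sketch. This is a minimal illustration only, not code from the application or the cited references; all function names, data shapes, and default hyperparameters (lam_kl, lam, eta) are assumptions made for readability.

```python
import math

def kl_distillation(student_probs, teacher_probs, lam_kl=0.5, eps=1e-12):
    """lam_kl-weighted KL(teacher || student), averaged over samples/pixels."""
    total = 0.0
    for s_row, t_row in zip(student_probs, teacher_probs):
        for s, t in zip(s_row, t_row):
            s, t = max(s, eps), max(t, eps)
            total += t * math.log(t / s)
    return lam_kl * total / len(student_probs)

def pseudo_label_ce(student_probs, teacher_probs, eps=1e-12):
    """Cross entropy of student outputs against argmax pseudo-labels from the teacher."""
    loss = 0.0
    for s_row, t_row in zip(student_probs, teacher_probs):
        label = max(range(len(t_row)), key=t_row.__getitem__)  # teacher pseudo-label
        loss -= math.log(max(s_row[label], eps))
    return loss / len(student_probs)

def pfedsd_local_loss(ce_loss, past_probs, curr_probs, lam=0.3):
    """pFedSD-style local objective: CE on local data plus a lam-weighted KL term
    between the past personalized prediction and the current local prediction."""
    return ce_loss + kl_distillation(curr_probs, past_probs, lam_kl=lam)

def fedavg(client_weights, client_sizes):
    """Server-side size-weighted aggregation of local models into a new
    global model w_{t+1}, which is then redistributed to the clients."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(dim)]

def select_for_training(confidence, eta):
    """Confidence gate: keep a data item for further training only when its
    confidence score falls below the threshold eta (the claimed predetermined value)."""
    return confidence < eta
```

In this sketch, kl_distillation plays the role of the output-level difference used to identify training data, fedavg the aggregation mapped to claim 16, pfedsd_local_loss the regularized objective mapped to claim 18, and select_for_training the confidence threshold mapped to claims 7, 14, and 20.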
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Jin, Kothandaraman and Mishra. Jin teaches a federated machine learning system where the edge devices perform knowledge distillation using the previous local model as the teacher model and the aggregated and updated model from the server as the student model. Kothandaraman teaches a machine learning system that uses adaptive learning and knowledge distillation to train a student model using multiple loss functions for object detection in autonomous vehicles. Mishra teaches adaptive learning and self-regulation techniques for knowledge distillation models which incorporate confidence scores/values to train the teacher and student models. One of ordinary skill would have been motivated to take a federated learning system that uses knowledge distillation, substitute its distillation model with one that uses multiple loss functions to detect objects for autonomous driving, and incorporate the training methods of Mishra to further improve the student model: “In comparison to these methods, CCKD-T+Reg method uses fewer samples due to self-regulation. Adding self-regulation decreases the amount of data required to achieve a comparable level of generalization performance. In general, a slight decrease in performance is observed (Tables 2-5) as CCKD-T+Reg method does not use all the samples present in the dataset across all epochs. For the CIFAR10 dataset, the sample efficiency results for the AlexNet case are reported. CIFAR10 dataset is more realistic compared to MNIST and Fashion-MNIST datasets, so the sample utilization is the highest.” (Mishra, Sample Efficiency on adding Self-Regulation, pp. 19-20)

Regarding claim 14, Mishra discloses, “wherein the processor is further configured to execute the instructions to input the first data item to the first model in real time and obtain an output and a confidence score for the output, and” (Figure 2, pp. 13; This figure discloses a self-regulation algorithm used for knowledge distillation. This model will evaluate the student prediction and the ground truth prediction. The model will then generate a confidence score labeled δ.) “wherein the identification of the first data item for training the first model is further based on the confidence score being less than the predetermined value.” (Figure 2, pp. 13; This figure shows the method process for the self-regulation algorithm. This algorithm discloses that the student model produces an output and that output is tested to generate a difference between the predicted output and ground truth output. This score is represented as δ. δ is then evaluated to see if it meets a threshold, δ < η.)

Regarding claim 20, Mishra discloses, “wherein the instructions further comprise one or more instructions that, when executed by one or more processors of the device, cause the one or more processors to input the first data item to the first model in real time and obtain an output and a confidence score for the output, and” (Figure 2, pp. 13; This figure discloses a self-regulation algorithm used for knowledge distillation. This model will evaluate the student prediction and the ground truth prediction. The model will then generate a confidence score labeled δ.) “wherein the identification of the first data item for training the first model is further based on the confidence score being less than the predetermined value.” (Figure 2, pp. 13; This figure shows the method process for the self-regulation algorithm.
This algorithm discloses that the student model produces an output and that output is tested to generate a difference between the predicted output and ground truth output. This score is represented as δ. δ is then evaluated to see if it meets a threshold, δ < η.)

Conclusion

THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to PAUL MICHAEL GALVIN-SIEBENALER whose telephone number is (571) 272-1257. The examiner can normally be reached Monday - Friday, 8AM to 5PM.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Viker Lamardo, can be reached at (571) 270-5871. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/PAUL M GALVIN-SIEBENALER/
Examiner, Art Unit 2147

/VIKER A LAMARDO/
Supervisory Patent Examiner, Art Unit 2147

Prosecution Timeline

Dec 29, 2022
Application Filed
Sep 24, 2025
Non-Final Rejection — §103
Dec 15, 2025
Response Filed
Feb 26, 2026
Final Rejection — §103 (current)


Prosecution Projections

3-4
Expected OA Rounds
25%
Grant Probability
0%
With Interview (-25.0%)
3y 3m
Median Time to Grant
Moderate
PTA Risk
Based on 4 resolved cases by this examiner. Grant probability derived from career allow rate.
