Prosecution Insights
Last updated: April 19, 2026
Application No. 18/139,765

TRAINING MODELS UNDER RESOURCE CONSTRAINTS FOR CROSS-DEVICE FEDERATED LEARNING

Status: Non-Final OA (§103)
Filed: Apr 26, 2023
Examiner: SIPPEL, MOLLY CLARKE
Art Unit: 2122
Tech Center: 2100 — Computer Architecture & Software
Assignee: International Business Machines Corporation
OA Round: 1 (Non-Final)

Grant Probability: 50% (Moderate)
OA Rounds: 1-2
To Grant: 3y 7m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 50% (grants 50% of resolved cases; 7 granted / 14 resolved; -5.0% vs TC avg)
Interview Lift: strong, +58.3% on resolved cases with interview
Avg Prosecution: 3y 7m (typical timeline)
Total Applications: 39 across all art units (25 currently pending)

Statute-Specific Performance

§101: 33.8% (-6.2% vs TC avg)
§103: 32.0% (-8.0% vs TC avg)
§102: 9.8% (-30.2% vs TC avg)
§112: 23.6% (-16.4% vs TC avg)

Tech Center averages are estimates • Based on career data from 14 resolved cases

Office Action (§103)
DETAILED ACTION

This action is responsive to the application filed on 04/26/2023. Claims 1-20 are pending in the case. Claims 1, 13, and 18 are independent claims.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement

The information disclosure statements (IDS) submitted on 04/26/2023 and 05/26/2023 are being considered by the examiner.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: “A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.”

Claims 1-2, 4-9, and 12-20 are rejected under 35 U.S.C. 103 as being unpatentable over Zhao et al., “When does the student surpass the teacher? Federated Semi-supervised Learning with Teacher-Student EMA,” 01/24/2023, https://arxiv.org/pdf/2301.10114 (hereinafter “Zhao”), in view of Nguyen et al., “CDKT-FL: Cross-Device Knowledge Transfer using Proxy Dataset in Federated Learning,” April 4, 2022 (hereinafter “Nguyen”).
Regarding claim 1, Zhao teaches A … method of training a global student model (Zhao, Page 1, Abstract, Lines 19-22, “We propose a novel approach FedSwitch, that improves privacy as well as generalization performance through Exponential Moving Average (EMA) updates”; see also Zhao, Page 4, Figure 1), comprising: storing, on a server, the global student model comprising a first layer and a teacher model comprising a first layer (Zhao, Page 4, Figure 1: a “teacher model” and “student model” can be seen inside the yellow “server” square; a person of ordinary skill in the art would recognize that a machine learning model would comprise “a first layer”); transmitting, from the server, local student models based on the global student model, the local student models each comprising … and a first layer (Zhao, Page 4, Figure 1: the “message” seen at the bottom of the figure is transmitted from the server to the clients, and the “student model” shown in the “client” square is considered to be the “local student model”; a person of ordinary skill in the art would recognize the “local student models” comprise “a first layer”; Zhao, Page 5, Adaptive Switching, Paragraph 2, Lines 13-19, “If the teacher’s mean KL-divergence is closer to β, both the teacher and student model are sent to clients in the next round, and clients use the teacher model to generate pseudo-labels. If the student’s mean KL-divergence is closer to β, only the student model is sent to clients in the next round, and clients use the student model to generate pseudo-labels similar to FedProx-FixMatch”); … receiving, at the server, first layer weights of the local student models (Zhao, Page 4, Section 3.3, Lines 6-8, “in every t-th communication round the updates from the local student models S_k^t on each k-th client are sent to the server”; see also Zhao, Page 4, Figure 1: the “student gradient,” which is considered to be the “first layer weights of the local student model,” is shown being transmitted from the client to the server); and calculating, on the server, first layer weights of the global student model using the received first layer weights of the local student models (Zhao, Page 5, Algorithm 2 Labels-at-Server (sequential), Line 14, “S^{t+1} ← (1/m) Σ_{k∈L_t} S_k^{t+1}”; see also Zhao, Page 4, Figure 1: on the left side of the figure, each client student model can be seen forming one global student model on the server).

Zhao also teaches that the method is computer-implemented (Zhao, Page 6, Section 5.2, Lines 1-7, “For experiments on the CIFAR-10 dataset, all training runs use 100 clients in total, 5 clients per round, and 1 local client epoch per round. For ease of comparison with baseline approaches, we follow FedMatch settings in our implementation. Data is distributed evenly across 100 clients for labels-at-client, i.e. we have 5 labels per class per client for a total of 5,000 labeled examples”; a person of ordinary skill in the art would recognize that these experiments would require the use of a computer).
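The averaging step the rejection cites from Zhao’s Algorithm 2, S^{t+1} ← (1/m) Σ_{k∈L_t} S_k^{t+1}, can be sketched in a few lines. This is a hedged illustration only; the function and variable names are ours, not from either reference, and the layer is flattened to a plain list of weights for simplicity.

```python
def aggregate_layer_weights(client_weights):
    """Average one layer's weights across the m selected clients,
    computing S^{t+1} = (1/m) * sum_k S_k^{t+1} elementwise."""
    m = len(client_weights)
    return [sum(ws) / m for ws in zip(*client_weights)]

# Toy example: three clients each report a first layer flattened to four
# weights; the server averages the weight vectors elementwise.
updates = [[1.0, 1.0, 1.0, 1.0],
           [2.0, 2.0, 2.0, 2.0],
           [3.0, 3.0, 3.0, 3.0]]
global_layer = aggregate_layer_weights(updates)  # [2.0, 2.0, 2.0, 2.0]
```

The same elementwise mean is what the rejection reads as the claimed "calculating, on the server, first layer weights of the global student model."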
Zhao does not explicitly teach that the local student models comprise an embedding layer; nor receiving, at the server, an embedding layer output of one of the local student models; performing, on the server, a forward pass on the first layer of the teacher model, with the embedding layer output as an input, to generate a teacher model first layer output; or transmitting, from the server, the teacher model first layer output.

Nguyen teaches that the local student models comprise an embedding layer (Nguyen, Page 3, Figure 1: each client is shown with a middle layer with three nodes, which is considered to be the “embedding layer”; Nguyen, Page 3, Col 1, Lines 4-5, “embedding representation features (i.e., activation results e of the intermediate layer)”) and receiving, at the server, an embedding layer output of one of the local student models (Nguyen, Page 2, Contributions, Lines 8-11, “we propose cross-device knowledge transfer (CDKT) with two mechanisms: 1) global knowledge transfer to transfer the knowledge from client models to the global model”; Nguyen, Page 3, Col 1, Lines 2-6, “the transferable knowledge is the outcomes (i.e., activation results z of the final layer) and (or) embedding representation features (i.e., activation results e of the intermediate layer) given the proxy samples as shown in Fig. 1”; see also Nguyen, Page 3, Figure 1); performing, on the server, a forward pass on the first layer of the teacher model, with the embedding layer output as an input, to generate a teacher model first layer output (Nguyen, Page 4, Section 2.1, Bullet point 2, Lines 1-4 and Equation 5, “We formulate the general generalized model construction problem with the global CDKT regularizer using a generic function d … l_s(D_r) = l_CE(z_s | D_r) + β·d(e_s, (1/N)Σ_{n∈N} e_n | D_r) + β·d(z_s, λ·y_r + (1−λ)·(1/N)Σ_{n∈N} z_n | D_r)”; see also Nguyen, Page 4, Algorithm 1, Steps 8-11; the “embedding layer output” is used during the loss calculation on the server, and is thus considered an “input” to the teacher model; see also Nguyen, Page 3, Figure 1: the final layer of the teacher model, shown with four nodes, is considered to be the “first layer of the teacher model,” and its output, “z_s,” is considered to be the “first layer output”); transmitting, from the server, the teacher model first layer output (Nguyen, Page 2, Contributions, Lines 8-13, “we propose cross-device knowledge transfer (CDKT) with two mechanisms: 1) global knowledge transfer to transfer the knowledge from client models to the global model, 2) on-device knowledge transfer to transfer the generalized knowledge of the global model to the client models”; see also Nguyen, Page 3, Figure 1: “g_s” can be seen being transferred from the server model, considered to be the teacher model, to the device models).

It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention, to have modified the global student model training method of Zhao to include the cross-device knowledge transfer method of Nguyen.
The motivation to do so would have been that the knowledge transfer method achieves improved speed and better personalization of local models while tackling privacy leakage issues with minimal communication data load during transfers (Nguyen, Page 1, Section 1, Paragraph 2).

Regarding claim 2, the rejection of claim 1 is incorporated, and further, the proposed combination teaches wherein the local student models are each transmitted to a different client device (Zhao, Page 5, Algorithm 1 Labels-at-Client Description, Line 1, “Client devices 1,…K”; Zhao, Page 5, Algorithm 2 Labels-at-Server (sequential) Description, Lines 1-2, “Client devices 1,…K”; Zhao, Page 4, Section 3.3, Lines 5-6, “student models are updated after every batch of data on their respective devices”).

Regarding claim 4, the rejection of claim 2 is incorporated, and further, the proposed combination teaches wherein calculating, on the server, the first layer weights of the global student model comprises a federated averaging process (Zhao, Page 5, Algorithm 2 Labels-at-Server (sequential), Line 14, “S^{t+1} ← (1/m) Σ_{k∈L_t} S_k^{t+1}”; the updates are summed and divided by the number of clients, so this process is considered to be “a federated averaging process”).

Regarding claim 5, the rejection of claim 1 is incorporated. The proposed combination thus far does not explicitly teach training, on the server, the teacher model on public datasets. Nguyen teaches training, on the server, the teacher model on public datasets (Nguyen, Page 4, Algorithm 1, Step 10, “We loop through batches of proxy data”; Nguyen, Page 3, Col 1, Lines 6-9, “We assume this proxy data is of small size and available by clients to follow a practical scenario where the system can pre-collect a small amount of labeled data at the beginning”; because the “proxy data” is available to the clients, it is considered to be “public datasets”).
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention, to have modified the global student training method of the proposed combination to include training the teacher model on public datasets as taught by Nguyen. The motivation to do so would have been that having a public dataset available to all devices improves global model performance (Nguyen, Page 3, Col 1, Lines 7-9, “follow a practical scenario where the system can pre-collect a small amount of labeled data at the beginning”; Nguyen, Page 3, Col 1, Paragraph 2, Lines 1-5, “Our experiments reveal that (i) the outcome distribution and embedding features’ representation of the proxy data shared by all devices and servers are transferable and (ii) quickly improves the global model performance and personalized performance of the client models”).

Regarding claim 6, the rejection of claim 1 is incorporated, and further, the proposed combination teaches selecting, by the server, a number of clients to transmit the first teacher model output from a number of available clients, each selected client receiving one of the local student models (Zhao, Page 5, Algorithm 1 Labels-at-Client, Line 5, “L_t ← (random set of m clients)”; Zhao, Page 5, Algorithm 2 Labels-at-Server (sequential), Line 5, “L_t ← (random set of m clients)”; further, the first line of each algorithm, “Server executes:”, demonstrates this selection is performed “by the server,” and the remainder of the algorithms demonstrate that only these selected clients are used for this iteration of the method).
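The client-selection step mapped to claim 6, “L_t ← (random set of m clients)”, can be sketched as sampling without replacement. This is an illustrative sketch: the helper name and the seed parameter are ours, not from Zhao, though the 100-client / 5-per-round figures come from Zhao’s quoted CIFAR-10 setup.

```python
import random

def select_clients(num_available, m, seed=None):
    """Server-side step "L_t <- (random set of m clients)": pick m distinct
    client indices out of num_available for this communication round."""
    rng = random.Random(seed)
    return rng.sample(range(num_available), m)

# Zhao's quoted CIFAR-10 setup uses 100 clients in total, 5 per round.
round_clients = select_clients(num_available=100, m=5, seed=0)
```

Only the indices in `round_clients` would receive local student models and participate in that round’s aggregation.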
Regarding claim 7, the rejection of claim 6 is incorporated, and further, the proposed combination teaches wherein each client of the number of clients comprises one or more client devices (Zhao, Page 5, Algorithm 1 Labels-at-Client Description, Line 1, “Client devices 1,…K”; Zhao, Page 5, Algorithm 2 Labels-at-Server (sequential) Description, Lines 1-2, “Client devices 1,…K”; Zhao, Page 4, Section 3.3, Lines 5-6, “student models are updated after every batch of data on their respective devices”).

Regarding claim 8, the rejection of claim 7 is incorporated, and further, the proposed combination teaches wherein each client device comprises locally stored data sets (Zhao, Page 4, Figure 1: the client is depicted as containing “client data k,” which is considered to be the “locally stored data sets”).

Regarding claim 9, the rejection of claim 1 is incorporated, and further, the proposed combination teaches wherein the embedding layer output does not comprise data from a data set stored locally on a client device (Nguyen, Page 3, Col 1, Lines 2-6, “the transferable knowledge is the outcomes (i.e., activation results z of the final layer) and (or) embedding representation features (i.e., activation results e of the intermediate layer) given the proxy samples as shown in Fig. 1”; a person of ordinary skill in the art would recognize that an “activation result” of the intermediate layer would not contain data from the original dataset).
Regarding claim 12, the rejection of claim 1 is incorporated, and further, the proposed combination teaches performing, on the server, a forward pass on a second layer of the teacher model, with the embedding layer output as an input, to generate a teacher model second layer output (Nguyen, Page 4, Section 2.1, Bullet point 2, Lines 1-4 and Equation 5, “We formulate the general generalized model construction problem with the global CDKT regularizer using a generic function d … l_s(D_r) = l_CE(z_s | D_r) + β·d(e_s, (1/N)Σ_{n∈N} e_n | D_r) + β·d(z_s, λ·y_r + (1−λ)·(1/N)Σ_{n∈N} z_n | D_r)”; see also Nguyen, Page 4, Algorithm 1, Steps 8-11; the “embedding layer output” is used during the loss calculation on the server, and is thus considered an “input” to the teacher model; see also Nguyen, Page 3, Figure 1: the final layer of the teacher model, shown with four nodes, is considered to be the “first layer of the teacher model,” and its output, “z_s,” is considered to be the “first layer output”; because the algorithm is run in a loop, as shown by Steps 2 and 12 of the algorithm, the second time the loop is executed the final layer of the teacher model is considered to be the “second layer of the teacher model,” as it is updated during each iteration and is thus an updated layer different from the “first layer,” and “z_s” is considered to be the “second layer output”); transmitting, from the server, the teacher model second layer output (Nguyen, Page 2, Contributions, Lines 8-13, “we propose cross-device knowledge transfer (CDKT) with two mechanisms: 1) global knowledge transfer to transfer the knowledge from client models to the global model, 2) on-device knowledge transfer to transfer the generalized knowledge of the global model to the client models”; see also Nguyen, Page 3, Figure 1: “g_s” can be seen being transferred from the server model, considered to be the teacher model, to the device models); receiving, at the server, second layer weights of the local student models (Zhao, Page 4, Section 3.3, Lines 6-8, “in every t-th communication round the updates from the local student models S_k^t on each k-th client are sent to the server”; see also Zhao, Page 4, Figure 1: the “student gradient,” which is considered to be the “first layer weights of the local student model,” is shown being transmitted from the client to the server; because the algorithm is run in a loop, as shown by Steps 2 and 12 of the algorithm, the second iteration of the method transmits the “second layer weights of the local student model,” as the local models are updated with each iteration); and calculating, on the server, second layer weights of the global student model using the received second layer weights of the local student models (Zhao, Page 5, Algorithm 2 Labels-at-Server (sequential), Line 14, “S^{t+1} ← (1/m) Σ_{k∈L_t} S_k^{t+1}”; see also Zhao, Page 4, Figure 1: on the left side of the figure, each client student model can be seen forming one global student model on the server).
Regarding claim 13, Zhao teaches A … method of training a global student model (Zhao, Page 1, Abstract, Lines 19-22, “We propose a novel approach FedSwitch, that improves privacy as well as generalization performance through Exponential Moving Average (EMA) updates”; see also Zhao, Page 4, Figure 1), comprising: receiving, on a client device comprising a data set, a local student model based on the global student model, the local student model comprising … and a first layer (Zhao, Page 4, Figure 1: the “message” seen at the bottom of the figure is transmitted from the server to the clients, and the “student model” shown in the “client” square is considered to be the “local student model”; a person of ordinary skill in the art would recognize the “local student models” comprise “a first layer”; Zhao, Page 5, Adaptive Switching, Paragraph 2, Lines 13-19, “If the teacher’s mean KL-divergence is closer to β, both the teacher and student model are sent to clients in the next round, and clients use the teacher model to generate pseudo-labels. If the student’s mean KL-divergence is closer to β, only the student model is sent to clients in the next round, and clients use the student model to generate pseudo-labels similar to FedProx-FixMatch”); … training, on the client device, the first layer of the local student model until … converges … (Zhao, Page 7, Section 6.1, Lines 9-10, “we let the method run until convergence (800 epochs, equals 16000 rounds)”); and transmitting, from the client device, first layer weights of the first layer of the local student model (Zhao, Page 4, Section 3.3, Lines 6-8, “in every t-th communication round the updates from the local student models S_k^t on each k-th client are sent to the server”; see also Zhao, Page 4, Figure 1: the “student gradient,” which is considered to be the “first layer weights of the local student model,” is shown being transmitted from the client to the server).

Zhao also teaches that the method is computer-implemented (Zhao, Page 6, Section 5.2, Lines 1-7, “For experiments on the CIFAR-10 dataset, all training runs use 100 clients in total, 5 clients per round, and 1 local client epoch per round. For ease of comparison with baseline approaches, we follow FedMatch settings in our implementation. Data is distributed evenly across 100 clients for labels-at-client, i.e. we have 5 labels per class per client for a total of 5,000 labeled examples”; a person of ordinary skill in the art would recognize that these experiments would require the use of a computer).

Zhao does not explicitly teach that the local student models comprise an embedding layer; nor outputting, on the client device, an embedding layer output from the embedding layer; transmitting, from the client device, the embedding layer output; performing, on the client device, a forward pass on the first layer, with the embedding layer output as an input, to generate a student model first layer output; receiving, on the client device, a teacher model first layer output; calculating, on the client device, a loss based on the student model first layer output and the teacher model first layer output; nor training the first layer of the local student model based on the student model first layer output and the teacher model first layer output.
Nguyen teaches that the local student models comprise an embedding layer (Nguyen, Page 3, Figure 1: each client is shown with a middle layer with three nodes, which is considered to be the “embedding layer”; Nguyen, Page 3, Col 1, Lines 4-5, “embedding representation features (i.e., activation results e of the intermediate layer)”) and outputting, on the client device, an embedding layer output from the embedding layer (Nguyen, Page 3, Figure 1: the middle layers, shown with three nodes, each have an arrow demonstrating the embedding output being output from the embedding layer, “e_1 = w_1(x)”); transmitting, from the client device, the embedding layer output (Nguyen, Page 2, Contributions, Lines 8-11, “we propose cross-device knowledge transfer (CDKT) with two mechanisms: 1) global knowledge transfer to transfer the knowledge from client models to the global model”; Nguyen, Page 3, Col 1, Lines 2-6, “the transferable knowledge is the outcomes (i.e., activation results z of the final layer) and (or) embedding representation features (i.e., activation results e of the intermediate layer) given the proxy samples as shown in Fig. 1”; see also Nguyen, Page 3, Figure 1); performing, on the client device, a forward pass on the first layer, with the embedding layer output as an input, to generate a student model first layer output (Nguyen, Page 3, Figure 1: the middle layers, shown with three nodes, each have an arrow demonstrating the embedding output being output from the embedding layer, “e_1 = w_1(x) … e_n = w_n(x)”, and the figure demonstrates the embedding output being passed to the layer represented by four nodes, which is considered to be the “first layer,” with the “outcome” as the output, “z_1 = h_1(w_1(x)) … z_n = h_n(w_n(x))”, which is considered to be the “student model first layer output”); receiving, on the client device, a teacher model first layer output (Nguyen, Page 2, Contributions, Lines 8-13, “we propose cross-device knowledge transfer (CDKT) with two mechanisms: 1) global knowledge transfer to transfer the knowledge from client models to the global model, 2) on-device knowledge transfer to transfer the generalized knowledge of the global model to the client models”; see also Nguyen, Page 3, Figure 1: “g_s” can be seen being transferred from the server model, considered to be the teacher model, to the device models); calculating, on the client device, a loss based on the student model first layer output and the teacher model first layer output (Nguyen, Page 4, Section 2.2, Lines 1-6 and Equation 8, “Using a similar design, the on-device learning problem helps the client models to improve their generalization capabilities by imitating the generalized knowledge from the global model. In this approach, the client n utilizes the private dataset Dn and also the generalized knowledge from the global model obtained using the proxy dataset Dr … l_c^n(D_n, D_r) = l_CE^n(z_n | D_n) + α·d(g_n, g_s | D_r) = l_CE^n(z_n | D_n) + α·d(e_n, e_s | D_r) + α·d(z_n, λ·y_s + (1−λ)·z_s | D_r)”); and training the first layer of the local student model based on the student model first layer output and the teacher model first layer output (Nguyen, Page 3, Section 2, Paragraph 2, Lines 5-8, “each client performs K local epochs with its private and proxy dataset based on the gradient of the loss function in the on-device learning problem (8), with the local learning rate η”).

It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention, to have modified the global student model training method of Zhao to include the cross-device knowledge transfer method of Nguyen. The motivation to do so would have been that the knowledge transfer method achieves improved speed and better personalization of local models while tackling privacy leakage issues with minimal communication data load during transfers (Nguyen, Page 1, Section 1, Paragraph 2).

Regarding claim 14, the rejection of claim 13 is incorporated, and further, the proposed combination teaches wherein the embedding layer output does not comprise data from the data set (Nguyen, Page 3, Col 1, Lines 2-6, “the transferable knowledge is the outcomes (i.e., activation results z of the final layer) and (or) embedding representation features (i.e., activation results e of the intermediate layer) given the proxy samples as shown in Fig. 1”; a person of ordinary skill in the art would recognize that an “activation result” of the intermediate layer would not contain data from the original dataset).
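The on-device loss in Nguyen’s Equation (8), cited against claim 13, can be sketched as cross-entropy on the private data plus α-weighted distances between the client’s knowledge (e_n, z_n) and the teacher’s knowledge (e_s, z_s, y_s) on the proxy data. This is a hedged sketch only: d(·,·) is instantiated as mean squared distance for simplicity, whereas Nguyen notes that Norm, KL divergence, or JS divergence may be used, and every function name and toy value below is ours.

```python
import math

def cross_entropy(probs, labels):
    """Mean cross-entropy for one-hot labels over a batch of probability rows."""
    total = 0.0
    for p_row, y_row in zip(probs, labels):
        total -= sum(y * math.log(p + 1e-12) for p, y in zip(p_row, y_row))
    return total / len(probs)

def sq_dist(a, b):
    """Mean squared distance, one possible choice for Nguyen's generic d(.)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def on_device_loss(probs_private, labels_private, e_n, e_s, z_n, z_s, y_s,
                   alpha=0.5, lam=0.5):
    """Sketch of Eq. (8): l_CE on private data plus alpha-weighted distances
    between client knowledge (e_n, z_n) and teacher knowledge (e_s, z_s, y_s)."""
    ce = cross_entropy(probs_private, labels_private)
    d_embed = sq_dist(e_n, e_s)                      # d(e_n, e_s | D_r)
    target = [lam * y + (1.0 - lam) * z for y, z in zip(y_s, z_s)]
    d_out = sq_dist(z_n, target)                     # d(z_n, lam*y_s + (1-lam)*z_s | D_r)
    return ce + alpha * d_embed + alpha * d_out
```

When the client’s embedding and output exactly match the teacher’s blended target, both distance terms vanish and the loss reduces to the private-data cross-entropy term alone.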
Regarding claim 15, the rejection of claim 13 is incorporated, and further, the proposed combination teaches performing, on the client device, a forward pass on a second layer of the local student model, with the embedding layer output as an input, to generate a student model second layer output (Nguyen, Page 3, Figure 1: the middle layers, shown with three nodes, each have an arrow demonstrating the embedding output being output from the embedding layer, “e_1 = w_1(x) … e_n = w_n(x)”, and the figure demonstrates the embedding output being passed to the layer represented by four nodes, which is considered to be the “first layer,” with the “outcome” as the output, “z_1 = h_1(w_1(x)) … z_n = h_n(w_n(x))”, which is considered to be the “student model first layer output”; see also Nguyen, Page 4, Algorithm 1, Lines 2, 4-6, and 12; because the method is done in a loop, the second iteration performed is considered to be performed on “a second layer,” as the layer is updated each iteration, and for the same reason, the output is considered “a student model second layer output”); receiving, on the client device, a teacher model second layer output (Nguyen, Page 2, Contributions, Lines 8-13, “we propose cross-device knowledge transfer (CDKT) with two mechanisms: 1) global knowledge transfer to transfer the knowledge from client models to the global model, 2) on-device knowledge transfer to transfer the generalized knowledge of the global model to the client models”; see also Nguyen, Page 3, Figure 1: “g_s” can be seen being transferred from the server model, considered to be the teacher model, to the device models); calculating, on the client device, a loss based on the local student model second layer output and the teacher model second layer output (Nguyen, Page 4, Section 2.2, Lines 1-6 and Equation 8, “Using a similar design, the on-device learning problem helps the client models to improve their generalization capabilities by imitating the generalized knowledge from the global model. In this approach, the client n utilizes the private dataset Dn and also the generalized knowledge from the global model obtained using the proxy dataset Dr … l_c^n(D_n, D_r) = l_CE^n(z_n | D_n) + α·d(g_n, g_s | D_r) = l_CE^n(z_n | D_n) + α·d(e_n, e_s | D_r) + α·d(z_n, λ·y_s + (1−λ)·z_s | D_r)”); training, on the client device, the second layer of the student model until the student model second layer output converges with the teacher model second layer output (Zhao, Page 7, Section 6.1, Lines 9-10, “we let the method run until convergence (800 epochs, equals 16000 rounds)”; Nguyen, Page 3, Section 2, Paragraph 2, Lines 5-8, “each client performs K local epochs with its private and proxy dataset based on the gradient of the loss function in the on-device learning problem (8), with the local learning rate η”); and transmitting, from the client device, second layer weights of the second layer of the local student model (Zhao, Page 4, Section 3.3, Lines 6-8, “in every t-th communication round the updates from the local student models S_k^t on each k-th client are sent to the server”; see also Zhao, Page 4, Figure 1: the “student gradient,” which is considered to be the “first layer weights of the local student model,” is shown being transmitted from the client to the server; during the second iteration of the algorithm the “student gradient” is considered to be the “second layer weights of the local student model”).
Regarding claim 16, the rejection of claim 13 is incorporated, and further, the proposed combination teaches wherein the client device uses linear layers to match the local student model first layer output and the teacher model first layer output (Nguyen, Page 3, Section 2, Lines 5-8, “We first define the neural network with the prediction outcomes (i.e., z = h ◦ w(x)), where h is the projection head, which is a small neural network (i.e., the last fully-connected layers; a small classifier)”; the “projection heads” are considered “linear layers” which are used to generate the first layer outputs; Nguyen, Page 4, Equation 5, “l_s(D_r) = l_CE(z_s | D_r) + β·d(e_s, (1/N)Σ_{n∈N} e_n | D_r) + β·d(z_s, λ·y_r + (1−λ)·(1/N)Σ_{n∈N} z_n | D_r)”).

Regarding claim 17, the rejection of claim 13 is incorporated, and further, the proposed combination teaches wherein the client device trains the local student model using a Kullback-Leibler loss function (Nguyen, Page 4, Section 2.1, Bullet Point 2, Lines 16-19, “Here, we note that various distance functions d can be used for knowledge transfer, such as Norm, KL divergence, and Jensen-Shannon (JS) divergence”; Nguyen, Page 4, Equation 8, “l_c^n(D_n, D_r) = l_CE^n(z_n | D_n) + α·d(g_n, g_s | D_r) = l_CE^n(z_n | D_n) + α·d(e_n, e_s | D_r) + α·d(z_n, λ·y_s + (1−λ)·z_s | D_r)”; see also Nguyen, Page 8, Section 5.2).

Regarding claim 18, Zhao teaches A system of training a global student model stored on a server, the server comprising a processing device and a memory comprising instructions that are executed by the processing device (Zhao, Page 6, Section 5.2, Lines 1-7, “For experiments on the CIFAR-10 dataset, all training runs use 100 clients in total, 5 clients per round, and 1 local client epoch per round. For ease of comparison with baseline approaches, we follow FedMatch settings in our implementation. Data is distributed evenly across 100 clients for labels-at-client, i.e. we have 5 labels per class per client for a total of 5,000 labeled examples”; a person of ordinary skill in the art would recognize that these experiments would require the use of a computer, providing evidence for “a processing device and a memory comprising instructions”; see also Zhao, Page 4, Figure 1: a server can be seen) to perform a method comprising: storing, on the server, a global student model comprising a first layer and a teacher model comprising a first layer (Zhao, Page 4, Figure 1: a “teacher model” and “student model” can be seen inside the yellow “server” square; a person of ordinary skill in the art would recognize that a machine learning model would comprise “a first layer”); transmitting, from the server, local student models based on the global student model, the local student models each comprising … and a first layer (Zhao, Page 4, Figure 1: the “message” seen at the bottom of the figure is transmitted from the server to the clients, and the “student model” shown in the “client” square is considered to be the “local student model”; a person of ordinary skill in the art would recognize the “local student models” comprise “a first layer”; Zhao, Page 5, Adaptive Switching, Paragraph 2, Lines 13-19, “If the teacher’s mean KL-divergence is closer to β, both the teacher and student model are sent to clients in the next round, and clients use the teacher model to generate pseudo-labels. If the student’s mean KL-divergence is closer to β, only the student model is sent to clients in the next round, and clients use the student model to generate pseudo-labels similar to FedProx-FixMatch”); … receiving, at the server, first layer weights of the local student models (Zhao, Page 4, Section 3.3, Lines 6-8, “in every t-th communication round the updates from the local student models S_k^t on each k-th client are sent to the server”; see also Zhao, Page 4, Figure 1: the “student gradient,” which is considered to be the “first layer weights of the local student model,” is shown being transmitted from the client to the server); and calculating, on the server, first layer weights of the global student model using the received first layer weights of the local student models (Zhao, Page 5, Algorithm 2 Labels-at-Server (sequential), Line 14, “S^{t+1} ← (1/m) Σ_{k∈L_t} S_k^{t+1}”; see also Zhao, Page 4, Figure 1: on the left side of the figure, each client student model can be seen forming one global student model on the server). Zhao also teaches that the method is computer-implemented (Zhao, Page 6, Section 5.2, Lines 1-7, quoted above; a person of ordinary skill in the art would recognize that these experiments would require the use of a computer).
Zhao does not explicitly teach that the local student models comprise an embedding layer; nor does Zhao explicitly teach receiving, at the server, an embedding layer output of one of the local student models; performing, on the server, a forward pass on the first layer of the teacher model, with the embedding layer output as an input, to generate a teacher model first layer output; or transmitting, from the server, the teacher model first layer output. Nguyen teaches that the local student models comprise an embedding layer (Nguyen, Page 3, Figure 1, Each client is shown with a middle layer with three nodes, which is considered to be the “embedding layer”; Nguyen, Page 3, Col 1, Lines 4-5, “embedding representation features (i.e., activation results e of the intermediate layer)”) and receiving, at the server, an embedding layer output of one of the local student models (Nguyen, Page 2, Contributions, Lines 8-11, “we propose cross-device knowledge transfer (CDKT) with two mechanisms: 1) global knowledge transfer to transfer the knowledge from client models to the global model”; Nguyen, Page 3, Col 1, Lines 2-6, “the transferable knowledge is the outcomes (i.e., activation results z of the final layer) and (or) embedding representation features (i.e., activation results e of the intermediate layer) given the proxy samples as shown in Fig.
1”; see also, Nguyen, Page 3, Figure 1); performing, on the server, a forward pass on the first layer of the teacher model, with the embedding layer output as an input, to generate a teacher model first layer output (Nguyen, Page 4, Section 2.1, Bullet point 2, Lines 1-4 and Equation 5, “We formulate the general generalized model construction problem with the global CDKT regularizer using a generic function d … ℓ_s(D_r) = ℓ_CE(z_s | D_r) + β·d(e_s, (1/N) ∑_{n∈N} e_n | D_r) + β·d(z_s, λ·y_r + (1−λ)·(1/N) ∑_{n∈N} z_n | D_r)”; see also, Nguyen, Page 4, Algorithm 1, Steps 8-11; The “embedding layer output” is used during the loss calculation on the server, and is thus considered an “input” to the teacher model; see also Nguyen, Page 3, Figure 1, The final layer of the teacher model, shown with four nodes, is considered to be the “first layer of the teacher model” and the output, “z_s”, is considered to be the “first layer output”); transmitting, from the server, the teacher model first layer output (Nguyen, Page 2, Contributions, Lines 8-13, “we propose cross-device knowledge transfer (CDKT) with two mechanisms: 1) global knowledge transfer to transfer the knowledge from client models to the global model, 2) on-device knowledge transfer to transfer the generalized knowledge of the global model to the client models”; see also Nguyen, Page 3, Figure 1, “g_s” can be seen being transferred from the server model, considered to be the teacher model, to the device models). It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention, to have modified the global student model training method of Zhao to include the cross-device knowledge transfer method of Nguyen.
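The server-side step the rejection maps to Nguyen (receive a client's embedding-layer activations, run them through a teacher layer, transmit the result back to devices) can be sketched as follows. This is an illustrative sketch under assumed shapes: 3-unit embeddings and a 4-unit teacher layer, echoing the node counts the rejection reads off Nguyen's Figure 1; none of the variable names or values come from either reference:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed teacher "first layer": maps 3-dim embedding activations to 4 units.
W_teacher = rng.standard_normal((3, 4))
b_teacher = np.zeros(4)

def teacher_first_layer_forward(embedding_output):
    """Server-side forward pass: client embedding activations in,
    teacher-layer output (the values sent back to devices) out."""
    return embedding_output @ W_teacher + b_teacher

# A client transmits embedding-layer activations for a batch of 8 proxy samples;
# the server runs the teacher layer on them and transmits the output back.
client_embedding = rng.standard_normal((8, 3))
teacher_output = teacher_first_layer_forward(client_embedding)
```

Note the privacy-relevant design point the claims turn on: only intermediate activations cross the network in each direction, never raw local data.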
The motivation to do so would have been that the knowledge transfer method achieves improved speed and better personalization of local models while tackling privacy leakage issues with minimal communication data load during transfers (Nguyen, Page 1, Section 1, Paragraph 2). Regarding claim 19, the rejection of claim 18 is incorporated, and further, the proposed combination teaches wherein the local student models are each transmitted to a different client device (Zhao, Page 5, Algorithm 1 Labels-at-Client Description, Line 1, “Client devices 1,…K”; Zhao, Page 5, Algorithm 2 Labels-at-Server (sequential) Description, Lines 1-2, “Client devices 1,…K”; Zhao, Page 4, Section 3.3, Lines 5-6, “student models are updated after every batch of data on their respective devices”). Regarding claim 20, the rejection of claim 18 is incorporated, and further, the proposed combination teaches wherein the received embedding layer output does not comprise data from a data set stored locally on a client device (Nguyen, Page 3, Col 1, Lines 2-6, “the transferable knowledge is the outcomes (i.e., activation results z of the final layer) and (or) embedding representation features (i.e., activation results e of the intermediate layer) given the proxy samples as shown in Fig. 1”; A person of ordinary skill in the art would recognize that an “activation result” of the intermediate layer would not contain data from the original dataset). Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Zhao in view of Nguyen in further view of Cao et al., C2S: Class-aware client selection for effective aggregation in federated learning, High-Confidence Computing, Volume 2, Issue 3, 2022, 100068, ISSN 2667-2952, https://doi.org/10.1016/j.hcc.2022.100068, hereinafter referred to as “Cao”. Regarding claim 3, the rejection of claim 2 is incorporated. 
The proposed combination does not explicitly teach wherein the local student model training layer weights are aggregated by weighting the local student models based on training sample size. Cao teaches wherein the local student model training layer weights are aggregated by weighting the local student models based on training sample size (Cao, Page 2, Section 2.1.2, Paragraph 2, Lines 6-9, “All those local parameters will be aggregated through weighted averaging in server, with weight of one client defined as the size of its local dataset over the total size of data in all selected clients”). It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention, to have modified the global student model training method of the proposed combination to include weighting local student model training layer weights during aggregation based on training sample size as taught by Cao. The motivation to do so would have been that weighted aggregation works well for IID data and is relatively simple and straightforward (Cao, Page 2, Section 2.1.2, Paragraph 2, Lines 1-2, “Alternatively, a more simple and straightforward way is to set the same model for server and clients. The typical procedure is as follows”; Cao, Page 1, Section 1, Paragraph 4, Lines 1-2, “This scheme works well for IID (independently identically distribution) data”). Claims 10-11 are rejected under 35 U.S.C. 103 as being unpatentable over Zhao in view of Nguyen in further view of Tan et al., Federated Learning from Pre-Trained Models: A Contrastive Learning Approach, 09/21/2022, https://arxiv.org/pdf/2209.10083, hereinafter referred to as “Tan”. Regarding claim 10, the rejection of claim 1 is incorporated. The proposed combination thus far does not explicitly teach wherein the embedding layer is pre-trained on the server using the teacher model.
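The dataset-size weighting Cao describes above (each client's parameters weighted by its local dataset size over the total size across selected clients) is a weighted variant of federated averaging. A minimal NumPy sketch; the names and numbers are illustrative, not from Cao:

```python
import numpy as np

def weighted_aggregate(client_updates, sample_counts):
    """Weighted averaging: each client's layer weights count in
    proportion to the size of its local dataset."""
    total = sum(sample_counts)
    return {
        name: sum(n * u[name] for u, n in zip(client_updates, sample_counts)) / total
        for name in client_updates[0]
    }

clients = [
    {"first_layer": np.array([0.0, 0.0])},
    {"first_layer": np.array([1.0, 1.0])},
]
# The second client holds three times the data, so its update gets weight 3/4.
agg = weighted_aggregate(clients, sample_counts=[1, 3])
```

With equal sample counts this reduces to the plain average; the weighting only changes outcomes when clients hold differently sized datasets.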
Tan teaches wherein the embedding layer is pre-trained on the server using the teacher model (Tan, Page 14, Section A.1.3, Lines 1-3, “For the single backbone cases, we use ResNet18 pre-trained on Quickdraw as the backbone. For the multiple backbone cases, we use three pre-trained ResNet18 as the backbones. They are pre-trained on Quick Draw [79], Aircraft [80], and CU-Birds [19] public dataset, respectively”; Because the teacher model contains an embedding layer, the pre-training is considered to be performed on the server using the teacher model; the “backbone” is considered to be the “embedding layer”; see also Tan, Page 4, Figure 1). It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention, to have modified the global student model training method of the proposed combination to include pre-training the embedding layer on the server using the teacher model as taught by Tan. The motivation to do so would have been that using pre-trained models reduces cost (Tan, Page 2, Lines 4-6, “Using the pre-trained foundation models as the fixed encoder can efficiently reduce costs because neither complicated backward propagation computation nor large-scale neural network transmission between the server and clients is needed during the training stage”). Regarding claim 11, the rejection of claim 10 is incorporated, and further, the proposed combination teaches wherein the local student models are not transmitted until a loss of the embedding layer is less than a threshold loss (Zhao, Page 5, Adaptive Switching, Paragraph 2, Lines 7-19, “For each unlabeled batch, we calculate the KL-divergence between the generated pseudo-label predictions and an uniform distribution for both the teacher and student model. At the end of each round, the server aggregates the divergences from all participating clients into a global KL-divergence for the teacher and student model respectively. 
If the teacher’s mean KL-divergence is closer to β, both the teacher and student model are sent to clients in the next round, and clients use the teacher model to generate pseudo-labels. If the student’s mean KL-divergence is closer to β, only the student model is sent to clients in the next round, and clients use the student model to generate pseudo-labels similar to FedProx-FixMatch”; Zhao, Page 6, Equation 3, “T′ = T if |D_KL(T) − β| < |D_KL(S) − β|, ∅ otherwise”).

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Tian et al., FedBERT: When Federated Learning Meets Pre-training, 08/24/2022, ACM Trans. Intell. Syst. Technol. 13, 4, Article 66 (August 2022), 26 pages, https://doi.org/10.1145/3510033: Tian discloses a method that takes advantage of federated learning and split learning to pre-train BERT in a federated way to grant clients with limited computing capability the ability to participate in pre-training a large model.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to MOLLY CLARKE SIPPEL whose telephone number is (571)272-3270. The examiner can normally be reached Monday - Friday, 7:30 a.m. - 4:30 p.m. ET. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki, can be reached at (571)272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users.
To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /M.C.S./ Examiner, Art Unit 2122 /KAKALI CHAKI/ Supervisory Patent Examiner, Art Unit 2122

Prosecution Timeline

Apr 26, 2023
Application Filed
Feb 04, 2026
Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602592
NOISE COMMUNICATION FOR FEDERATED LEARNING
2y 5m to grant Granted Apr 14, 2026
Patent 12596916
CONSTRAINED MASKING FOR SPARSIFICATION IN MACHINE LEARNING
2y 5m to grant Granted Apr 07, 2026
Based on 2 most recent grants.


Prosecution Projections

1-2
Expected OA Rounds
50%
Grant Probability
99%
With Interview (+58.3%)
3y 7m
Median Time to Grant
Low
PTA Risk
Based on 14 resolved cases by this examiner. Grant probability derived from career allow rate.
