DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 01/20/2026 has been entered.
Status of Claims
The present application is being examined on the basis of the claims filed 01/20/2026.
Claims 1-9 and 11-21 are pending.
Response to Amendment
This Office Action is in response to Applicant’s communication filed 01/20/2026, replying to the Office action mailed 08/20/2025. The Applicant’s remarks and any amendments to the claims or specification have been considered, with the results that follow.
Response to Arguments
In Remarks pages 11-12, Argument 1
(Examiner summarizes Applicant’s arguments) Applicant argues that Lu does not disclose the newly amended portion of the claims “update a student machine learning model on the apparatus to a federated student machine learning model”, but instead teaches many-to-one teacher/student learning with many client models used to teach a base cloud model.
Examiner’s response to Argument 1
Although Lu does teach many-to-one teacher/student learning, it is important to note the additional steps that Lu describes for updating models. Lu teaches a comprehensive system of updating models across cloud and edge devices. To illustrate, the examiner sets out the steps taken in Lu and how they teach the limitation as amended; the rejection has been updated to reflect how the amendments change the scope of the claims.
Step 1: Initial cloud (federated teacher) model training: “The cloud trains the very first cloud model from an initial dataset.”
Step 2: Initial client model (student machine learning model) training: “When a new device di joins the system at time T0, it asks the cloud for the latest model. The cloud compresses (see 3.2 for more details) its latest cloud model M0 into a small one and sends it to the device.”
Step 3: Client model update training (federated student machine learning model training): “device di pulls the latest cloud model M1 and merge it with the current client model mi0 through knowledge distillation, resulting in a new client model mi1. […] After client model mi1 being generated, its parameters are pushed to the cloud”. Examiner notes that after the client model is updated, it is uploaded to the cloud and is thus an updated student model on the apparatus.
Step 4: Cloud model re-training: “Once receiving model parameters from N of devices, cloud also updates its model. This is done again by knowledge distillation, but with multiple teacher models” and “Client model update (Stage 3) and cloud model update (Stage 4) happen repeatedly. For example, at time T2, di pulls the latest cloud model M2 (or Mk2 if di was classified into group k) and distills the knowledge into client model mi1 using the data collected during time period T2 −T1, resulting a new client model mi2, and then pushes it to the cloud.”
Therefore, after repeated iterations of steps 3 and 4, the student machine learning model on the cloud is updated to a new version, which is a federated student machine learning model (step 3), and this update is performed in dependence upon the edge node models (step 4 of a previous iteration forms the cloud model of the current iteration). Although the student model is present on the edge nodes during part of this process before it is uploaded, the claims are open ended and do not exclude involvement of the edge nodes in the updating process.
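For clarity of the record, the repeated cloud/client update cycle described above can be illustrated with the following simplified sketch. This is the examiner's illustrative pseudocode only, not code from Lu; all function names, factors, and parameter values are hypothetical stand-ins for Lu's knowledge-distillation steps.

```python
# Examiner's illustrative sketch (hypothetical; not code from Lu) of the
# repeated update cycle: compress the cloud model, update client models
# locally, push them back, and re-train the cloud model from the pushes.

def compress(cloud_model):
    # Stand-in for knowledge distillation of the large cloud model
    # into a small client (student) model (Lu section 3.2).
    return [w * 0.5 for w in cloud_model]

def local_update(client_model, local_gradient):
    # Stand-in for on-device incremental learning on local data.
    return [w - 0.1 * g for w, g in zip(client_model, local_gradient)]

def cloud_update(pushed_client_models):
    # Stand-in for cloud distillation from the N pushed client (teacher)
    # models; simplified here to coordinate-wise parameter averaging.
    n = len(pushed_client_models)
    return [sum(ws) / n for ws in zip(*pushed_client_models)]

cloud = [1.0, 2.0]                                         # initial cloud model
clients = [compress(cloud) for _ in range(3)]              # sent to 3 devices
clients = [local_update(m, [0.2, -0.2]) for m in clients]  # local training
cloud = cloud_update(clients)                              # cloud re-training
```

After one cycle, the cloud model depends on the pushed client models, which in turn were derived from the previous cloud model, mirroring the iterative dependence discussed above.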
In Remarks pages 12-13, Argument 2
(Examiner summarizes Applicant’s arguments) Applicant argues that in Examiner’s interpretation of Lu, the teacher is updated in dependence upon the edge nodes and not the student model.
Examiner’s response to Argument 2
While the teacher is updated in dependence upon the edge nodes, the student is further updated in dependence upon the teacher model. Therefore, the student update depends on the teacher model which depends on the machine learning models of the edge nodes. This means that the student update ultimately depends on the machine learning models of the edge nodes. Moreover, the update as described in the argument above further involves the machine learning model of the edge node in updating the compact model on the node.
In Remarks pages 14-15, Argument 3
(Examiner summarizes Applicant’s arguments) Applicant argues that Bucila is completely silent on teaching a student model which has been updated in dependence upon updated models of edge nodes.
Examiner’s response to Argument 3
In response to applicant's arguments against the references individually, one cannot show nonobviousness by attacking references individually where the rejections are based on combinations of references. See In re Keller, 642 F.2d 413, 208 USPQ 871 (CCPA 1981); In re Merck & Co., 800 F.2d 1091, 231 USPQ 375 (Fed. Cir. 1986). Applicant argues that Bucila does not teach portions of the claim for which Lu is relied upon.
In Remarks page 15, Argument 4
(Examiner summarizes Applicant’s arguments) Applicant argues that none of the cited references, alone or in combination, teaches claim 1. Applicant argues that the independent claims and all dependent claims are allowable for the same reasons. Applicant requests consideration of new claim 21.
Examiner’s response to Argument 4
The rejection of claim 1 is maintained for the reasons provided above, and the rejections of the dependent claims and analogous independent claims are maintained for similar reasons. New claim 21 is also rejected, as Lu teaches averaging parameters of the edge nodes (see the rejection of claim 21 below).
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 8-9, 11-12, 19, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over NPL reference “Collaborative Learning between Cloud and End Devices: An Empirical Study on Location Prediction”, hereinafter referred to as Lu, in view of NPL reference “Model Compression”, hereinafter referred to as Bucila.
Regarding Claim 1
Lu teaches:
An apparatus for a federated machine learning system, the apparatus being a central node of the federated machine learning system
(Abstract line 8) "This paper proposes Colla, a collaborative learning approach for behavior prediction that allows cloud and devices to learn collectively and continuously"; (page 140 column 1 paragraph 1 line 6) “To realize the benefits of cloud-device collaboration, we let the cloud side[*Examiner notes: cloud mapped to the apparatus, a central node] do the heavy lifting at beginning to train an initial cloud model. Devices then take over to perform incremental learning tasks using their local data, and build their own models (a.k.a., client model) in a distributed way[*Examiner notes: mapped to federated learning] for local inference.”; (page 141 column 2 section 3 line 2) “We illustrate the learning process using a star topology as an example where each device connects directly to the cloud[*Examiner notes: the cloud model is a central node].”; [*Examiner notes: One can compare the “star topology” described in the reference with the layout shown in Figure 1. The “cloud” corresponds to the central node and the “devices” correspond to the edge nodes.];
and the apparatus comprising: At least one processor; and at least one memory including computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to
(page 144 column 1 paragraph 3 line 1) "We implement the cloud part of Colla on Microsoft Azure cloud[*Examiner notes: mapped to computer program code]. In specific, we use an Azure virtual machine with Intel Xeon CPU@2.60GHz[*Examiner notes: mapped to at least one processor], 128GB memory and 3TB storage[*Examiner notes: mapped to at least one memory], 8 NVIDIA Tesla P100 GPU cards (CUDA 9.0.0) with 16GB GPU memory for experiments"
update a student machine learning model on the apparatus, to a federated student machine learning network on the apparatus
(page 142 column 1 paragraph 2) “When a new device di joins the system at time T0, it asks the cloud for the latest model. The cloud compresses (see 3.2 for more details) its latest cloud model M0 into a small one[*Examiner notes: student machine learning model on the apparatus] and sends it to the device.”; (page 142 column 1 last paragraph) “For example, at time T2, di pulls the latest cloud model M2 (or Mk2 if di was classified into group k) and distills the knowledge into client model mi1 using the data collected during time period T2 −T1, resulting a new client model mi2, and then pushes it to the cloud[*Examiner notes: federated student machine learning network on the apparatus].”; [*Examiner notes: The original compressed model on the cloud is the original student machine learning model which is then updated at the appropriate time via the client node. After the model is pushed to the cloud, the result is a “federated student machine learning model on the apparatus”]
in dependence upon updated machine learning models of one or more edge nodes of the federated machine learning system
(page 142 column 2 paragraph 1) “Similarly, cloud updates its model from time to time using received models from devices.”; [*Examiner notes: The compressed model (federated student model) depends on the large cloud model, which depends on the models received from devices (machine learning models of one or more other nodes). Thus the compressed model is updated in dependence upon other nodes]; (page 141 column 2 section 3 line 2) “We illustrate the learning process using a star topology as an example where each device[*Examiner notes: mapped to edge nodes of the federated learning system] connects directly to the cloud”; [*Examiner notes: One can compare the “star topology” described in the reference with the layout shown in Figure 1. The “cloud” corresponds to the central node and the “devices” correspond to the other/edge nodes.]
wherein the apparatus is configured for a machine learning task and the one or more edge nodes are configured for the same machine learning task as the apparatus
(page 140 top of column 2 bullet point 2) "We study the feasibility of Colla by applying a multifeature RNN network to the problems of smartphone-based location prediction."
Lu does not explicitly teach:
produce, with a teacher machine learning network on the apparatus, pseudo-labels, for supervised learning of the federated student machine learning network on the apparatus, by using received unlabeled data; and
teach, with the teacher machine learning network on the apparatus, by supervised learning, the federated student machine learning network on the apparatus in dependence upon the received unlabeled data and the produced pseudo-labels
[*Examiner note: Lu does teach (page 142 column 1 paragraph 2) “When a new device di joins the system at time T0, it asks the cloud for the latest model. The cloud compresses (see 3.2 for more details) its latest cloud model M0 into a small one and sends it to the device. di uses the compressed small model as its first client model mi0 to perform inference in the coming time period T = T1 −T0.” Thus Lu does teach compressing “a teacher machine learning network on the apparatus” into a “federated student machine learning network on the apparatus” which is a teacher-student relationship. However, Lu does not explicitly disclose producing pseudo-labels and performing the model compression using the pseudo labels.]
However, Bucila teaches:
produce, with a teacher machine learning network on the apparatus, pseudo-labels, for supervised learning of the federated student machine learning network on the apparatus, by using received unlabeled data; and
(page 535 column 2 first paragraph) “Instead of training the neural net on the original (often small) training set used to train the ensemble, we use the ensemble to label[*Examiner notes: produce pseudo labels] a large unlabeled data set[*Examiner notes: mapped to received unlabeled data] and then train the neural net on this much larger, ensemble labeled, data set”
teach, with the teacher machine learning network on the apparatus, by supervised learning, the federated student machine learning network on the apparatus in dependence upon the received unlabeled data and the produced pseudo-labels
(page 535 column 2 paragraph 1) “In this paper we show how to compress the function that is learned by a complex model into a much smaller, faster model that has comparable performance.”; (page 535 column 2 paragraph 1) “Instead of training the neural net on the original (often small) training set used to train the ensemble, we use the ensemble to label[*Examiner notes: produced pseudo-labels] a large unlabeled data[*Examiner notes: received unlabeled data] set and then train the neural net on this much larger, ensemble labeled, data set. This yields a neural net that makes predictions similar to the ensemble, and which performs much better than a neural net trained on the original training set.”
Lu, Bucila, and the instant application are analogous because they are all directed to machine learning.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the present invention to modify the federated learning of Lu with the model compression of Bucila as a substitute for the model compression of Lu because (Bucila page 535 column 2 paragraph 1) “This yields a neural net that makes predictions similar to the ensemble, and which performs much better than a neural net trained on the original training set.”
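For clarity of the record, the labeling-then-training sequence quoted from Bucila can be illustrated as follows. This is the examiner's simplified sketch only: the teacher function, data grid, and toy "training" procedure are hypothetical stand-ins for the ensemble (teacher) and the neural net (student) of Bucila.

```python
# Examiner's illustrative sketch (hypothetical; not Bucila's implementation):
# a teacher labels unlabeled data, producing pseudo-labels, and a small
# student is then trained, supervised, on the pseudo-labeled set.

def teacher(x):
    # Stand-in for the large ensemble/teacher network.
    return 1 if x > 0.35 else 0

unlabeled = [i / 10 - 1.0 for i in range(21)]     # received unlabeled data
pseudo_labels = [teacher(x) for x in unlabeled]   # teacher-produced labels

def train_student(xs, ys):
    # Toy supervised training: choose the decision threshold that best
    # fits the pseudo-labeled set (stand-in for gradient training).
    candidates = [i / 20 - 1.0 for i in range(41)]
    def errors(t):
        return sum((1 if x > t else 0) != y for x, y in zip(xs, ys))
    return min(candidates, key=errors)

student_threshold = train_student(unlabeled, pseudo_labels)
```

The student learns a function that agrees with the teacher on the unlabeled set, which is the effect Bucila relies upon for model compression.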
Regarding Claim 8
Lu in view of Bucila teaches:
The apparatus as claimed in claim 1
(see rejection of claim 1)
And Lu further teaches:
wherein the federated student machine learning network is configured to update a student machine learning model of the federated student machine learning network in dependence upon the updated machine learning models of the one or more edge nodes.
(page 143 column 2 paragraph 4) “Cloud Distillation. Cloud distillation uses multiple client models to teach the cloud model[*Examiner notes: mapped to student machine learning model of the federated student machine learning network] by fine-tuning it on the cloud dataset. It aims to match the output probabilities of the cloud model to the average of the softmax output of each client model.”; Figure 2; [*Examiner notes: Lu discloses knowledge distillation in both directions. In cloud distillation, the client models serve as teachers and the cloud model serves as the student. See figure 2 annotated below]
[Annotated Figure 2 of Lu: media_image1.png (greyscale)]
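For clarity of the record, the cloud-distillation objective quoted above, matching the cloud model's output probabilities to the average of the client models' softmax outputs, can be illustrated numerically. This is the examiner's simplified sketch only; the logit values are hypothetical and do not come from Lu.

```python
# Examiner's illustrative sketch (hypothetical values; not code from Lu):
# average the client (teacher) softmax outputs and measure how far the
# cloud (student) model's output probabilities are from that target.
import math

def softmax(logits):
    exps = [math.exp(z) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

client_logits = [[2.0, 0.5], [1.0, 1.0], [0.0, 3.0]]  # N client model outputs
client_probs = [softmax(z) for z in client_logits]
target = [sum(p[k] for p in client_probs) / len(client_probs)
          for k in range(2)]                          # averaged teacher softmax

def cross_entropy(target, probs):
    # Loss the cloud (student) model would minimize during fine-tuning.
    return -sum(t * math.log(p) for t, p in zip(target, probs))

cloud_probs = softmax([1.0, 1.2])                     # current cloud output
loss = cross_entropy(target, cloud_probs)
```

Minimizing this loss drives the cloud model's output probabilities toward the average of the client softmax outputs, as described in the quoted passage.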
Regarding Claim 9
Lu in view of Bucila teaches:
The apparatus as claimed in claim 1
(see rejection of claim 1)
And Lu further teaches:
wherein model parameters of the federated student machine learning network are used to update model parameters of one or more another student machine learning networks.
(page 142 column 1 paragraph 4) “Once receiving model parameters from N of devices, cloud also updates its model. This is done again by knowledge distillation, but with multiple teacher models (i.e., N pushed models from end device di”.
Regarding Claim 11
Lu in view of Bucila teaches:
The apparatus as claimed in claim 1
(see rejection of claim 1)
And Lu further teaches:
wherein the federated machine learning system has a centralized federated machine learning system
(page 141 column 2 paragraph 5) “In this section, we describe the collaborative learning framework. It consists of cloud and a set of end devices such as smartphones. We illustrate the learning process using a star topology as an example where each device connects directly to the cloud.”
Regarding Claim 12
Lu teaches:
A method for a federated machine learning system, comprising:
(Abstract line 8) "This paper proposes Colla, a collaborative learning approach for behavior prediction that allows cloud and devices to learn collectively and continuously"; (page 140 column 1 paragraph 1 line 6) “To realize the benefits of cloud-device collaboration, we let the cloud side do the heavy lifting at beginning to train an initial cloud model. Devices then take over to perform incremental learning tasks using their local data, and build their own models (a.k.a., client model) in a distributed way[*Examiner notes: mapped to federated learning] for local inference.”; [*Examiner notes: According to page 9 of the specification, “Federated learning is a form of collaborative machine learning.” Moreover, federated learning is a term of art which is commonly used synonymously with collaborative learning or distributed learning, and is a machine learning technique focusing on settings in which multiple entities (often referred to as clients) collaboratively train a model while ensuring that their data remains decentralized.]
in a central node, updating a student machine learning model on the central node to a federated student machine learning network of the central node,
(page 142 column 1 paragraph 2) “When a new device di joins the system at time T0, it asks the cloud for the latest model. The cloud compresses (see 3.2 for more details) its latest cloud model M0 into a small one[*Examiner notes: student machine learning model on the apparatus] and sends it to the device.”; (page 142 column 1 last paragraph) “For example, at time T2, di pulls the latest cloud model M2 (or Mk2 if di was classified into group k) and distills the knowledge into client model mi1 using the data collected during time period T2 −T1, resulting a new client model mi2, and then pushes it to the cloud[*Examiner notes: federated student machine learning network on the apparatus].”; [*Examiner notes: The original compressed model on the cloud is the original student machine learning model which is then updated at the appropriate time via the client node. After the model is pushed to the cloud, the result is a “federated student machine learning model on the apparatus”. This process occurs in part on the cloud model (central node), and also includes the edge nodes.]
in dependence upon updated machine learning models of one or more other nodes;
(page 142 column 2 paragraph 1) “Similarly, cloud updates its model from time to time using received models from devices.”; [*Examiner notes: The compressed model (federated student model) depends on the large cloud model, which depends on the models received from devices (machine learning models of one or more other nodes)]; (page 141 column 2 section 3 line 2) “We illustrate the learning process using a star topology as an example where each device[*Examiner notes: mapped to edge nodes of the federated learning system] connects directly to the cloud [*Examiner notes: One can compare the “star topology” described in the reference with the layout shown in Figure 1. The “cloud” corresponds to the central node and the “devices” correspond to the edge nodes.];
wherein the node is configured for a same machine learning task as the one or more other nodes
(page 140 top of column 2 bullet point 2) "We study the feasibility of Colla by applying a multifeature RNN network to the problems of smartphone-based location prediction."
Lu does not explicitly teach:
producing, with a teacher machine learning network on the central node, pseudo-labels, for supervised learning of the federated student machine learning network on the central node, by using received unlabeled data
and teaching, with the teacher machine learning network on the central node, by using supervised learning, the federated student machine learning network on the central node in dependence upon the received unlabeled data and the produced pseudo-labels
[*Examiner note: Lu does teach (page 142 column 1 paragraph 2) “When a new device di joins the system at time T0, it asks the cloud for the latest model. The cloud compresses (see 3.2 for more details) its latest cloud model M0 into a small one and sends it to the device. di uses the compressed small model as its first client model mi0 to perform inference in the coming time period T = T1 −T0.” Thus Lu does teach compressing “a teacher machine learning network on the apparatus” into a “federated student machine learning network on the apparatus” which is a teacher-student relationship. However, Lu does not explicitly disclose producing pseudo-labels and performing the model compression using the pseudo labels.]
However, Bucila teaches:
producing, with a teacher machine learning network on the central node, pseudo-labels, for supervised learning of the federated student machine learning network on the central node, by using received unlabeled data
(page 535 column 2 first paragraph) “Instead of training the neural net on the original (often small) training set used to train the ensemble, we use the ensemble to label[*Examiner notes: produce pseudo labels] a large unlabeled data set[*Examiner notes: mapped to received unlabeled data] and then train the neural net on this much larger, ensemble labeled, data set”
and teaching, with the teacher machine learning network on the central node, by using supervised learning, the federated student machine learning network on the central node in dependence upon the received unlabeled data and the produced pseudo-labels
(page 535 column 2 paragraph 1) “In this paper we show how to compress the function that is learned by a complex model into a much smaller, faster model that has comparable performance.”; (page 535 column 2 paragraph 1) “Instead of training the neural net on the original (often small) training set used to train the ensemble, we use the ensemble to label[*Examiner notes: produced pseudo-labels] a large unlabeled data[*Examiner notes: received unlabeled data] set and then train the neural net on this much larger, ensemble labeled, data[*Examiner notes: supervised learning] set. This yields a neural net that makes predictions similar to the ensemble, and which performs much better than a neural net trained on the original training set.”
Lu, Bucila, and the instant application are analogous because they are all directed to machine learning.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the present invention to modify the federated learning of Lu with the model compression of Bucila as a substitute for the model compression of Lu because (Bucila page 535 column 2 paragraph 1) “This yields a neural net that makes predictions similar to the ensemble, and which performs much better than a neural net trained on the original training set.”
Regarding Claim 19
Lu in view of Bucila teaches:
The method as claimed in claim 12
(see rejection of claim 12)
And Lu further teaches:
further comprising: updating a student machine learning model of the federated student machine learning network in dependence upon the updated machine learning models of the one or more other nodes.
(page 143 column 2 paragraph 4) “Cloud Distillation. Cloud distillation uses multiple client models to teach the cloud model[*Examiner notes: mapped to student machine learning model of the federated student machine learning network] by fine-tuning it on the cloud dataset. It aims to match the output probabilities of the cloud model to the average of the softmax output of each client model.”; Figure 2; [*Examiner notes: Lu discloses knowledge distillation in both directions. In cloud distillation, the client models serve as teachers and the cloud model serves as the student. See figure 2 annotated below]
[Annotated Figure 2 of Lu: media_image1.png (greyscale)]
Regarding Claim 20
Lu teaches:
An apparatus for a federated machine learning system configured to a teacher-student machine learning mode,
(Abstract line 8) "This paper proposes Colla, a collaborative learning approach for behavior prediction that allows cloud and devices to learn collectively and continuously"; (page 142 column 2 paragraph 3) “the cloud compresses its large base model into a small one through knowledge distillation[*Examiner notes: mapped to teacher-student learning mode]"
the apparatus being a central node of the federated learning system and the apparatus comprising; at least one processor; and at least one memory including computer program code; the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to:
(page 144 column 1 paragraph 3 line 1) "We implement the cloud part of Colla[*Examiner notes: apparatus being a central node] on Microsoft Azure cloud[*Examiner notes: mapped to computer program code]. In specific, we use an Azure virtual machine with Intel Xeon CPU@2.60GHz[*Examiner notes: mapped to at least one processor], 128GB memory and 3TB storage[*Examiner notes: mapped to at least one memory], 8 NVIDIA Tesla P100 GPU cards (CUDA 9.0.0) with 16GB GPU memory for experiments"
send the trained federated student machine learning network to one or more client nodes,
(page 142 column 2 paragraph 3) "As shown in Stage 2 in Figure 2, when a new device comes, the cloud sends the compressed small model to the device."; Lu Figure 2 box 2
receive one or more updated client student machine learning models from one or more client nodes for the sent trained federated student machine learning network
(page 142 column 1 paragraph 4) "Once receiving model parameters from N of devices, cloud also updates its model. This is done again by knowledge distillation, but with multiple teacher model"; (page 143 column 1 paragraph 2 line 9) "Note that the base cloud model is also updated using all the received client models and thus new devices may always get the latest cloud model to start with, before it goes to a group."; [*Examiner notes: When a new device joins the system after the cloud updates its model, the latest cloud model (which was updated using the client machine learning models) is compressed (the compressed model was mapped to the sent trained federated student machine learning model). In this way, the one or more updated client student machine learning models were received for the sent trained federated student machine learning network. That is, when a new device joins, the following steps have occurred: 1. Cloud model is compressed and sent to client devices, 2. Client models train on local data, 3. Cloud model is updated using client models, 4. When a new device joins, the cloud model is again compressed into a small model and sent to the device]
and update the federated student machine learning network with the one or more updated client student machine learning models.
(page 142 column 1 paragraph 2) "When a new device di joins the system at time T0, it asks the cloud for the latest model. The cloud compresses (see 3.2 for more details) its latest cloud model M0 into a small one and sends it to the device."; [*Examiner notes: The latest cloud model has been updated by client student machine learning models as evidenced by the mapping of the previous limitation. Therefore, compressing the latest cloud model and sending it to the device is updating the federated student machine learning network with the one or more updated client student machine learning models.]
retrain, with the teacher machine learning network on the apparatus, by supervised learning, the one or more updated client student machine learning models
(page 142 column 1 paragraph 3) “After collecting a reasonable amount of data or simply after a fixed time period T , device di pulls the latest cloud model M1 and merge it with the current client model mi0 through knowledge distillation, resulting in a new client model mi1.” (page 143 column 2 paragraph 2) “On the device side, knowledge is transferred from cloud model to client model. Therefore, we use cloud model as teacher model and conduct knowledge distillation on the data available on each device. Follow Equation 2, the loss function of client model i can be calculated as [Equation 2]”; [*Examiner notes: Supervised learning is learning with labeled training data. The quoted loss function uses labels from the cloud model (teacher machine learning network on the apparatus) to re-train the client models (client student machine learning models) at set time intervals, or data collection intervals]
Lu does not explicitly teach:
produce, with a teacher machine learning network on the apparatus, pseudo-labels, for supervised learning of a federated student machine learning network on the apparatus, by using received unlabeled data
train, with the teacher machine learning network on the apparatus, by supervised learning, the federated student machine learning network on the apparatus in dependence upon the received unlabeled data and the produced pseudo-labels
[*Examiner note: Lu does teach (page 142 column 1 paragraph 2) “When a new device di joins the system at time T0, it asks the cloud for the latest model. The cloud compresses (see 3.2 for more details) its latest cloud model M0 into a small one and sends it to the device. di uses the compressed small model as its first client model mi0 to perform inference in the coming time period T = T1 −T0.” Thus Lu does teach compressing “a teacher machine learning network on the apparatus” into a “federated student machine learning network on the apparatus” which is a teacher-student relationship. However, Lu does not explicitly disclose producing pseudo-labels and performing the model compression using the pseudo labels.]
However, Bucila teaches:
produce, with a teacher machine learning network on the apparatus, pseudo-labels, for supervised learning of a federated student machine learning network on the apparatus, by using received unlabeled data
(page 535 column 2 first paragraph) “Instead of training the neural net on the original (often small) training set used to train the ensemble, we use the ensemble to label[*Examiner notes: produce pseudo labels] a large unlabeled data set[*Examiner notes: mapped to received unlabeled data] and then train the neural net on this much larger, ensemble labeled, data set”
train, with the teacher machine learning network on the apparatus, by supervised learning, the federated student machine learning network on the apparatus in dependence upon the received unlabeled data and the produced pseudo-labels
(page 535 column 2 paragraph 1) “In this paper we show how to compress the function that is learned by a complex model into a much smaller, faster model that has comparable performance.”; (page 535 column 2 paragraph 1) “Instead of training the neural net on the original (often small) training set used to train the ensemble, we use the ensemble to label[*Examiner notes: produced pseudo-labels] a large unlabeled data[*Examiner notes: received unlabeled data] set and then train the neural net on this much larger, ensemble labeled, data set. This yields a neural net that makes predictions similar to the ensemble, and which performs much better than a neural net trained on the original training set.”
Lu, Bucila, and the instant application are analogous because they are all directed to machine learning.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the present invention to modify the federated learning of Lu with the model compression of Bucila as a substitute for the model compression of Lu because (Bucila page 535 column 2 paragraph 1) “This yields a neural net that makes predictions similar to the ensemble, and which performs much better than a neural net trained on the original training set.”
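[*Examiner note: for illustration only, the Bucila-style compression mapped above — a teacher network producing pseudo-labels on unlabeled data, and a smaller student network trained by supervised learning on those pseudo-labels — can be sketched generically as follows. This is the examiner's own toy example (linear teacher and student in NumPy), not code from any cited reference:]

```python
import numpy as np

rng = np.random.default_rng(0)

def teacher_predict(x, w_teacher):
    """Teacher network: produce pseudo-labels (hard argmax here) for unlabeled inputs."""
    logits = x @ w_teacher
    return logits.argmax(axis=1)  # pseudo-labels

def train_student(x_unlabeled, pseudo_labels, n_classes, lr=0.1, epochs=500):
    """Supervised training of a smaller student on (unlabeled data, pseudo-labels)."""
    w = np.zeros((x_unlabeled.shape[1], n_classes))
    onehot = np.eye(n_classes)[pseudo_labels]
    for _ in range(epochs):
        logits = x_unlabeled @ w
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        # cross-entropy gradient step toward the teacher's pseudo-labels
        w -= lr * x_unlabeled.T @ (p - onehot) / len(x_unlabeled)
    return w

# Toy setup: a "large" teacher with fixed weights, and unlabeled data.
w_teacher = rng.normal(size=(4, 3))
x_u = rng.normal(size=(500, 4))
y_pseudo = teacher_predict(x_u, w_teacher)
w_student = train_student(x_u, y_pseudo, n_classes=3)
agreement = (x_u @ w_student).argmax(axis=1) == y_pseudo
print(agreement.mean())  # the student largely reproduces the teacher's pseudo-labels
```

[*Examiner note continued: the point of the sketch is only that the student is trained in dependence upon the received unlabeled data and the produced pseudo-labels, as the claim recites.]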
Regarding Claim 21
Lu in view of Bucila teaches:
The apparatus as claimed in claim 1
(see rejection of claim 1)
And Lu further teaches:
wherein the at least one processor is further configured to cause the apparatus to, update the student machine learning model to the federated student machine learning network by at least one of averaging parameters of the updated machine learning models of the one or more edge nodes or weighted averaging the parameters of the updated machine learning models of the one or more edge nodes.
(page 143 column 2 just above equation 4) “Cloud distillation uses multiple client models to teach the cloud model by fine-tuning it on the cloud dataset. It aims to match the output probabilities of the cloud model to the average of the softmax output of each client model.”; [*Examiner notes: As explained above, the small model (federated student) is updated in dependence on the large model. More succinctly, the method of Lu (1) compresses the large cloud model into a small cloud model (the federated student on the apparatus) and sends it to the clients as client models (the one or more edge nodes), (2) updates the cloud model by averaging the parameters of the client models [see above], and (3) updates the small client model using the updated cloud model and sends it to the cloud (hence updating the small cloud model). Averaging the parameters is part of this updating process.]
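[*Examiner note: for illustration only, the averaging and weighted averaging of edge-node model parameters recited in claim 21 can be sketched generically as follows; this is the examiner's own toy example, not code from Lu:]

```python
import numpy as np

def federated_average(client_params, weights=None):
    """Combine edge-node model parameters by (optionally weighted) averaging."""
    stacked = np.stack(client_params)
    if weights is None:
        return stacked.mean(axis=0)            # plain parameter average
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                            # normalize the weights
    return np.tensordot(w, stacked, axes=1)    # weighted parameter average

# Parameters reported by three edge nodes.
clients = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
print(federated_average(clients))                     # [3. 4.]
print(federated_average(clients, weights=[1, 1, 2]))  # [3.5 4.5]
```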
Claims 2-4 and 13-15 are rejected under 35 U.S.C. 103 as being unpatentable over Lu, Bucila, and further in view of NPL reference “KDGAN: Knowledge Distillation with Generative Adversarial Networks” hereby referred to as Wang.
Regarding Claim 2:
Lu in view of Bucila teaches:
The apparatus as claimed in claim 1
(see rejection of claim 1)
And Bucila further teaches:
receive the unlabeled data
(page 535 column 2 paragraph 2) “In some domains, unlabeled data is easy to obtain. In other domains, however, large data sets (labeled or unlabeled) are not available. In these domains, we generate synthetic cases that as closely as possible match the distribution of the original training set.”
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the present invention to combine Lu and Bucila for the same reasons given in claim 1 above.
Lu in view of Bucila does not teach:
further comprising an adversarial machine learning network that is configured to cause the apparatus to:
receive the produced pseudo-labels from the teacher machine learning network,
receive label-estimates from the federated student machine learning network,
and provide an adversarial loss to the teacher machine learning network, for training the teacher machine learning network.
However, Wang teaches:
further comprising an adversarial machine learning network that is configured to cause the apparatus to:
(Abstract line 12) “The classifier and the teacher learn from each other via distillation losses and are adversarially trained against the discriminator via adversarial losses.”
receive the produced pseudo-labels from the teacher machine learning network, receive label-estimates from the federated student machine learning network,
(page 2 paragraph 4 line 5) “Specifically, the classifier and the teacher, serving as generators, aim to fool the discriminator by generating pseudo labels that resemble the true labels.”; [*Examiner notes: the classifier’s pseudo label provided to the discriminator is a label-estimate]
and provide an adversarial loss to the teacher machine learning network,
(page 2 paragraph 4 line 3) "In addition to the distillation loss in KD and the adversarial losses in NaGAN mentioned above, we define a distillation loss from the classifier to the teacher and an adversarial loss between the teacher and the discriminator."; Figure 2
for training the teacher machine learning network.
(abstract) "The classifier and the teacher learn from each other via distillation losses and are adversarially trained against the discriminator via adversarial losses."; Algorithm 1 at the top of page 5.
The combination would teach the adversarial network of Wang configured to receive the unlabeled data of Bucila, and thus the combination teaches an adversarial machine learning network that is configured to cause the apparatus to: receive the unlabeled data.
Lu, Bucila, Wang, and the instant application are analogous because they are all directed towards machine learning.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the present invention to modify the teacher and student machine learning models and unlabeled data of Lu in view of Bucila so that the adversarial network (discriminator) of Wang provides an adversarial loss to the teacher machine learning network and federated student machine learning network of Lu because combining knowledge distillation (as referenced in Lu) with generative neural networks (as referenced in Wang) in this manner (Wang page 7 paragraph 3 line 3) "consistently outperforms the KD-based methods by a large margin".
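[*Examiner note: for illustration only, the generator-side adversarial loss mapped from Wang — a discriminator scoring (input, label) pairs, with the resulting loss fed back both to the teacher and to the student — can be sketched generically as follows; the linear discriminator and the particular loss form are the examiner's own toy assumptions, not code from Wang:]

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def discriminator_score(x, label_onehot, w_d):
    """Discriminator: score an (input, label) pair as 'true' versus 'generated'."""
    return sigmoid(np.concatenate([x, label_onehot]) @ w_d)

def adversarial_loss(x, generated_label_onehot, w_d):
    """Generator-side adversarial loss: large when the discriminator
    confidently flags the generated (pseudo) label as fake."""
    d = discriminator_score(x, generated_label_onehot, w_d)
    return -np.log(d + 1e-12)

rng = np.random.default_rng(1)
w_d = rng.normal(size=(4 + 3,))
x = rng.normal(size=4)
teacher_pseudo = np.eye(3)[0]    # pseudo-label produced by the teacher
student_estimate = np.eye(3)[1]  # label-estimate produced by the student
loss_teacher = adversarial_loss(x, teacher_pseudo, w_d)    # fed back to train the teacher
loss_student = adversarial_loss(x, student_estimate, w_d)  # fed back to train the student
print(loss_teacher > 0 and loss_student > 0)
```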
Regarding Claim 3:
Lu in view of Bucila teaches:
The apparatus as claimed in claim 1
(see rejection of claim 1)
And Bucila further teaches:
receive the unlabeled data
(page 535 column 2 paragraph 2) “In some domains, unlabeled data is easy to obtain. In other domains, however, large data sets (labeled or unlabeled) are not available. In these domains, we generate synthetic cases that as closely as possible match the distribution of the original training set.”
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the present invention to combine Lu and Bucila for the same reasons given in claim 1 above.
Lu in view of Bucila does not teach:
further comprising an adversarial machine learning network that is configured to cause the apparatus to:
receive the produced pseudo-labels from the teacher machine learning network, receive label-estimates from the federated student machine learning network
and provide an adversarial loss to the federated student machine learning network
for training the federated student machine learning network
However, Wang teaches:
further comprising an adversarial machine learning network that is configured to cause the apparatus to:
(Abstract line 12) “The classifier and the teacher learn from each other via distillation losses and are adversarially trained against the discriminator via adversarial losses.”
receive the produced pseudo-labels from the teacher machine learning network, receive label-estimates from the federated student machine learning network
(page 2 paragraph 4 line 5) “Specifically, the classifier and the teacher, serving as generators, aim to fool the discriminator by generating pseudo labels that resemble the true labels.”; [*Examiner notes: the classifier’s pseudo label provided to the discriminator is a label-estimate]
and provide an adversarial loss to the federated student machine learning network
(page 2 paragraph 4 line 3) "In addition to the distillation loss in KD and the adversarial losses in NaGAN mentioned above, we define a distillation loss from the classifier to the teacher and an adversarial loss between the teacher and the discriminator"; Figure 2
for training the federated student machine learning network.
(abstract) "The classifier and the teacher learn from each other via distillation losses and are adversarially trained against the discriminator via adversarial losses."; Algorithm 1 at the top of page 5
The combination would teach the adversarial network of Wang configured to receive the unlabeled data of Bucila, and thus the combination teaches an adversarial machine learning network that is configured to cause the apparatus to: receive the unlabeled data.
Lu, Bucila, Wang, and the instant application are analogous because they are all directed towards machine learning.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the present invention to modify the teacher and student machine learning models and unlabeled data of Lu in view of Bucila so that the adversarial network (discriminator) of Wang provides an adversarial loss to the teacher machine learning network and federated student machine learning network of Lu because combining knowledge distillation (as referenced in Lu) with generative neural networks (as referenced in Wang) in this manner (Wang page 7 paragraph 3 line 3) "consistently outperforms the KD-based methods by a large margin".
Regarding Claim 4:
Lu in view of Bucila teaches:
The apparatus as claimed in claim 1,
(see rejection of claim 1)
And Bucila further teaches:
receive the unlabeled data
(page 535 column 2 paragraph 2) “In some domains, unlabeled data is easy to obtain. In other domains, however, large data sets (labeled or unlabeled) are not available. In these domains, we generate synthetic cases that as closely as possible match the distribution of the original training set.”
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the present invention to combine Lu and Bucila for the same reasons given in claim 1 above.
Lu in view of Bucila does not teach:
further comprising an adversarial machine learning network that is configured to cause the apparatus to:
receive the produced pseudo-labels from the teacher machine learning network, receive label-estimates from the federated student machine learning network,
and provide an adversarial loss to the teacher machine learning network and the federated student machine learning network for training simultaneously and/or parallelly the federated student machine learning network and the teacher machine learning network.
However, Wang teaches:
further comprising an adversarial machine learning network that is configured to cause the apparatus to:
(Abstract line 12) “The classifier and the teacher learn from each other via distillation losses and are adversarially trained against the discriminator via adversarial losses.”
receive the produced pseudo-labels from the teacher machine learning network, receive label-estimates from the federated student machine learning network,
(page 2 paragraph 4 line 5) “Specifically, the classifier and the teacher, serving as generators, aim to fool the discriminator by generating pseudo labels that resemble the true labels.”; [*Examiner notes: the classifier’s pseudo label provided to the discriminator is a label-estimate]
and provide an adversarial loss to the teacher machine learning network and the federated student machine learning network for training simultaneously and/or parallelly the federated student machine learning network and the teacher machine learning network.
(abstract) "The classifier and the teacher learn from each other via distillation losses and are adversarially trained against the discriminator via adversarial losses. By simultaneously optimizing the distillation and adversarial losses, the classifier will learn the true data distribution at the equilibrium."; Algorithm 1 at the top of page 5 shows simultaneous learning.
The combination would teach the adversarial network of Wang configured to receive the unlabeled data of Bucila, and thus the combination teaches an adversarial machine learning network that is configured to cause the apparatus to: receive the unlabeled data.
Lu, Bucila, Wang, and the instant application are analogous because they are all directed towards machine learning.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the present invention to modify the teacher and student machine learning models and unlabeled data of Lu in view of Bucila so that the adversarial network (discriminator) of Wang provides an adversarial loss to the teacher machine learning network and federated student machine learning network of Lu because combining knowledge distillation (as referenced in Lu) with generative neural networks (as referenced in Wang) in this manner "consistently outperforms the KD-based methods by a large margin" (Wang page 7 paragraph 3 line 3).
Regarding Claim 13
Lu in view of Bucila teaches:
The method as claimed in claim 12
(see rejection of claim 12)
And Bucila further teaches:
receiving the unlabeled data
(page 535 column 2 paragraph 2) “In some domains, unlabeled data is easy to obtain. In other domains, however, large data sets (labeled or unlabeled) are not available. In these domains, we generate synthetic cases that as closely as possible match the distribution of the original training set.”
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the present invention to combine Lu and Bucila for the same reasons given in claim 12 above.
Lu in view of Bucila does not teach:
wherein the node further includes an adversarial machine learning network, the method further comprising:
receiving the produced pseudo-labels from the teacher machine learning network, receiving label-estimates from the federated student machine learning network,
and providing an adversarial loss to the teacher machine learning network for training the teacher machine learning network.
However, Wang teaches:
wherein the node further includes an adversarial machine learning network, the method further comprising:
(Abstract line 12) “The classifier and the teacher learn from each other via distillation losses and are adversarially trained against the discriminator via adversarial losses.”
receiving the produced pseudo-labels from the teacher machine learning network, receiving label-estimates from the federated student machine learning network,
(page 2 paragraph 4 line 5) “Specifically, the classifier and the teacher, serving as generators, aim to fool the discriminator by generating pseudo labels that resemble the true labels.”; [*Examiner notes: the classifier’s pseudo label provided to the discriminator is a label-estimate]
and providing an adversarial loss to the teacher machine learning network
(page 2 paragraph 4 line 3) "In addition to the distillation loss in KD and the adversarial losses in NaGAN mentioned above, we define a distillation loss from the classifier to the teacher and an adversarial loss between the teacher and the discriminator"; Figure 2
for training the teacher machine learning network.
(abstract) "The classifier and the teacher learn from each other via distillation losses and are adversarially trained against the discriminator via adversarial losses."; Algorithm 1 at the top of page 5
The combination would teach the adversarial network of Wang receiving the unlabeled data of Bucila, and thus the combination teaches the node further including an adversarial machine learning network, the method further comprising receiving the unlabeled data.
Lu, Bucila, Wang, and the instant application are analogous because they are all directed towards machine learning.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the present invention to modify the teacher and student machine learning models and unlabeled data of Lu in view of Bucila so that the adversarial network (discriminator) of Wang provides an adversarial loss to the teacher machine learning network and federated student machine learning network of Lu because combining knowledge distillation (as referenced in Lu) with generative neural networks (as referenced in Wang) in this manner "consistently outperforms the KD-based methods by a large margin" (Wang page 7 paragraph 3 line 3).
Regarding Claim 14
Lu in view of Bucila teaches:
The method as claimed in claim 12
(see rejection of claim 12)
And Bucila further teaches:
receiving the unlabeled data,
(page 535 column 2 paragraph 2) “In some domains, unlabeled data is easy to obtain. In other domains, however, large data sets (labeled or unlabeled) are not available. In these domains, we generate synthetic cases that as closely as possible match the distribution of the original training set.”
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the present invention to combine Lu and Bucila for the same reasons given in claim 12 above.
Lu in view of Bucila does not teach:
wherein the node further includes an adversarial machine learning network, the method further comprising:
receiving the produced pseudo-labels from the teacher machine learning network, receiving label-estimates from the federated student machine learning network,
and providing an adversarial loss to the federated student machine learning network for training the federated student machine learning network.
However, Wang teaches:
wherein the node further includes an adversarial machine learning network, the method further comprising:
(Abstract line 12) “The classifier and the teacher learn from each other via distillation losses and are adversarially trained against the discriminator via adversarial losses.”
receiving the produced pseudo-labels from the teacher machine learning network, receiving label-estimates from the federated student machine learning network,
(page 2 paragraph 4 line 5) “Specifically, the classifier and the teacher, serving as generators, aim to fool the discriminator by generating pseudo labels that resemble the true labels.”; [*Examiner notes: the classifier’s pseudo label provided to the discriminator is a label-estimate]
and providing an adversarial loss to the federated student machine learning network for training the federated student machine learning network
(page 2 paragraph 4 line 3) "In addition to the distillation loss in KD and the adversarial losses in NaGAN mentioned above, we define a distillation loss from the classifier to the teacher and an adversarial loss between the teacher and the discriminator"; Figure 2
The combination would teach the adversarial network of Wang receiving the unlabeled data of Bucila, and thus the combination teaches the node further including an adversarial machine learning network, the method further comprising receiving the unlabeled data.
Lu, Bucila, Wang, and the instant application are analogous because they are all directed towards machine learning.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the present invention to modify the teacher and student machine learning models and unlabeled data of Lu in view of Bucila so that the adversarial network (discriminator) of Wang provides an adversarial loss to the teacher machine learning network and federated student machine learning network of Lu because combining knowledge distillation (as referenced in Lu) with generative neural networks (as referenced in Wang) in this manner "consistently outperforms the KD-based methods by a large margin" (Wang page 7 paragraph 3 line 3).
Regarding Claim 15
Lu in view of Bucila teaches:
The method as claimed in claim 12
(see rejection of claim 12)
And Bucila further teaches:
receiving the unlabeled data,
(page 535 column 2 paragraph 2) “In some domains, unlabeled data is easy to obtain. In other domains, however, large data sets (labeled or unlabeled) are not available. In these domains, we generate synthetic cases that as closely as possible match the distribution of the original training set.”
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the present invention to combine Lu and Bucila for the same reasons given in claim 12 above.
Lu in view of Bucila does not teach:
wherein the node further includes an adversarial machine learning network, the method further comprising:
receiving the produced pseudo-labels from the teacher machine learning network, receiving label-estimates from the federated student machine learning network,
and providing an adversarial loss to the teacher machine learning network and the federated student machine learning network for training simultaneously and/or parallelly the federated student machine learning network and the teacher machine learning network.
However, Wang teaches:
wherein the node further includes an adversarial machine learning network, the method further comprising:
(Abstract line 12) “The classifier and the teacher learn from each other via distillation losses and are adversarially trained against the discriminator via adversarial losses.”
receiving the produced pseudo-labels from the teacher machine learning network, receiving label-estimates from the federated student machine learning network,
(page 2 paragraph 4 line 5) “Specifically, the classifier and the teacher, serving as generators, aim to fool the discriminator by generating pseudo labels that resemble the true labels.”; [*Examiner notes: the classifier’s pseudo label provided to the discriminator is a label-estimate]
and providing an adversarial loss to the teacher machine learning network and the federated student machine learning network for training simultaneously and/or parallelly the federated student machine learning network and the teacher machine learning network
(abstract) "The classifier and the teacher learn from each other via distillation losses and are adversarially trained against the discriminator via adversarial losses. By simultaneously optimizing the distillation and adversarial losses, the classifier will learn the true data distribution at the equilibrium."; Algorithm 1 at the top of page 5 shows simultaneous learning.
The combination would teach the adversarial network of Wang receiving the unlabeled data of Bucila, and thus the combination teaches the node further including an adversarial machine learning network, the method further comprising receiving the unlabeled data.
Lu, Bucila, Wang, and the instant application are analogous because they are all directed towards machine learning.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the present invention to modify the teacher and student machine learning models and unlabeled data of Lu in view of Bucila so that the adversarial network (discriminator) of Wang provides an adversarial loss to the teacher machine learning network and federated student machine learning network of Lu because combining knowledge distillation (as referenced in Lu) with generative neural networks (as referenced in Wang) in this manner "consistently outperforms the KD-based methods by a large margin" (Wang page 7 paragraph 3 line 3).
Claims 5-7 and 16-18 are rejected under 35 U.S.C. 103 as being unpatentable over Lu, Bucila, and further in view of NPL reference “Deep Clustering for Unsupervised Learning of Visual Features” herein referred to as Caron.
Regarding Claim 5
Lu in view of Bucila teaches:
The apparatus as claimed in claim 1,
(see rejection of claim 1)
And Lu further teaches
wherein the supervised learning in dependence upon the received unlabeled data and the produced pseudo-labels further comprises supervised learning of the federated student machine learning network
“Therefore, we use cloud model as teacher model and conduct knowledge distillation on the data available on each device. Follow Equation 2, the loss function of client model i can be calculated as” (Lu page 143 column 2 paragraph 2); Equations 2 and 3 include data labels
Lu in view of Bucila does not explicitly teach:
and, as an auxiliary task, unsupervised learning of the teacher machine learning network
However Caron teaches:
and, as an auxiliary task, unsupervised learning of the teacher machine learning network
(Abstract line 9) "We apply DeepCluster to the unsupervised training of convolutional neural networks on large datasets like ImageNet and YFCC100M."; [*Examiner notes: Caron teaches unsupervised learning as a task, and the combination teaches the teacher machine learning network with the task of teaching the student machine learning network of Lu in view of Bucila with the additional clustering task of Caron. Thus the combination teaches as an auxiliary task, unsupervised learning of the teacher machine learning network.]
Lu, Bucila, Caron, and the instant application are analogous as they are all directed to machine learning.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the present invention to combine the teachings of Lu in view of Bucila with the unsupervised learning of Caron as an auxiliary task of the cloud model (teacher) because (Caron, page 15 paragraph 1 line 4) "If trained on large dataset like ImageNet or YFCC100M, it achieves performance that are significantly better than the previous state-of-the-art on every standard transfer task".
Regarding Claim 6
Lu in view of Bucila teaches:
The apparatus as claimed in claim 1,
(see rejection of claim 1)
Lu in view of Bucila does not teach:
further configured to cause the apparatus to cluster by unsupervised learning of the teacher machine learning network so that intra-cluster mean distance is minimized and inter-cluster mean distance is maximized.
However, Caron teaches:
further configured to cause the apparatus to cluster by unsupervised learning of the teacher machine learning network
(page 1 paragraph 2 line 1) “In this paper, we make the following contributions: (i) a novel unsupervised method”; (page 5 paragraph 2 line 9) "We cluster the output of the convnet and use the subsequent cluster assignments as 'pseudo-labels'”.
so that intra-cluster mean distance is minimized and inter-cluster mean distance is maximized.
(page 5 paragraph 3 line 3) "we focus on a standard clustering algorithm, k-means." [*Examiner notes: the standard k-means clustering algorithm minimizes intra-cluster mean distance and maximizes inter-cluster mean distance.]
Lu, Bucila, Caron, and the instant application are analogous as they are all directed to machine learning.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the present invention to combine the teachings of Lu in view of Bucila with the unsupervised learning of Caron as an auxiliary task of the cloud model (teacher) because (Caron, page 15 paragraph 1 line 4) "If trained on large dataset like ImageNet or YFCC100M, it achieves performance that are significantly better than the previous state-of-the-art on every standard transfer task".
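[*Examiner note: for illustration only, the statement above that standard k-means clustering minimizes intra-cluster mean distance (and, for well-separated data, yields centroids far apart relative to cluster spread) can be demonstrated with the following generic sketch; this is the examiner's own toy example, not code from Caron:]

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Plain k-means: alternating assignment and centroid-update steps
    monotonically reduce the within-cluster (intra-cluster) distance."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(x[:, None] - centers[None], axis=2)
        assign = d.argmin(axis=1)
        centers = np.array([x[assign == j].mean(axis=0) if np.any(assign == j)
                            else centers[j] for j in range(k)])
    return assign, centers

def intra_cluster_cost(x, assign, centers):
    """Mean distance of each point to its assigned centroid."""
    return np.mean(np.linalg.norm(x - centers[assign], axis=1))

rng = np.random.default_rng(42)
# Two well-separated blobs of unlabeled points.
x = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
assign, centers = kmeans(x, k=2)
cost = intra_cluster_cost(x, assign, centers)
sep = np.linalg.norm(centers[0] - centers[1])
print(cost < sep)  # clusters are tight relative to the gap between centroids
```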
Regarding Claim 7
Lu in view of Bucila teaches:
The apparatus as claimed in claim 1,
(see rejection of claim 1)
Lu in view of Bucila does not teach:
wherein the teacher machine learning network is further configured to cause the apparatus to cluster the received unlabeled data and the produced pseudo-labels
so that intra-cluster mean distance is minimized and inter-cluster mean distance is maximized.
However, Caron teaches:
wherein the teacher machine learning network is further configured to cause the apparatus to cluster the received unlabeled data and the produced pseudo-labels
(page 1 paragraph 2 line 1) “In this paper, we make the following contributions: (i) a novel unsupervised method”; (page 5 paragraph 2 line 9) "We cluster the output of the convnet and use the subsequent cluster assignments as 'pseudo-labels'”.
so that intra-cluster mean distance is minimized and inter-cluster mean distance is maximized.
(page 5 paragraph 3 line 3) "we focus on a standard clustering algorithm, k-means." [*Examiner notes: the standard k-means clustering algorithm minimizes intra-cluster mean distance and maximizes inter-cluster mean distance.]
Lu, Bucila, Caron, and the instant application are analogous as they are all directed to machine learning.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the present invention to combine the teachings of Lu in view of Bucila with the unsupervised learning of Caron as an auxiliary task of the cloud model (teacher) because (Caron, page 15 paragraph 1 line 4) "If trained on large dataset like ImageNet or YFCC100M, it achieves performance that are significantly better than the previous state-of-the-art on every standard transfer task".
Regarding Claim 16
Lu in view of Bucila teaches:
The method as claimed in claim 12
(see rejection of claim 12)
And Lu further teaches:
wherein the supervised learning in dependence upon the received unlabeled data and the produced pseudo-labels further comprises supervised learning of the federated student machine learning network
(Lu page 143 column 2 paragraph 2) “Therefore, we use cloud model as teacher model and conduct knowledge distillation on the data available on each device. Follow Equation 2, the loss function of client model i can be calculated as”; Equations 2 and 3 include data labels
Lu in view of Bucila does not teach:
and, as an auxiliary task, unsupervised learning of the teacher machine learning network
However, Caron teaches:
and, as an auxiliary task, unsupervised learning of the teacher machine learning network
(Abstract line 9) "We apply DeepCluster to the unsupervised training of convolutional neural networks on large datasets like ImageNet and YFCC100M."; [*Examiner notes: Caron teaches unsupervised learning as a task, and the combination teaches the teacher machine learning network with the task of teaching the student machine learning network of Lu in view of Bucila with the additional clustering task of Caron. Thus the combination teaches as an auxiliary task, unsupervised learning of the teacher machine learning network.]
Lu, Bucila, Caron, and the instant application are analogous as they are all directed to machine learning.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the present invention to combine the teachings of Lu in view of Bucila with the unsupervised learning of Caron as an auxiliary task of the cloud model (teacher) because (Caron, page 15 paragraph 1 line 4) "If trained on large dataset like ImageNet or YFCC100M, it achieves performance that are significantly better than the previous state-of-the-art on every standard transfer task".
Regarding Claim 17
Lu in view of Bucila teaches:
The method as claimed in claim 12,
(see rejection of claim 12)
Lu in view of Bucila does not teach:
further comprising: clustering by unsupervised learning of the teacher machine learning network
so that intra-cluster mean distance is minimized and inter-cluster mean distance is maximized.
However Caron teaches:
further comprising: clustering by unsupervised learning of the teacher machine learning network
(page 1 paragraph 2 line 1) “In this paper, we make the following contributions: (i) a novel unsupervised method”; (page 5 paragraph 2 line 9) "We cluster the output of the convnet and use the subsequent cluster assignments as 'pseudo-labels'”.
so that intra-cluster mean distance is minimized and inter-cluster mean distance is maximized.
(page 5 paragraph 3 line 3) "we focus on a standard clustering algorithm, k-means." [*Examiner notes: the standard k-means clustering algorithm minimizes intra-cluster mean distance and maximizes inter-cluster mean distance.]
Lu, Bucila, Caron, and the instant application are analogous as they are all directed to machine learning.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the present invention to combine the teachings of Lu in view of Bucila with the unsupervised learning of Caron as an auxiliary task of the cloud model (teacher) because (Caron, page 15 paragraph 1 line 4) "If trained on large dataset like ImageNet or YFCC100M, it achieves performance that are significantly better than the previous state-of-the-art on every standard transfer task".
Regarding Claim 18
Lu in view of Bucila teaches:
The method as claimed in claim 12,
(see rejection of claim 12)
Lu in view of Bucila does not teach:
wherein the teacher machine learning network is further configured for clustering the received unlabeled data and the produced pseudo-labels
so that intra-cluster mean distance is minimized and inter-cluster mean distance is maximized.
However Caron teaches:
wherein the teacher machine learning network is further configured for clustering the received unlabeled data and the produced pseudo-labels
(page 5 paragraph 2 line 9) "We cluster the output of the convnet and use the subsequent cluster assignments as 'pseudo-labels'".
so that intra-cluster mean distance is minimized and inter-cluster mean distance is maximized.
(page 5 paragraph 3 line 3) "we focus on a standard clustering algorithm, k-means." [*Examiner notes: the standard k-means clustering algorithm minimizes intra-cluster mean distance (the within-cluster sum of squared distances to each centroid) and, by the variance decomposition of a fixed dataset, correspondingly maximizes inter-cluster mean distance.]
Lu, Bucila, Caron, and the instant application are analogous as they are all directed to machine learning.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the present invention to combine the teachings of Lu in view of Bucila with the unsupervised learning of Caron as an auxiliary task of the cloud model (teacher) because (Caron, page 15 paragraph 1 line 4) "If trained on large dataset like ImageNet or YFCC100M, it achieves performance that are significantly better than the previous state-of-the-art on every standard transfer task".
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: NPL reference Lin et al. “Ensemble Distillation for Robust Model Fusion in Federated Learning” teaches updating a student model on an apparatus to a federated student model on the apparatus (see page 3 last paragraph).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Ezra J Baker whose telephone number is (703)756-1087. The examiner can normally be reached Monday - Friday 10:00 am - 8:00 pm ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, David Yi can be reached at (571) 270-7519. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/E.J.B./Examiner, Art Unit 2126
/DAVID YI/Supervisory Patent Examiner, Art Unit 2126