Prosecution Insights
Last updated: April 19, 2026
Application No. 18/185,550

MACHINE LEARNING MODEL TRAINING METHOD AND RELATED DEVICE

Non-Final OA: §102, §103
Filed: Mar 17, 2023
Examiner: TRAN, TAN H
Art Unit: 2141
Tech Center: 2100 (Computer Architecture & Software)
Assignee: Huawei Technologies Co., Ltd.
OA Round: 1 (Non-Final)

Grant Probability: 60% (Moderate)
OA Rounds: 1-2
To Grant: 3y 6m
With Interview: 92%

Examiner Intelligence

Career Allow Rate: 60% of resolved cases (184 granted / 307 resolved; +4.9% vs TC avg)
Interview Lift: +31.8% in resolved cases with interview (strong)
Avg Prosecution: 3y 6m typical timeline; 60 applications currently pending
Total Applications: 367 across all art units

Statute-Specific Performance

§101: 14.4% (-25.6% vs TC avg)
§103: 55.3% (+15.3% vs TC avg)
§102: 19.2% (-20.8% vs TC avg)
§112: 6.1% (-33.9% vs TC avg)
Comparison baseline: Tech Center average estimate. Based on career data from 307 resolved cases.

Office Action

Rejections under §102 and §103
Notice of Pre-AIA or AIA Status

1. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION

2. This action is in response to the original filing on 03/17/2023. Claims 1-20 are pending and have been considered below.

Information Disclosure Statement

3. The information disclosure statement (IDS) submitted on 09/09/2024 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Claim Objections

4. Claims 3, 11-12, and 19-20 are objected to because of the following informalities:

Claim 3 recites “before the obtaining at least one first machine learning module” where “before the obtaining at least one first machine learning model” was apparently intended.

Claim 11 recites “select, from the plurality of modules, a at least one first neural network module” where “select, from the plurality of modules, at least one first neural network module” was apparently intended.

Claim 12 recites “updating weight parameters of the stored plurality of modules based on the at least one updated neural network module” where “updating weight parameters of a plurality of modules stored based on the at least one updated neural network module” was apparently intended.

Claim 19 recites “the device … a first data set stored in the first client device” where “the training device … a first data set stored in a first client device” was apparently intended.

Claim 20 recites “obtaining at least one first machine learning model corresponding to the first client, wherein the first client is one of the plurality of clients … updating weight parameters of the stored neural network modules” where “obtaining at least one first machine learning model corresponding to a first client, wherein the first client is one of plurality of clients … updating weight parameters of stored neural network modules” was apparently intended.
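For context on the art: the §102 rejection below repeatedly cites Reisser's federated-averaging scheme, in which each client (shard) performs local gradient updates and the server updates the global model at round t by averaging the returned parameters (Reisser para. [0070]). A minimal sketch of that pattern follows; the function names, dictionary parameter layout, and the toy one-parameter update rule are illustrative assumptions, not Reisser's actual implementation:

```python
# Hypothetical sketch of federated averaging: local training per shard,
# then server-side element-wise averaging of the returned parameters.

def local_update(global_params, local_data, lr=0.1, epochs=1):
    """One client's local pass (toy gradient step fitting w*x ~ y, scalar x)."""
    params = dict(global_params)
    for _ in range(epochs):
        for x, y in local_data:
            for k in params:
                # simplified residual-based step; stands in for real SGD
                params[k] -= lr * (params[k] * x - y)
    return params

def federated_average(client_params):
    """Server aggregation: element-wise mean of the clients' parameters."""
    n = len(client_params)
    return {k: sum(p[k] for p in client_params) / n
            for k in client_params[0]}

# One round t with two shards holding different local datasets.
global_params = {"w": 0.0}
shards = [[(1.0, 1.0)], [(1.0, 3.0)]]
local_models = [local_update(global_params, shard) for shard in shards]
global_params = federated_average(local_models)
print(global_params)
```

The averaging step is what the rejection maps to the claimed "updating weight parameters of a plurality of modules stored in the server."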
Claim Rejections - 35 USC § 102

5. The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

6. Claims 1, 2, 7, 12-13, 15, and 19-20 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Reisser et al. (U.S. Patent Application Pub. No. US 20230118025 A1).

Claim 1: Reisser teaches a machine learning model training method (i.e. federated learning involves collaborative training of neural network models across multiple users, without the need to gather the data at a central location; para. [0065]) performed by a first client device (i.e. A client or user may receive a neural network model at a local device such as a smartphone; para. [0092, 0093]), the machine learning model training comprises a plurality of rounds of iteration, and one of the plurality of rounds of iteration (i.e. multiple gradient updates for parameters w in the inner optimization of objective may be performed for each shard S, thus obtaining local models with parameters Ws. The multiple gradient updates may be referred to as local epochs such as an amount of data passes through the entire local data set, with an abbreviation of E. Each shard may then communicate data corresponding to the local model Ws to the server. In turn, the server updates the global model at round t by averaging the parameters of the local model.
This may be referred to as federated averaging; para. [0070, 0089-0091], federated training process in which local updates are repeatedly produced and aggregated, and the server updates the global model “at round t”) comprises: obtaining at least one first machine learning model (i.e. receives a neural network model from a server; para. [0092]), wherein the at least one first machine learning model is selected based on a data feature of a first data set stored in the first client device (i.e. At block 706, the method 700 selects one or more of the specialized models based in part on a characteristic associated with the local dataset. A client or user has a different set of parameters and may select which experts to use based on local data; para. [0093-0095], candidate models being selected using that local data characteristic); performing a training operation on the at least one first machine learning model by using the first data set, to obtain at least one trained first machine learning model (i.e. At block 708, the method 700 generates a personalized model by fine tuning the neural network model based the selected one or more specialized models and the local dataset; para. [0070, 0095], fine-tuning/multiple gradient updates on the local dataset are training operations producing a trained local model); and sending at least one updated module comprised in the at least one trained first machine learning model to a server communicably coupled to the first client device (i.e. Each shard may then communicate data corresponding to the local model Ws to the server; para. [0070, 0082, 0089], local model information/expert specific updates from client/shard to the central server), wherein the updated module is used by the server to update weight parameters of a plurality of modules stored in the server (i.e. At block 604, the method 600 computes a global update for the neural network model based on the local updates from the subset of the multiple users. 
In some aspects, the global update is computed by aggregating the local updates received from the subset of the multiple user. In some aspects, the neural network model comprises multiple independent neural network models; para. [0070, 0082, 0090], the server aggregates client updates and updates parameters (weights) accordingly).

Claim 2: Reisser teaches the method according to claim 1. Reisser further teaches wherein the plurality of modules are configured to construct at least two second machine learning models (i.e. At block 604, the method 600 computes a global update for the neural network model based on the local updates from the subset of the multiple users. In some aspects, the global update is computed by aggregating the local updates received from the subset of the multiple user. In some aspects, the neural network model comprises multiple independent neural network models; para. [0090]), and the at least one first machine learning model is selected from the at least two second machine learning models (i.e. The neural network model is collaboratively trainable across multiple clients via a set of specialized neural network models; para. [0092, 0094]).

Claim 7: Reisser teaches the method according to claim 1. Reisser further teaches wherein the machine learning model is a neural network (i.e. a neural network model; para. [0006]), and the method further comprises: receiving a selector sent by the server, wherein the selector is a neural network configured to select, from the plurality of modules, at least one neural network module that matches the data feature of the first data set (i.e. FIG. 7 is a flow diagram illustrating a method 700 for generating a personalized neural network model, according to aspects of the present disclosure. At block 702, the method 700 receives a neural network model from a server. The neural network model is collaboratively trainable across multiple clients via a set of specialized neural network models. Each specialized neural network is associated with a subset of a first dataset. A client or user may receive a neural network model at a local device such as a smartphone, for example. As described, a mixture of experts may model a data set where different subsets of the data exhibit different relationships between input x and output y; para. [0092]); inputting training data into the selector based on the first data set, to obtain indication information output by the selector, wherein the indication information comprises a probability that each of the plurality of modules is selected (i.e. the global parameters of an expert are trained using all data points assigned to that expert across all shards to enable learning more robust features. The robustness of the expert’s features may serve as conditions for the gating function rather than training an entirely separate model for pθs (x|s). Given a set of intermediary features hs(x) of expert k, a local vector πs ∈ ℝK; para. [0085]), and the indication information indicates a neural network module that constructs at least one first neural network (i.e. The neural network model is collaboratively trainable across multiple clients via a set of specialized neural network models. Each specialized neural network being associated with a subset of a first dataset; para. [0010]); and receiving, from the server, the neural network module that constructs the at least one first neural network (i.e. FIG. 7 is a flow diagram illustrating a method 700 for generating a personalized neural network model, according to aspects of the present disclosure.
At block 702, the method 700 receives a neural network model from a server; para. [0092]).

Claim 12: Reisser teaches a machine learning model training method (i.e. federated learning involves collaborative training of neural network models across multiple users, without the need to gather the data at a central location; para. [0065]), performed by a server (i.e. a server; para. [0010]), the machine learning model training comprises a plurality of rounds of iteration, and one of the plurality of rounds of iteration (i.e. multiple gradient updates for parameters w in the inner optimization of objective may be performed for each shard S, thus obtaining local models with parameters Ws. The multiple gradient updates may be referred to as local epochs such as an amount of data passes through the entire local data set, with an abbreviation of E. Each shard may then communicate data corresponding to the local model Ws to the server. In turn, the server updates the global model at round t by averaging the parameters of the local model. This may be referred to as federated averaging; para. [0070, 0089-0091], federated training process in which local updates are repeatedly produced and aggregated, and the server updates the global model “at round t”) comprises: obtaining at least one first machine learning model (i.e. receives a neural network model from a server; para. [0092]) corresponding to a first client device, wherein the first client device is one of a plurality of client devices communicably coupled to the server (i.e. receiving a neural network model from a server. The neural network model is collaboratively trainable across multiple clients via a set of specialized neural network models; para. [0010]), and the at least one first machine learning model corresponds to a data feature of a first data set stored in the first client device (i.e.
At block 706, the method 700 selects one or more of the specialized models based in part on a characteristic associated with the local dataset. A client or user has a different set of parameters and may select which experts to use based on local data; para. [0093-0095], candidate models being selected using that local data characteristic); sending the at least one first machine learning model to the first client device (i.e. FIG. 7 is a flow diagram illustrating a method 700 for generating a personalized neural network model, according to aspects of the present disclosure. At block 702, the method 700 receives a neural network model from a server. The neural network model is collaboratively trainable across multiple clients via a set of specialized neural network models. Each specialized neural network is associated with a subset of a first dataset. A client or user may receive a neural network model at a local device such as a smartphone; para. [0092]), wherein the at least one first machine learning model indicates the first client device to perform a training operation on the at least one first machine learning model by using the first data set, to obtain at least one trained first machine learning model (i.e. At block 708, the method 700 generates a personalized model by fine tuning the neural network model based the selected one or more specialized models and the local dataset; para. [0070, 0095], fine-tuning/multiple gradient updates on the local dataset are training operations producing a trained local model); and receiving, from the first client device, at least one updated neural network module comprised in the at least one trained first machine learning model, and updating weight parameters of the stored plurality of modules based on the at least one updated neural network module (i.e. At block 604, the method 600 computes a global update for the neural network model based on the local updates from the subset of the multiple users. 
In some aspects, the global update is computed by aggregating the local updates received from the subset of the multiple users. In some aspects, the neural network model comprises multiple independent neural network models; para. [0070, 0082, 0090], the server aggregates client updates and updates parameters (weights) accordingly).

Claim 13: Reisser teaches the method according to claim 12. Reisser further teaches wherein a plurality of modules are stored in the server (i.e. the central server may aggregate the expert specific updates to generate a global update; para. [0082]) and configured to construct at least two second machine learning models (i.e. instead of learning a single global model, S individual models are learned; para. [0089]), and the at least one first machine learning model is selected from the at least two second machine learning models (i.e. the method includes selecting one or more of the specialized models based on a characteristic associated with the local dataset; para. [0010]).

Claim 15: Reisser teaches the method according to claim 12. Reisser further teaches wherein the machine learning model is a neural network (i.e. each of the experts may be implemented as a separate, independent artificial neural network; para. [0089]), the plurality of modules stored in the server are neural network modules (i.e. The expert specific updates may be supplied to a central server; para. [0082]), and the method further comprises: receiving first identification information sent by the first client device (i.e. The method includes receiving a local update of the neural network model from a subset of multiple users. Each of the local updates is related to one or more subsets of a dataset and includes an indication of the one or more subsets of the dataset to which each local update relates; para.
[0006]), wherein the first identification information is identification information of a first neural network or a neural network module that constructs the first neural network (i.e. The method includes receiving a local update of the neural network model from a subset of multiple users. Each of the local updates is related to one or more subsets of a dataset and includes an indication of the one or more subsets of the dataset to which each local update relates; para. [0006]); and the sending the at least one first machine learning model to the first client device comprises: sending, to the first client device, the first neural network to which the first identification information points, or the neural network module that constructs the first neural network and to which the first identification information points (i.e. At block 606, the method 600 transmits the global update to the subset of the multiple users. For example, as described, the global parameters of an expert are trained using all data points assigned to that expert across all shards to enable learning more robust features; para. [0081, 0091]).

Claims 19-20 are similar in scope to Claims 1 and 12 and are rejected under a similar rationale.

Claim Rejections – 35 USC § 103

7. The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

8. Claims 3 and 14 are rejected under 35 U.S.C.
103 as being unpatentable over Reisser et al. (U.S. Patent Application Pub. No. US 20230118025 A1) in view of Ghanta et al. (U.S. Patent Application Pub. No. US 20190377984 A1).

Claim 3: Reisser teaches the method according to claim 2. Reisser further teaches wherein the machine learning model is a neural network, the plurality of modules stored in the server are neural network modules (i.e. The method includes receiving a neural network model from a server. The neural network model is collaboratively trainable across multiple clients via a set of specialized neural network models; para. [0010]), the first client device stores a first adaptation relationship (i.e. expert models learn to specialize in regions of the input space such that for a given expert, each client’s progress on that expert is aligned. Each client learns which experts are relevant for its shard or portion of the data; para. [0068]), the first adaptation relationship comprises a plurality of adaptation values (i.e. a mixture of experts may model a data set where different subsets of the data exhibit different relationships between input x and output y. Rather than training a single global model to fit this relationship at each client throughout the network, each expert k performs on a different subset of the input space. In some aspects, each expert may specialize on a region of the data set D; para. [0092]), and each of the adaptation values between the first data set and a second neural network (i.e. expert models learn to specialize in regions of the input space such that for a given expert, each client’s progress on that expert is aligned. Each client learns which experts are relevant for its shard or portion of the data; para. [0068, 0085]); before the obtaining at least one first machine learning module, the method further comprises: receiving the plurality of modules sent by the server (i.e. block 702, the method 700 receives a neural network model from a server.
The neural network model is collaboratively trainable across multiple clients via a set of specialized neural network models; para. [0092]); and the obtaining at least one first machine learning model comprises: selecting at least one first neural network from at least two second neural networks based on the first adaptation relationship (i.e. At block 706, the method 700 selects one or more of the specialized models based in part on a characteristic associated with the local dataset. A client or user has a different set of parameters and may select which experts to use based on local data; para. [0093-0095]).

Reisser does not explicitly teach the adaptation values indicate an adaptation degree, selecting at least one first neural network from at least two second neural networks based on the first adaptation relationship, wherein the at least one first neural network comprises a neural network with a high adaptation value with the first data set.

However, Ghanta teaches the first client device stores a first adaptation relationship, the first adaptation relationship comprises a plurality of adaptation values (i.e. A suitability score that satisfies (is equal to or greater than) a suitability threshold indicates that the training data set and the machine learning model trained with the training data set are suitable, accurate, or the like for the inference data set; para. [0095]), and each of the adaptation values indicates an adaptation degree between the first data set and a second neural network (i.e. A suitability score that satisfies (is equal to or greater than) a suitability threshold indicates that the training data set and the machine learning model trained with the training data set are suitable, accurate, or the like for the inference data set; para.
[0095]); selecting at least one first neural network from at least two second neural networks based on the first adaptation relationship, wherein the at least one first neural network comprises a neural network with a high adaptation value with the first data set (i.e. the action module 308 is configured to perform an action related to the machine learning system 200 in response to the suitability score satisfying (is equal to or less than) an unsuitability threshold, or in response to the suitability score not satisfying (e.g., is not equal to or greater than) a suitability threshold, which may be the same value as the unsuitability threshold. For example, the score module 306 may calculate a suitability score of 0.75 (or 0.25 unsuitability score) on a scale of 0 to 1, and if the unsuitability threshold is 0.3, or if the suitability threshold is 0.7, then the action module 308 may determine that the training data set, and by extension the machine learning model, is suitable for the particular inference data set; para. [0111, 0112]). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Reisser to include the feature of Ghanta. One would have been motivated to make this modification because it improves robustness by detecting mismatch between training characteristics and local data and using that to guide expert selection in a modular ensemble.

Claim 14: Reisser teaches the method according to claim 13. Reisser further teaches wherein the plurality of modules stored in the server are neural network modules (i.e. each of the experts may be implemented as a separate, independent artificial neural network; para. [0089]), the first client device stores a first adaptation relationship (i.e. a mixture of experts may model a data set where different subsets of the data exhibit different relationships between input x and output y.
Rather than training a single global model to fit this relationship at each client throughout the network, each expert k performs on a different subset of the input space. In some aspects, each expert may specialize on a region of the data set D; para. [0092]), the first adaptation relationship comprises a plurality of adaptation values (i.e. expert models learn to specialize in regions of the input space such that for a given expert, each client’s progress on that expert is aligned. Each client learns which experts are relevant for its shard or portion of the data; para. [0068, 0085]), and the method further comprises: receiving an adaptation value that is between the first data set and at least one second neural network and that is sent by the first client device (i.e. The expert specific updates may be supplied to a central server; para. [0082]); and updating a second adaptation relationship (i.e. the central server may aggregate the expert specific updates to generate a global update; para. [0082]). Reisser does not explicitly teach the first adaptation relationship comprises a plurality of adaptation values, and each of the adaptation values indicates an adaptation degree between the first data set and a second neural network; the obtaining at least one first neural network comprises: selecting at least one first neural network from a plurality of second neural networks based on the second adaptation relationship, wherein the at least one first neural network comprises a neural network with a high adaptation value with the first data set. However, Ghanta teaches the first adaptation relationship comprises a plurality of adaptation values, and each of the adaptation values indicates an adaptation degree between the first data set and a second neural network (i.e. 
A suitability score that satisfies (is equal to or greater than) a suitability threshold indicates that the training data set and the machine learning model trained with the training data set are suitable, accurate, or the like for the inference data set; para. [0095]); the obtaining at least one first neural network comprises: selecting at least one first neural network from a plurality of second neural networks based on the second adaptation relationship, wherein the at least one first neural network comprises a neural network with a high adaptation value with the first data set (i.e. the action module 308 is configured to perform an action related to the machine learning system 200 in response to the suitability score satisfying (is equal to or less than) an unsuitability threshold, or in response to the suitability score not satisfying (e.g., is not equal to or greater than) a suitability threshold, which may be the same value as the unsuitability threshold. For example, the score module 306 may calculate a suitability score of 0.75 (or 0.25 unsuitability score) on a scale of 0 to 1, and if the unsuitability threshold is 0.3, or if the suitability threshold is 0.7, then the action module 308 may determine that the training data set, and by extension the machine learning model, is suitable for the particular inference data set; para. [0111, 0112]). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Reisser to include the feature of Ghanta. One would have been motivated to make this modification because it improves robustness by detecting mismatch between training characteristics and local data and using that to guide expert selection in a modular ensemble.

9. Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Reisser in view of Ghanta, and further in view of Shen et al. (U.S. Patent Application Pub. No. US 20210393229 A1).
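The Ghanta passages cited in the rejections of claims 3 and 14 above describe threshold-based model selection: a suitability score at or above a threshold (Ghanta's worked example is 0.75 against a 0.7 threshold) marks a candidate model as suitable for the local data. A minimal sketch of that selection logic, with all names and the candidate list being hypothetical:

```python
# Hypothetical sketch of Ghanta-style suitability-threshold selection:
# keep only candidate models whose score meets the suitability threshold.

def select_suitable(candidates, threshold=0.7):
    """Return names of candidates whose suitability score >= threshold."""
    return [name for name, score in candidates if score >= threshold]

# Illustrative candidates; 0.75 vs 0.7 mirrors Ghanta's worked example.
candidates = [("expert_a", 0.75), ("expert_b", 0.40), ("expert_c", 0.90)]
print(select_suitable(candidates))  # ['expert_a', 'expert_c']
```

In the combination the examiner proposes, this score would stand in for the claimed "adaptation value" used to pick first neural networks from the candidate second neural networks.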
Claim 4: Reisser and Ghanta teach the method according to claim 3. Reisser further teaches a first loss function (i.e. Such clients may perform a series of mini-batch gradient updates with the data from their shard Ds on a local loss function, which may involve each client moving in possibly different directions in the parameter space; para. [0045, 0081]); and the first loss function indicates a similarity between a prediction result of first data and a correct result of the first data (i.e. an error may be calculated between the output 222 and a target output. The target output is the ground truth of the image 226 (e.g., “sign” and “60”); para. [0045]), the prediction result of the first data is obtained based on the second neural network (i.e. a forward pass may then be computed to produce an output 222; para. [0042-0044]), and the first data and the correct result of the first data are obtained based on the first data set (i.e. federated learning involves learning a server model with parameters W such as a neural network with a data set of N data points D = {(x1,y1), ..., (xN,yN)} that is distributed across shards S or portions; para. [0069, 0070]). Reisser does not explicitly teach wherein an adaptation value corresponds to a function value of a first loss function, and a smaller function value of the first loss function indicates a larger adaptation value between the first data set and the second network. However, Shen teaches wherein an adaptation value between the first data set and the second neural network corresponds to a function value of a first loss function, and a smaller function value of the first loss function indicates a larger adaptation value between the first data set and the second neural network (i.e. the best checkpoint model with the smallest validation loss is selected as final model 120; para. [0046]). 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combination of Reisser and Ghanta to include the feature of Shen. One would have been motivated to make this modification because it provides a concrete basis for quantifying model-to-data fit and selecting the most suitable candidate model for a client dataset.

10. Claims 5 and 6 are rejected under 35 U.S.C. 103 as being unpatentable over Reisser in view of Ghanta, and further in view of Szeto et al. (U.S. Patent Application Pub. No. US 20180018590 A1).

Claim 5: Reisser and Ghanta teach the method according to claim 3. Reisser does not explicitly teach wherein an adaptation value between the first data set and the second neural network corresponds to a first similarity between the second neural network and a third neural network, a larger first similarity indicates a larger adaptation value between the first data set and the second neural network, and the third neural network is a neural network with highest accuracy of outputting a prediction result in a previous round of iteration. However, Szeto teaches wherein an adaptation value between the first data set and the second neural network corresponds to a first similarity between the second neural network and a third neural network (i.e. The similarity between trained proxy model 270 and trained actual model 240 can be measured through various techniques by modeling engine 226 calculating model similarity score 280 as a function of proxy model parameters 275 and actual model parameters 245. The resulting model similarity score 280 is a representation of how similar the two models are, at least to within similarity criteria; para.
[0079]), a larger first similarity indicates a larger adaptation value between the first data set and the second neural network, and the third neural network is a neural network with highest accuracy of outputting a prediction result in a previous round of iteration (i.e. Operation 680, similar to operation 560 of FIG. 6, includes the global modeling engine calculating a model similarity score of the trained proxy model relative to the trained actual model(s) as a function of the proxy model parameters and the actual proxy model parameters. The actual proxy model parameters can be obtained along with the salient private data features as discussed with respect to operation 640 or could be obtained upon sending a request to the proxy data's modeling engine. Should the model similarity score fail to satisfy similarity requirements, then the global modeling engine can repeat operations 660 through 680 until a satisfactorily similar trained proxy model is generated; para. [0118]). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combination of Reisser and Ghanta to include the feature of Szeto. One would have been motivated to make this modification because it provides a concrete way to quantify how closely a candidate neural network matches a reference neural network and to use that similarity as an adaptation value for selecting candidate neural networks.

Claim 6: Reisser, Ghanta, and Szeto teach the method according to claim 5.
Reisser does not explicitly teach wherein the first similarity between the second neural network and the third neural network is determined based on: inputting same data to the second neural network and the third neural network, and comparing a similarity between output data of the second neural network and output data of the third neural network; or calculating a similarity between a weight parameter matrix of the second neural network and a weight parameter matrix of the third neural network. However, Szeto further teaches wherein the first similarity between the second neural network and the third neural network is determined based on: inputting same data to the second neural network and the third neural network, and comparing a similarity between output data of the second neural network and output data of the third neural network; or calculating a similarity between a weight parameter matrix of the second neural network and a weight parameter matrix of the third neural network (i.e. Operation 680, similar to operation 560 of FIG. 6, includes the global modeling engine calculating a model similarity score of the trained proxy model relative to the trained actual model(s) as a function of the proxy model parameters and the actual proxy model parameters. The actual proxy model parameters can be obtained along with the salient private data features as discussed with respect to operation 640 or could be obtained upon sending a request to the proxy data's modeling engine. Should the model similarity score fail to satisfy similarity requirements, then the global modeling engine can repeat operations 660 through 680 until a satisfactorily similar trained proxy model is generated; para. [0118]). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combination of Reisser and Ghanta to include the feature of Szeto. 
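For illustration only (not part of the record, and not drawn from any cited reference), the two similarity alternatives recited in claim 6 can be sketched in a few lines of Python; the model shapes and values below are hypothetical:

```python
import numpy as np

def output_similarity(f2, f3, probe_inputs):
    """Claim 6, first alternative: input the same data to both networks and
    compare their outputs. Higher value means more similar outputs."""
    out2 = np.stack([f2(x) for x in probe_inputs])
    out3 = np.stack([f3(x) for x in probe_inputs])
    # Map mean squared output difference into (0, 1]
    return 1.0 / (1.0 + np.mean((out2 - out3) ** 2))

def weight_similarity(W2, W3):
    """Claim 6, second alternative: cosine similarity between the flattened
    weight parameter matrices of the two networks."""
    v2, v3 = W2.ravel(), W3.ravel()
    return float(v2 @ v3 / (np.linalg.norm(v2) * np.linalg.norm(v3)))

# Toy linear "networks" with the same weight-matrix shape
W2 = np.array([[1.0, 0.0], [0.0, 1.0]])
W3 = np.array([[0.9, 0.1], [0.1, 0.9]])
probes = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
s_out = output_similarity(lambda x: W2 @ x, lambda x: W3 @ x, probes)
s_w = weight_similarity(W2, W3)
```

Either score could then serve as the "first similarity" feeding the adaptation value discussed for claim 5.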
One would have been motivated to make this modification because it provides a concrete way to quantify how closely a candidate neural network matches a reference neural network and to use that similarity as an adaptation value for selecting candidate neural networks. 11. Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over Reisser in view of Vierck et al. (U.S. Patent Application Pub. No. US 20230316072 A1). Claim 8: Reisser teaches the method according to claim 1. Reisser further teaches wherein the machine learning model is a neural network (i.e. neural network model; para. [0089]), and the plurality of modules stored in the server are neural network modules (i.e. Each of the K experts specialized on a region of the input dataset D. Each of the experts may be implemented as a separate, independent artificial neural network, for example. Each of the K experts may correspond to one or more of the S models; para. [0082, 0089]), and wherein after the obtaining at least one first machine learning model, the method further comprises: calculating an adaptation value between the first data set and each of at least one first neural network (i.e. A gating function controls selection of an expert for given data point of the input dataset; para. [0089]), wherein the first data set comprises a plurality of pieces of first training data (i.e. A mixture of expert models for data point (x, y); para. [0072]). Reisser does not explicitly teach a larger adaptation value between the first training data and each of the at least one first neural network indicates a greater degree of modification of a weight parameter of the first neural network in a process of training the first neural network by using the first training data. However, Vierck teaches a larger adaptation value between the first training data and each of the at least one first neural network indicates a greater degree of modification of a weight parameter (i.e. 
sample weighting can include telling the model to increase or decrease an amount of loss produced by a given piece of example data … the computed loss values are multiplied by the weights to determine with a set of final loss values; para. [0042, 0043]) of the first neural network in a process of training the first neural network by using the first training data (i.e. the new set of computed loss values to the set of weights such the set of weights is changed from the first state into a second state; para. [0011]). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Reisser to include the feature of Vierck. One would have been motivated to make this modification because it provides predictable mechanisms for controlling the influence of individual training samples on parameter updates. 12. Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Reisser in view of Vierck, and further in view of Balachandar et al. (U.S. Patent Application Pub. No. US 20210049473 A1). Claim 9: Reisser and Vierck teach the method according to claim 8. Reisser further teaches wherein the calculating an adaptation value between the first data set and each of the at least one first neural network comprises: clustering the first data set to obtain at least two data subsets, wherein a first data subset is a subset of the first data set, and the first data subset is one of the at least two data subsets (i.e. The mixture of experts may model a dataset where different subsets of the data exhibit different relationships between input x and output y; para. [0073, 0074]); and generating an adaptation value between the first data subset and one first neural network based on the first data subset and a first loss function (i.e. This objective corresponds to empirical risk minimization over the joint data set D with a loss L(·) for each data point; para. 
[0070]), wherein the first loss function indicates a similarity between a prediction result of first data and a correct result of the first data, the prediction result of the first data is obtained based on the first neural network (i.e. The mixture of experts may model a dataset where different subsets of the data exhibit different relationships between input x and output y; para. [0070, 0072, 0073]), the first data and the correct result of the first data are obtained based on the first data subset, and the adaptation value between the first data subset and the first neural network is determined as an adaptation value between each piece of data in the first data subset and the first neural network (i.e. The gating function models the decision boundary between input regions, assigning data points from subsets of the input region to their respective experts; para. [0070-0073]). Reisser does not explicitly teach wherein a smaller function value of the first loss function indicates a larger adaptation value between the first data subset and the first neural network. However, Balachandar teaches wherein a smaller function value of the first loss function indicates a larger adaptation value between the first data subset and the first neural network (i.e. some embodiments terminate model learning early, if an amount of iterations or epochs pass without an improvement in validation loss (e.g., model learning terminates if 4000 iterations and/or 20 epochs pass without an improvement in validation loss); para. [0042]). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combination of Reisser and Vierck to include the feature of Balachandar. One would have been motivated to make this modification because it ensures the adaptation value directly reflects real predictive quality of the neural network on the client’s data. 13. Claim 10 is rejected under 35 U.S.C. 
103 as being unpatentable over Reisser in view of Fukuda et al. (U.S. Patent Application Pub. No. US 20200034703 A1). Claim 10: Reisser teaches the method according to claim 1. Reisser further teaches wherein the plurality of modules stored in the server are neural network modules (i.e. Each of the experts may be implemented as a separate, independent artificial neural network; para. [0072]), and the performing a training operation on the at least one first machine learning model by using the first data set (i.e. multiple gradient updates for parameters w in the inner optimization of objective may be performed for each shard S, thus obtaining local models with parameters Ws; para. [0070]) comprises: performing a training operation on each of the at least one first neural network based on a second loss function by using the first data set (i.e. multiple gradient updates for parameters w in the inner optimization of objective may be performed for each shard S, thus obtaining local models with parameters Ws; para. [0070]), wherein the first data set comprises a plurality of pieces of first training data (i.e. A mixture of expert models for data point (x, y); para. [0072]), wherein the second loss function indicates a similarity between a first prediction result and a correct result of the first training data (i.e. an error may be calculated between the output 222 and a target output; para. [0045, 0046]); the first prediction result is a prediction result that is of the first training data and that is output by the first neural network after the first training data is input into the first neural network (i.e. During training, the DCN 200 may be presented with an image, such as the image 226 of a speed limit sign, and a forward pass may then be computed to produce an output 222; para. [0042]). 
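The claim 9 relationship discussed above (a smaller first-loss value implying a larger adaptation value) and the error calculation quoted from Reisser (an error between an output and a target) can be illustrated with a short sketch. This is illustrative only; the exponential mapping and the toy values are assumptions, not drawn from any cited reference:

```python
import math

def adaptation_value(loss: float) -> float:
    """Map a non-negative loss to an adaptation value in (0, 1].
    Smaller loss -> larger adaptation value (monotonically decreasing)."""
    return math.exp(-loss)

def mse_loss(predictions, targets):
    """Mean squared error between prediction results and correct results."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

# A network that fits the data subset well receives a larger adaptation value
good_fit = adaptation_value(mse_loss([0.9, 0.1], [1.0, 0.0]))
poor_fit = adaptation_value(mse_loss([0.2, 0.8], [1.0, 0.0]))
assert good_fit > poor_fit
```

Any monotonically decreasing mapping from loss to adaptation value would satisfy the recited relationship; the exponential is merely one convenient choice.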
Reisser does not explicitly teach wherein the second loss function further indicates a similarity between the first prediction result and a second prediction result; the second prediction result is a prediction result that is of the first training data and that is output by a fourth neural network after the first training data is input into the fourth neural network; and the fourth neural network is a first neural network on which no training operation is performed. However, Fukuda teaches wherein the second loss function further indicates a similarity between the first prediction result and a second prediction result (i.e. the student training section 150 may train the student neural network, at block 340, such that soft label errors between (1) a soft label output generated by the student neural network in response to receiving the teacher input data (e.g., Input Data 1) and (2) the soft label output generated by the selected teacher neural network (e.g., Teacher NN1) in response to receiving the same teacher input data, is minimized; para. [0050, 0053]); the second prediction result is a prediction result that is of the first training data and that is output by a fourth neural network after the first training data is input into the fourth neural network (i.e. inputting the input data into the teacher neural network, comparing an output data (e.g., a soft label output) of the teacher neural network with the corresponding correct training data; para. [0038]); and the fourth neural network is a first neural network on which no training operation is performed (i.e. The student training section 150 may train a student neural network. The student training section 150 may train the student neural network with at least the teacher input data and the plurality of soft label outputs output from the plurality of teacher neural networks; para. [0025]). 
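The structure of the claim 10 "second loss function" (one term comparing the prediction to the correct result, another comparing it to a second network's prediction on the same data, as in Fukuda's soft-label training) can be sketched as follows. The weighting `alpha` and all values are hypothetical, for illustration only:

```python
import math

def cross_entropy(p, q):
    """H(p, q) for discrete distributions given as lists of probabilities."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def second_loss(pred, label, ref_pred, alpha=0.5):
    """Combined loss: similarity of `pred` to the correct result `label`,
    plus similarity of `pred` to `ref_pred`, the prediction of another
    (reference) network for the same training data."""
    return alpha * cross_entropy(label, pred) + (1 - alpha) * cross_entropy(ref_pred, pred)

label = [1.0, 0.0]      # one-hot correct result of the first training data
ref_pred = [0.7, 0.3]   # prediction of the reference (fourth) network
loss = second_loss([0.8, 0.2], label, ref_pred)
```

A prediction that agrees with both the ground truth and the reference prediction yields a smaller combined loss than one that agrees with neither.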
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Reisser to include the feature of Fukuda. One would have been motivated to make this modification because training against the prediction of a reference network, in addition to the correct result, provides a predictable regularizing signal for the network being trained. 14. Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Reisser in view of Malik et al. (U.S. Patent Application Pub. No. US 20240112008 A1). Claim 11: Reisser teaches the method according to claim 7. Reisser further teaches wherein the first data set comprises a plurality of pieces of first training data (i.e. A mixture of expert models for data point (x, y); para. [0072]) and a correct result of each piece of first training data (i.e. The target output is the ground truth of the image 226 (e.g., “sign” and “60”); para. [0045]), and the method further comprises: wherein the selector is a neural network configured to select, from the plurality of modules, a at least one first neural network module that matches the data feature of the first data set (i.e. A gating function controls selection of an expert for given data point of the input dataset D; para. [0073, 0074, 0089]); the performing a training operation on the at least one first machine learning model by using the first data set comprises: inputting the first training data into the selector to obtain the indication information output by the selector, wherein the indication information comprises the probability that each of the plurality of modules is selected, and the indication information indicates the neural network module that constructs the first neural network (i.e. the robustness of the expert’s features may serve as conditions for the gating function rather than training an entirely separate model for pθs (x|s). 
Given a set of intermediary features hs(x) of expert k, a local vector πs ∈ ℝK, with which the intermediate features are averaged before applying a linear transformation to compute the input to the softmax gates, which may scale with the number of experts, where θs = (πs, As, bs) are local learnable parameters and SM represents the softmax function; para. [0072, 0073, 0085]); obtaining, based on the plurality of modules, the indication information and the first training data, a prediction result of the first training data and output by the first neural network (i.e. Each of the experts may be implemented as a separate, independent artificial neural network, for example. Each of the experts may be trained to determine a prediction for its designated region; para. [0072, 0073, 0089]); performing a training operation on the first neural network and the selector based on a third loss function (i.e. Fine-tuning may be performed by optimizing equation 5 for a small number of epochs (e.g., E = 1) with respect to w1:K, ϕ, and θs; para. [0080]), wherein the third loss function indicates a similarity between the prediction result of the first training data and a correct result (i.e. an error may be calculated between the output 222 and a target output; para. [0045, 0073, 0074]), and further indicates a dispersion degree of the indication information (i.e. to avoid prematurely pruning of experts and preserve model capacity, a marginal entropy term in the server H(Ep(y)[qϕ(z|y)]) may be included as a regularizer that encourages using all of the experts; para. [0079]). Reisser does not explicitly teach receiving the selector sent by the server, sending a trained selector to the server. However, Malik teaches receiving the selector sent by the server, sending a trained selector to the server (fig. 
8, in a particular round of model training, server 510 may select a subset of client systems 130 including client system 130a to train the global neural network model 820 together with the respective local personalization models 820 of each selected client system 130. Client system 130a may then receive a current version of the global neural network model 820a from server 510. Client system 130a may then retrieve the plurality of examples 530a from the local data store of client system 130a. Client system 130a may then train the received global neural network model 820a together with the local personalization model 830a on the pluralities of examples 530a to generate a plurality of updated federated model parameters and a plurality of updated local model parameters. Client system 130a may then store in the local data store the trained local personalization model 830a including the updated local model parameters. Client system 130a may then send the trained global neural network model 820a including the updated federated model parameters to server 510 without sending any of the examples 530a to server 510; para. [0119]). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Reisser to include the feature of Malik. One would have been motivated to make this modification because exchanging the trainable selector between client and server allows the selector to be personalized on client data while remaining synchronized across federated training rounds. 15. Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over Reisser in view of Chen et al. (A Practical Algorithm for Distributed Clustering and Outlier Detection, arXiv, published 2018, pages 1-15), and further in view of Tang et al. (Input partitioning to mixture of experts, IEEE, published 2002, pages 227-232). Claim 16: Reisser teaches the method according to claim 12. Reisser further teaches wherein the machine learning model is a neural network (i.e. 
each of the experts may be implemented as a separate, independent artificial neural network; para. [0089]), the plurality of modules stored in the server are neural network modules (i.e. The expert specific updates may be supplied to a central server; para. [0082]), the server is further configured with a selector (i.e. A gating function controls selection of an expert for given data point of the input dataset D; para. [0072]), and the method further comprises: determining, based on the indication information, a neural network module that constructs at least one first neural network (i.e. A mixture of experts may include a set of K experts. Each of the K experts specialized on a region of the input dataset D. A gating function controls selection of an expert for given data point of the input dataset D. Each of the experts may be implemented as a separate, independent artificial neural network, for example. Each of the experts may be trained to determine a prediction for its designated region. Thus the gating function determines for each data point of input dataset D, an expert for determining a prediction for the data point; para. [0072]), wherein the indication information comprises a probability that each of the plurality of modules is selected (i.e. A mixture of expert models for data point (x, y) may be described by: probability equation, where z is a categorical variable that denotes the expert, wk are the parameters of the expert k, and θ are the parameters of the gating function; para. [0072]); and the sending the at least one first machine learning model to the first client device comprises: sending, to the first client device, the neural network module that constructs the at least one first neural network (i.e. FIG. 7 is a flow diagram illustrating a method 700 for generating a personalized neural network model, according to aspects of the present disclosure. At block 702, the method 700 receives a neural network model from a server; para. [0092]). 
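The gating-function/selector behavior relied upon above — producing "indication information" comprising a probability that each stored module is selected — can be sketched minimally as a linear layer followed by a softmax. This is an illustrative assumption about one common gating form, not the specific model of any cited reference; all weights and inputs are hypothetical:

```python
import math

def softmax(scores):
    """Convert raw gating scores to selection probabilities."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gate(features, gate_weights):
    """Minimal gating function: a linear layer followed by softmax.
    Returns the 'indication information' -- the probability that each
    stored module (expert) is selected for the given input features."""
    scores = [sum(w * x for w, x in zip(row, features)) for row in gate_weights]
    return softmax(scores)

# Three stored modules, two-dimensional input features (toy values)
gate_weights = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
probs = gate([1.0, 0.2], gate_weights)
# The module with the highest probability constructs the first neural network
selected = max(range(len(probs)), key=probs.__getitem__)
```

Sending the selected module (rather than the full model) to the client device is then a straightforward lookup on `selected`.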
Reisser does not explicitly teach receiving at least one clustering center sent by the first client device; and performing a clustering operation on the first data set to obtain at least one data subset, wherein one clustering center in the at least one clustering center is a clustering center of one data subset in the at least one data subset; the obtaining at least one first machine learning model corresponding to the first client device comprises: inputting the clustering center into the selector to obtain indication information output by the selector. However, Chen teaches receiving at least one clustering center sent by the first client device (i.e. Each site constructs a summary of the local dataset using the k-means++ algorithm, and sends it to the coordinator; pages 1-11); and performing a clustering operation on the first data set to obtain at least one data subset, wherein one clustering center in the at least one clustering center is a clustering center of one data subset in the at least one data subset (i.e. Each site constructs a summary of the local dataset using the k-means++ algorithm, and sends it to the coordinator. The coordinator feeds the unions all summaries to k-means-- for a second level clustering; pages 1-11). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Reisser to include the feature of Chen. One would have been motivated to make this modification because it reduces communication and avoids direct access to client raw data while still capturing dataset structure. However, Tang teaches inputting the clustering center into the selector to obtain indication information output by the selector (i.e. the Potential Function method as applied here defines a region ‘center’ point in terms of corresponding SOM node. 
Each instance of these nodes is forwarded to the gating network of the MoE to provide the ‘global’ view necessary to establish expert interaction; Section II, pages 227-228), and determining, based on the indication information, a neural network module that constructs at least one first neural network (i.e. The network is composed of K experts and one gating network. Each expert is composed of M input nodes and one output node. The gating network is composed of M input nodes, and K output nodes, such that there is a single output for every expert; Section III, page 228). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combination of Reisser and Chen to include the feature of Tang. One would have been motivated to make this modification because it improves expert/module selection using partition-centroid signals in a MoE gating mechanism. 16. Claims 17 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Reisser in view of Liu et al. (U.S. Patent Application Pub. No. US 20170364799 A1). Claim 17: Reisser teaches the method according to claim 12. Reisser further teaches wherein the machine learning model is a neural network (i.e. each of the experts may be implemented as a separate, independent artificial neural network; para. [0089]), the plurality of modules stored in the server are neural network modules (i.e. The expert specific updates may be supplied to a central server; para. [0082]), one neural network is divided into at least two submodules (i.e. the DCN 200 may include a feature extraction section and a classification section; para. [0042]), the neural network modules stored in the server are divided into at least two groups corresponding to at least two submodules, and different neural network modules in a same group have a same function (i.e. each of the experts may be implemented as a separate, independent artificial neural network; para. 
[0089]), and wherein after the updating weight parameters of the stored neural network modules based on the at least one updated neural network module (i.e. Between each layer 356, 358, 360, 362, 364 of the deep convolutional network 350 are weights (not shown) that are to be updated; para. [0056, 0063]). Reisser does not explicitly teach calculating a similarity between neural network modules in at least two neural network modules comprised in a same group, and combining each two of the neural network modules with similarity greater than a preset threshold. However, Liu teaches calculating a similarity between neural network modules in at least two neural network modules comprised in a same group (i.e. based on the learnable parameters, the simplifying module 160 judges whether the operation executed by a first neuron can be merged into the operation executed by a second neuron. Once the first neuron is merged, one or more neuron connections connected to the first neuron is abandoned accordingly. The simplified neural network 200 in FIG. 2(C) is re-drawn in FIG. 3(A) as an example. First, based on the records in the memory 150, the simplifying module 160 tries to find out at least two weights conforming to both the following requirements: (1) corresponding to the same rear artificial neuron, and (2) having values close to each other (e.g. their difference falls in a predetermined small range); para. [0035]), and combining (i.e. the simplifying module 160 can merge the operation executed by the artificial neuron 121 into the operation executed by the artificial neuron 122; para. [0036]) each two of the neural network modules with similarity greater than a preset threshold (i.e. the simplifying module 160 further judges whether all the weights utilized in the computation of the preceding artificial neurons corresponding to the weights w4 and w5 are lower than a threshold T′; para. [0036]). 
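The claim 17 operation at issue — computing a similarity between same-function modules and combining pairs whose similarity exceeds a preset threshold — can be sketched as a greedy merge over weight matrices. The cosine-similarity measure, the averaging merge, and the threshold value are illustrative assumptions, not the specific technique of Liu or any other cited reference:

```python
import math

def weight_matrix_similarity(W_a, W_b):
    """Cosine similarity between the flattened weight matrices of two modules."""
    a = [w for row in W_a for w in row]
    b = [w for row in W_b for w in row]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def merge_group(modules, threshold=0.95):
    """Greedily combine modules in a same-function group whose pairwise
    similarity exceeds `threshold`; a merged module averages the weights."""
    merged = []
    for W in modules:
        for i, M in enumerate(merged):
            if weight_matrix_similarity(W, M) > threshold:
                # Combine into the existing module by element-wise averaging
                merged[i] = [[(m + w) / 2 for m, w in zip(mr, wr)]
                             for mr, wr in zip(M, W)]
                break
        else:
            merged.append(W)
    return merged

group = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.99, 0.01], [0.01, 0.99]],   # near-duplicate of the first module
    [[0.0, 1.0], [1.0, 0.0]],       # functionally different module
]
result = merge_group(group)
```

For claim 18's first alternative (comparing outputs on the same data rather than weight matrices), the similarity function would instead run both modules on probe inputs and compare the outputs, as in the claim 6 discussion above.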
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Reisser to include the feature of Liu. One would have been motivated to make this modification because it reduces redundancy among same-function modules, thereby lowering memory footprint and compute cost. Claim 18: Reisser and Liu teach the method according to claim 17. Reisser further teaches wherein the different neural network modules comprise a second neural network module and a first neural network module (i.e. each of the experts may be implemented as a separate, independent artificial neural network; para. [0089]). Reisser does not explicitly teach inputting same data to the second neural network module and the first neural network module, and comparing a similarity between output data of the second neural network module and output data of the first neural network module; or calculating a similarity between a weight parameter matrix of the second neural network module and a weight parameter matrix of the first neural network module. However, Liu further teaches inputting same data to the second neural network module and the first neural network module, and comparing a similarity between output data of the second neural network module and output data of the first neural network module; or calculating a similarity between a weight parameter matrix of the second neural network module and a weight parameter matrix of the first neural network module (i.e. based on the learnable parameters, the simplifying module 160 judges whether the operation executed by a first neuron can be merged into the operation executed by a second neuron. Once the first neuron is merged, one or more neuron connections connected to the first neuron is abandoned accordingly. The simplified neural network 200 in FIG. 2(C) is re-drawn in FIG. 3(A) as an example. 
First, based on the records in the memory 150, the simplifying module 160 tries to find out at least two weights conforming to both the following requirements: (1) corresponding to the same rear artificial neuron, and (2) having values close to each other (e.g. their difference falls in a predetermined small range); para. [0035, 0036]). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Reisser to include the feature of Liu. One would have been motivated to make this modification because it reduces redundancy among same-function modules, thereby lowering memory footprint and compute cost. Conclusion The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure. Moloney et al. (Pub. No. US 20190279082 A1), federated learning of CNN network weights by collecting large quantities of device-generated CNN network weights from a plurality of client devices and using the collected CNN network weights to generate an improved set of server-synchronized CNN network weights (e.g., server-synchronized weights) at a cloud server or other remote computing device that can access the device-generated CNN network weights. It is noted that any citation to specific pages, columns, lines, or figures in the prior art references and any interpretation of the references should not be considered to be limiting in any way. A reference is relevant for all it contains and may be relied upon for all that it would have reasonably suggested to one having ordinary skill in the art. In re Heck, 699 F.2d 1331, 1332-33, 216 U.S.P.Q. 1038, 1039 (Fed. Cir. 1983) (quoting In re Lemelson, 397 F.2d 1006, 1009, 158 U.S.P.Q. 275, 277 (C.C.P.A. 1968)). Any inquiry concerning this communication or earlier communications from the examiner should be directed to TAN TRAN whose telephone number is (303)297-4266. 
The examiner can normally be reached Monday through Thursday, 8:00 am - 5:00 pm MT. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Matt Ell, can be reached on 571-270-3264. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /TAN H TRAN/Primary Examiner, Art Unit 2141

Prosecution Timeline

Mar 17, 2023
Application Filed
May 02, 2023
Response after Non-Final Action
Jan 22, 2026
Non-Final Rejection — §102, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12594668
BRAIN-LIKE DECISION-MAKING AND MOTION CONTROL SYSTEM
2y 5m to grant Granted Apr 07, 2026
Patent 12579420
Analog Hardware Realization of Trained Neural Networks
2y 5m to grant Granted Mar 17, 2026
Patent 12579421
Analog Hardware Realization of Trained Neural Networks
2y 5m to grant Granted Mar 17, 2026
Patent 12572850
METHOD FOR IMPLEMENTING MODEL UPDATE AND DEVICE THEREOF
2y 5m to grant Granted Mar 10, 2026
Patent 12572326
DIGITAL ASSISTANT FOR MOVING AND COPYING GRAPHICAL ELEMENTS
2y 5m to grant Granted Mar 10, 2026
Based on this examiner's 5 most recent grants.


Prosecution Projections

1-2
Expected OA Rounds
60%
Grant Probability
92%
With Interview (+31.8%)
3y 6m
Median Time to Grant
Low
PTA Risk
Based on 307 resolved cases by this examiner. Grant probability derived from career allow rate.
