DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This action is made non-final.
Claims 1-15 are pending. Claims 1, 8 and 13 are independent claims.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-3, 8, 9, and 13-15 are rejected under 35 U.S.C. 103 as being unpatentable over Garg et al. (US 20200027033 A1), herein Garg, in view of Chen et al. (US 20150370844 A1), herein Chen, and Choudhary et al. (US 20190385043 A1), herein Choudhary.
Regarding claim 1, Garg teaches: A method for semi-asynchronous federated learning, comprising: sending, by a computing node, a first parameter to some or all of K subnodes in a tth round of iteration (Abstract, The global parameters are sent from the global server to the plurality of edge servers at each iteration, thereby synchronizing the machine learning model on the edge servers), wherein the first parameter comprises a first global model and a first timestamp t – 1, the first global model is a global model generated by the computing node in a (t – 1)th round of iteration, t is an integer greater than or equal to 1 (¶72, Additionally or alternatively, parameter receiving module 126 receives global parameters of a machine learning model in a payload portion of a data packet. A header portion of the data packet can indicate any suitable information, such as a timestamp of when the global parameters were generated by global server 108, an iteration number. Note that this data packet can also include information about the weights used to modify the local model parameter updates – ¶72, can indicate any suitable information, such as… statistics about the global parameters, such as data regarding weights and edge servers that provided parameter updates contributing to updating the global parameters – and ¶74 clarifies that these “weights” are the weights used in the weighted-average update aggregation process: along with any suitable information, such as an iteration number, weights used to update the global parameters by global server system), and the K subnodes are subnodes that participate in model training (Abstract, Local parameters of the machine learning model are updated at a plurality of edge servers using fresh data on the edge servers); receiving, by the computing node in the tth round of iteration, a second parameter sent by at least one subnode (¶81, Parameter sending module 130 is representative of functionality configured to send updated local parameters to a global server), wherein the second parameter comprises a first local model and a first version number t', the first version number t' indicates that the first local model is generated by the subnode through training, based on a local dataset, a global model received in a (t' + 1)th round of iteration (¶83, Updated local parameters sent by parameter sending module 130, along with any suitable information, such as an iteration number)… fusing, by the computing node according to a model fusion algorithm, m received first local models… to generate a second global model (¶25, In one example, the global server updates parameters of the machine learning model from a weighted average of the parameter updates received from edge servers to form updated global parameters of the machine learning model), and updating the first timestamp t – 1 to a second timestamp t, wherein m is an integer greater than or equal to 1 and less than or equal to K (¶72, parameter receiving module 126 receives global parameters of a machine learning model in a payload portion of a data packet. A header portion of the data packet can indicate any suitable information, such as a timestamp of when the global parameters were generated by global server 108, an iteration number, statistics about the global parameters – after the global model parameters are set, the timestamp is included in the outgoing parameter delivery to the local devices, i.e., the subnodes); and sending, by the computing node, a third parameter to some or all subnodes of the K subnodes in a (t + 1)th round of iteration, wherein the third parameter comprises the second global model and the second timestamp t (¶26, Each iteration of the update process, the global server sends updated global parameters to each of the edge servers, thereby synchronizing the machine learning model on each of the edge servers).
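For illustration, a minimal Python sketch of the operations mapped above, assuming numpy; the function and message names are hypothetical and are not Garg's actual implementation:

```python
# Minimal sketch (hypothetical names): weighted-average fusion of local
# parameters (Garg ¶25) and a timestamped broadcast message (Garg ¶72).
import numpy as np

def fuse_weighted_average(local_models, weights):
    # Weighted average of m local parameter vectors (Garg ¶25).
    w = np.asarray(weights, dtype=float)
    stacked = np.stack(local_models)        # shape: (m, num_params)
    return np.average(stacked, axis=0, weights=w)

def first_parameter(global_model, t):
    # Payload carries the first global model; the header carries the
    # first timestamp t - 1 and an iteration number (Garg ¶72).
    return {"payload": global_model,
            "header": {"timestamp": t - 1, "iteration": t}}
```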
Garg fails to teach: the first version number is determined by the subnode based on a timestamp received in the (t' + 1)th round of iteration, t' + 1 is greater than or equal to 1 and less than or equal to t, and t' is a natural number.
However, in the same field of endeavor, Chen teaches: the first version number is determined by the subnode based on a timestamp received in the (t' + 1)th round of iteration, t' + 1 is greater than or equal to 1 and less than or equal to t, and t' is a natural number (¶115, The monotonically increasing version numbers may be assigned, for example, using a database server timestamp or a continuously increasing integer value each time an entity is updated).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to determine version numbers based on timestamps as disclosed by Chen in the method disclosed by Garg to reduce conflicts between data versions (¶23, Client devices storing local copies of databases are able to keep the local copies updated and commit changes to the remote versions of the databases in a manner designed to reduce overhead for both client devices and database servers, while also reducing the risk of errors and false conflicts).
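For illustration, a minimal sketch of Chen's idea (¶115) as mapped: a version number assigned from a received server timestamp, so versions increase monotonically. The Subnode class and message layout are hypothetical:

```python
# Minimal sketch (hypothetical names): the subnode derives its version
# number from the timestamp received with the global model (Chen ¶115).
class Subnode:
    def __init__(self):
        self.version = 0  # corresponds to t' in the claim language

    def on_receive(self, message):
        # The timestamp received in round t' + 1 becomes the version number
        # t' attached to the local model this subnode later uploads.
        self.version = message["header"]["timestamp"]
        return message["payload"]
```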
Garg in view of Chen fails to teach: when a first threshold is reached.
However, in the same field of endeavor, Choudhary teaches: when a first threshold is reached (¶108, the asynchronous training system 106 generates adjusted global parameters for a global machine learning model after receiving modified parameter indicators from a threshold number of client devices).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to update based on a threshold as disclosed by Choudhary in the method disclosed by Garg in view of Chen to speed up training (¶119, expedites training iterations by preventing the asynchronous training system 106 from waiting for all client devices K to send parameter update differentials before adjusting global parameters).
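Putting the three teachings together as mapped above, one round of the claimed method can be sketched as follows; this is a hypothetical reading of the combination (Garg ¶25/¶72, Chen ¶115, Choudhary ¶108), not any reference's actual code:

```python
# Minimal sketch (hypothetical names): broadcast (model, timestamp), collect
# version-tagged local models, and fuse once a threshold is reached.
import numpy as np

def run_round(global_model, t, incoming, count_threshold):
    # incoming yields (local_model, version) pairs as subnodes report in.
    received = []
    for local_model, version in incoming:
        received.append((local_model, version))
        if len(received) >= count_threshold:  # Choudhary's first threshold
            break
    models = np.stack([m for m, _ in received])
    second_global_model = models.mean(axis=0)  # equal-weight fusion for brevity
    return second_global_model, t              # second global model, timestamp t
```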
Regarding claim 2, Garg in view of Chen fails to teach: The method according to claim 1, wherein the first threshold comprises a time threshold L and/or a count threshold N, N is an integer greater than or equal to 1, the time threshold L is a preset quantity of time units configured to upload a local model in each round of iteration, and L is an integer greater than or equal to 1.
However, in the same field of endeavor, Choudhary teaches: wherein the first threshold comprises a time threshold L and/or a count threshold N, N is an integer greater than or equal to 1, the time threshold L is a preset quantity of time units configured to upload a local model in each round of iteration, and L is an integer greater than or equal to 1 (¶108, the asynchronous training system 106 generates adjusted global parameters for a global machine learning model after receiving modified parameter indicators from a threshold number of client devices – and – ¶92, In particular, the asynchronous training system 106 waits for modified parameter indicators from the client devices 406d and 406e subject to a threshold time – ¶93, the threshold time may be any timeframe, including, but not limited to, thirty seconds, five minutes, or one hour).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to update based on a count threshold and/or a time threshold as disclosed by Choudhary in the method disclosed by Garg in view of Chen to speed up training while including slower devices (¶119, expedites training iterations by preventing the asynchronous training system 106 from waiting for all client devices K to send parameter update differentials before adjusting global parameters – and – ¶26, ensure that client devices with slower response times contribute to the resulting model, without unduly slowing the training process waiting for unresponsive client devices).
Regarding claim 3, Garg in view of Chen fails to teach, but Choudhary, in the same field of endeavor, teaches: The method according to claim 2, wherein the fusing, by the computing node according to a model fusion algorithm, m received first local models when a first threshold is reached comprises: when the first threshold is the count threshold N, fusing, by the computing node according to the model fusion algorithm, the m first local models received when the first threshold is reached, wherein m is greater than or equal to the count threshold N (¶60, Based on determining that the subset of client devices 112a and 112b satisfies this threshold number, the asynchronous training system 106 generates the adjusted global parameters. When generating the adjusted global parameters, in some implementations, the asynchronous training system 106 determines an average or weighted average of the sets of modified parameter indicators 204a and 204b and adjusts the global parameters 202 based on the average or weighted average); and when the first threshold is the time threshold L, fusing, by the computing node according to the model fusion algorithm, m first local models received in L time units (¶93, the asynchronous training system 106 waits the threshold time 414… Prior to expiration of the threshold time 414, the asynchronous training system 106 receives modified parameter indicators 408d from the client device 406d. Accordingly, the asynchronous training system 106 adds the modified parameter indicators 408d to the modified parameter indicators 408a-408c and generates adjusted global parameters based on the modified-parameter-indicator sets 408a-408d – also see Fig. 4); or when the first threshold comprises the count threshold N and the time threshold L, and either threshold of the count threshold N and the time threshold L is reached, fusing, by the computing node according to the model fusion algorithm, the m received first local models.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to update by combining local models based on a count threshold and/or a time threshold as disclosed by Choudhary in the method disclosed by Garg in view of Chen to speed up training while including slower devices (¶119, expedites training iterations by preventing the asynchronous training system 106 from waiting for all client devices K to send parameter update differentials before adjusting global parameters – and – ¶26, ensure that client devices with slower response times contribute to the resulting model, without unduly slowing the training process waiting for unresponsive client devices).
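For illustration, the three gating alternatives of claim 3 as mapped to Choudhary (¶60, ¶93) can be sketched in Python; the names and the use of seconds as the time unit are hypothetical:

```python
# Minimal sketch (hypothetical names): fuse at a count threshold N (¶60),
# at a time threshold of L time units (¶93), or when either is reached.
import time

def should_fuse(received, start_time,
                count_threshold=None, time_threshold=None):
    # received: local models collected so far this round.
    count_hit = (count_threshold is not None
                 and len(received) >= count_threshold)   # m >= N
    time_hit = (time_threshold is not None
                and time.time() - start_time >= time_threshold)  # L elapsed
    return count_hit or time_hit
```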
Regarding claim 8, it is an apparatus claim that recites limitations similar to those of claim 1 and is rejected on the same grounds set forth above.
Regarding claim 9, Garg in view of Chen fails to teach: The communication apparatus according to claim 8, wherein the first threshold comprises a time threshold L and/or a count threshold N, N is an integer greater than or equal to 1, the time threshold L is a preset quantity of time units configured to upload a local model in each round of iteration, and L is an integer greater than or equal to 1; or when the first threshold is the count threshold N, the processing unit is specifically configured to, when the first threshold is reached, fuse, according to the model fusion algorithm, the m first local models received when the first threshold is reached, wherein m is greater than or equal to the count threshold N; when the first threshold is the time threshold L, the processing unit is specifically configured to, when the first threshold is reached, fuse, according to the model fusion algorithm, m first local models received in L time units; or when the first threshold comprises the count threshold N and the time threshold L, and either threshold of the count threshold N and the time threshold L is reached, fuse the m received first local models according to model fusion algorithm.
However, in the same field of endeavor, Choudhary teaches: wherein the first threshold comprises a time threshold L and/or a count threshold N, N is an integer greater than or equal to 1 (¶108, a threshold number of client devices – i.e., a number >= 1), the time threshold L is a preset quantity of time units configured to upload a local model in each round of iteration, and L is an integer greater than or equal to 1 (¶92, waits for modified parameter indicators from the client devices 406d and 406e subject to a threshold time – ¶93, the threshold time may be any timeframe, including, but not limited to, thirty seconds, five minutes, or one hour); or (examiner's note: the antecedent of this "or" in the claim is unclear; the alternatives are nonetheless mapped in order below) when the first threshold is the count threshold N, the processing unit is specifically configured to, when the first threshold is reached, fuse, according to the model fusion algorithm, the m first local models received when the first threshold is reached, wherein m is greater than or equal to the count threshold N (¶60, Based on determining that the subset of client devices 112a and 112b satisfies this threshold number, the asynchronous training system 106 generates the adjusted global parameters. When generating the adjusted global parameters, in some implementations, the asynchronous training system 106 determines an average or weighted average of the sets of modified parameter indicators 204a and 204b and adjusts the global parameters 202 based on the average or weighted average); when the first threshold is the time threshold L, the processing unit is specifically configured to, when the first threshold is reached, fuse, according to the model fusion algorithm, m first local models received in L time units (¶93, the asynchronous training system 106 waits the threshold time 414… Prior to expiration of the threshold time 414, the asynchronous training system 106 receives modified parameter indicators 408d from the client device 406d. Accordingly, the asynchronous training system 106 adds the modified parameter indicators 408d to the modified parameter indicators 408a-408c and generates adjusted global parameters based on the modified-parameter-indicator sets 408a-408d – also see Fig. 4); or when the first threshold comprises the count threshold N and the time threshold L, and either threshold of the count threshold N and the time threshold L is reached, fuse the m received first local models according to model fusion algorithm.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to update by combining local models based on a count threshold and/or a time threshold as disclosed by Choudhary in the apparatus disclosed by Garg in view of Chen, for the same reasons set forth for claims 2 and 3 above.
Regarding claim 13, it is a non-transitory computer-readable medium claim that recites limitations similar to those of claim 1 and is rejected on the same grounds set forth above.
Regarding claims 14 and 15, they recite limitations similar to those of claims 2 and 3, respectively, and are rejected on the same grounds set forth above.
Claims 4 and 10 are rejected under 35 U.S.C. 103 as being unpatentable over Garg in view of Chen and Choudhary as applied to claim 1 above, and further in view of Wu et al. (“FedMed: A Federated Learning Framework for Language Modeling”, 2020), herein Wu.
Regarding claim 4, Garg in view of Chen and Choudhary fails to explicitly teach: The method according to claim 1, wherein the first parameter further comprises a first contribution vector, and the first contribution vector comprises contribution proportions of the K subnodes in the first global model.
However, in the same field of endeavor, Wu teaches: wherein the first parameter further comprises a first contribution vector, and the first contribution vector comprises contribution proportions of the K subnodes in the first global model (pg. 5, Section 3.2, ¶5, According to the Equation (2) and (3), it can be found that the model ratio… has a positive relation with the matrices disparity… which represents the worker model holds a more important weight for global aggregation when it shows higher difference from the server model – also see Eq. (5) for the update rule, which uses the contribution proportions).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to use a contribution vector as disclosed by Wu in the method disclosed by Garg in view of Chen and Choudhary to account for dataset variation other than size (pg. 5, ¶2, Each worker model is rendered identical impact on global model by FedAvg despite different worker devices. Theoretically, local training models should be given distinctive ratios due to different data distribution – and – ¶3, the similarity between feature spaces is not always consistent, and thus the global feature space should be constructed in a personalized pattern).
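For illustration, a minimal sketch of the disparity-based contribution idea cited from Wu (Section 3.2): a subnode whose parameters differ more from the global model receives a larger proportion. The L2 distance and the normalization here are hypothetical simplifications, not Wu's exact Equations (2)-(5):

```python
# Minimal sketch (hypothetical names): contribution proportion per subnode,
# larger for higher disparity from the global model (Wu, Section 3.2).
import numpy as np

def contribution_vector(global_model, local_models):
    disparities = np.array([np.linalg.norm(m - global_model)
                            for m in local_models])
    if disparities.sum() == 0.0:            # identical models: equal shares
        return np.full(len(local_models), 1.0 / len(local_models))
    return disparities / disparities.sum()  # proportions sum to 1
```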
Regarding claim 10, it recites limitations similar to those of claim 4 and is rejected on the same grounds set forth above.
Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Garg in view of Chen, Choudhary and Wu as applied to claim 4 above, and further in view of Xie et al. (“Asynchronous Federated Optimization”, 2019), herein Xie.
Regarding claim 5, Garg further teaches: The method according to claim 4, wherein the fusing, by the computing node according to a model fusion algorithm, m received first local models, to generate a second global model comprises: determining, by the computing node, a first fusion weight based on… a first sample proportion vector… wherein the first fusion weight comprises a weight of each local model of the m first local models upon model fusion with the first global model, and the first sample proportion vector comprises a proportion of a local dataset of each subnode of the K subnodes in all local datasets of the K subnodes (¶108, parameter update module 128 can generate… non-negative integer ηk that can be used by global update system 112 to give more or less weight to updated local parameters from different edge servers… ηk denotes a number of rows in Ak. Hence, more weight can be given to edge servers with greater numbers of data points – the parameter update is calculated in part by weighting the local updates by ηk/η as seen in the update rules found in ¶91); and determining, by the computing node, the second global model based on the first fusion weight, the m first local models, and the first global model (¶91, a weighted average is computed with weights determined from designators included in parameter updates received from the plurality of edge servers, and more weight is given to edge servers having a greater number of data points – and – ¶92, Global update module 140 pushes the updates of the global parameters to the edge servers).
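For illustration, the sample-proportion weighting cited from Garg (¶91, ¶108) reduces to each subnode's share of the total data points; variable names here are hypothetical:

```python
# Minimal sketch (hypothetical names): first sample proportion vector,
# eta_k / eta, where eta_k is the number of data points (rows) at subnode k
# and eta is the total across subnodes (Garg ¶91, ¶108).
def sample_proportion_vector(rows_per_subnode):
    total = sum(rows_per_subnode)                         # eta
    return [eta_k / total for eta_k in rows_per_subnode]  # eta_k / eta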
Garg in view of Chen and Choudhary fails to teach: the first contribution vector…
However, in the same field of endeavor, Wu teaches: the first contribution vector (pg. 5, Section 3.2, Eq. (5) includes a “fusion weight” update rule that uses the contribution proportions).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to use a contribution vector as disclosed by Wu in the method disclosed by Garg in view of Chen and Choudhary to account for dataset variation other than size (pg. 5, ¶2, Each worker model is rendered identical impact on global model by FedAvg despite different worker devices. Theoretically, local training models should be given distinctive ratios due to different data distribution – and – ¶3, the similarity between feature spaces is not always consistent, and thus the global feature space should be constructed in a personalized pattern).
Garg in view of Chen, Choudhary and Wu fails to teach: considering the first version number t' corresponding to the m first local models in the fusion weight.
However, in the same field of endeavor, Xie teaches: considering the first version number t' corresponding to the m first local models in the fusion weight (pg. 3, Section 4, ¶1, In the tth epoch, the server receives a locally trained model – “t” is the current epoch, and the difference between “t” and the local model epoch “τ” is used to calculate the staleness of model updates – pg. 3, col. 1, controls the staleness (t − τ in the updater thread). τ is defined on pg. 2, Table 1, where the entry hτi references τ as the local epoch. Algorithm 1 on pg. 3 shows the staleness being used to weight worker updates).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to use the model version number to determine a fusion weight as disclosed by Xie in the method disclosed by Garg in view of Chen, Choudhary and Wu to improve model performance when clients have varied availability (pg. 9, col. 2, ¶5-7, Unlike FedAvg, stragglers’ updates will not be dropped. When the staleness is small, FedAsync converges much faster… FedAsync can schedule training tasks even if the workers are currently ineligible, since the server does not wait until the workers respond… Compared to FedAvg, FedAsync can handle more workers running in parallel since all the updates on the server and the workers are non-blocking).
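For illustration, a minimal sketch of the staleness weighting cited from Xie (Algorithm 1): the mixing weight of a local model decays with its staleness t − τ, e.g. polynomially as s(t − τ) = (t − τ + 1)^(−a). The constants alpha and a below are hypothetical choices:

```python
# Minimal sketch (hypothetical constants): FedAsync-style staleness weighting.
def staleness_weight(t, tau, alpha=0.6, a=0.5):
    # alpha_t = alpha * s(t - tau); smaller for staler version numbers tau.
    return alpha * (t - tau + 1) ** (-a)

def fuse_stale(global_model, local_model, t, tau):
    # Mixing update: x_t = (1 - alpha_t) * x_{t-1} + alpha_t * x_new.
    alpha_t = staleness_weight(t, tau)
    return (1 - alpha_t) * global_model + alpha_t * local_model
```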
Allowable Subject Matter
Claims 6, 7, 11 and 12 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Jiang et al. (US 20190171952 A1) discusses tracking timestamps of when global models send updates in a distributed machine learning system, and Nabati et al. (US 20190303828 A1) discusses managing resource allocation requests in relation to a total capacity and various probabilities.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to HARRISON CHAN YOUNG KIM whose telephone number is (571)272-0713. The examiner can normally be reached Monday - Thursday 10:00 am - 7:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, CESAR PAULA, can be reached at (571) 272-4128. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/HARRISON C KIM/ Examiner, Art Unit 2145
/CESAR B PAULA/ Supervisory Patent Examiner, Art Unit 2145