Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This Office action is in response to the submission of the application on 5/4/2023.
Claims 1-20 are presented for examination.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-20 are rejected under 35 U.S.C. § 103 as being unpatentable over Marfoq et al. (Federated Multi-Task Learning under a Mixture of Distributions, herein Marfoq), Shamsian et al. (Personalized Federated Learning using Hypernetworks, herein Shamsian), and Wu et al. (BlockDrop: Dynamic Inference Paths in Residual Networks, herein Wu).
Regarding claim 1,
Marfoq teaches a computer-implemented method for personalizing heterogeneous clients (Marfoq, page 2, paragraph 7, line 1 “In our framework, a personalized model is a linear combination of M shared component models. All clients jointly learn the M components, while each client learns its personalized mixture weights.” and, page 70, paragraph 1, line 1 “We ran the experiments on a CPU/GPU cluster, with different GPUs available (e.g., Nvidia Tesla V100, GeForce GTX 1080 Ti, Titan X, Quadro RTX 6000, and Quadro RTX 8000).” In other words, CPU/GPU cluster is computer, framework is computer-implemented method, personalized is personalized, and clients is clients.), the method comprising:
initializing a federated modular network including a plurality of clients communicating with a server (Marfoq, Algorithm 2,
[Marfoq, Algorithm 2, reproduced as media_image1.png]
In other words, initialization is initializing, clients is plurality of clients, and server is server.);
maintaining, within the server, a heterogenous module pool having sub-blocks and a [routing hypernetwork] (Marfoq, page 2, paragraph 4, line 1 “More recently, [57] proposed pFedHN. pFedHN feeds local clients’ representations to a global (across clients) hypernetwork, which can output personalized heterogeneous models. Unfortunately, the hypernetwork has a large memory footprint already for small clients’ models (e.g., the hypernetwork in the experiments in [57] has 100 more parameters than the output model).” And, page 2, paragraph 7, line 1 “In our framework, a personalized model is a linear combination of M shared component models. All clients jointly learn the M components, while each client learns its personalized mixture weights. We show that federated EM-like algorithms can be used for training.” In other words, M shared component models…each client learns its personalized mixture weights is heterogenous module pool.);
[partitioning the plurality of clients by modeling a joint distribution of each client into clusters];
enabling each client to make a decision in each update to assemble a personalized model by [selecting a combination of sub-blocks] from the heterogenous module pool (Marfoq, page 2, paragraph 7, line 1 “In our framework, a personalized model is a linear combination of M shared component models. All clients jointly learn the M components, while each client learns its personalized mixture weights. We show that federated EM-like algorithms can be used for training. In particular, we propose FedEM and D-FedEM for the client-server and the fully decentralized settings, respectively, and we prove convergence guarantees. Our approach also provides a principled and efficient way to infer personalized models for clients unseen at training time.” In other words, component is module, and each client learns its personalized mixture of weights is assemble a personalized model.); and
generating, by the [routing hypernetwork], the decision for each client (Marfoq, see above mapping. In other words, infer personalized models is generating the decision for each client.).
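For illustration only, and not as part of the cited Marfoq disclosure or the claimed invention, the quoted mixture-of-components idea can be sketched as follows; the number of components, the feature dimension, and the mixture weights are assumed values chosen for the example.

```python
import numpy as np

# Illustrative sketch of the quoted Marfoq idea: each client's personalized
# model is a linear combination of M shared component models (toy linear
# models here), weighted by client-specific mixture weights.
M, d = 3, 5                                            # assumed component count and feature dim
components = [np.random.randn(d) for _ in range(M)]   # M shared component models (weight vectors)

def personalized_predict(x, mixture_weights):
    """Personalized prediction: mixture-weighted sum of the shared components' outputs."""
    outputs = np.array([w @ x for w in components])    # each shared component's prediction
    return mixture_weights @ outputs                    # client-specific linear combination

client_pi = np.array([0.7, 0.2, 0.1])                  # assumed mixture weights for one client
x = np.random.randn(d)
print(personalized_predict(x, client_pi))
```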
Marfoq teaches a hypernetwork by referencing the work of Shamsian. However, for the purpose of clarity, Shamsian is combined into Marfoq.
Thus far, Marfoq does not explicitly teach using routing hypernetworks. Shamsian teaches using routing hypernetworks (Shamsian, page 1, column 2, paragraph 2, line 1 “In this work, we describe a new approach named pFedHN for personalized Federated HyperNetwork, which aims to resolve these challenges using hypernetworks (Ha et al., 2017). A hypernetwork is a model that produces parameters for another neural networks. Our approach learns to smartly share parameters across clients by using a single joint hypernetwork to generate separate network parameters for every client.” In other words, a hypernetwork that smartly shares parameters across clients is a routing hypernetwork.).
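For illustration only, and not as the pFedHN implementation described in Shamsian, the quoted hypernetwork concept can be sketched as follows; the embedding size, layer sizes, and target-network shape are assumptions made for the example.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the quoted concept: a hypernetwork maps a client
# embedding to the parameters of a small per-client target network.
embed_dim, in_dim, out_dim = 8, 10, 2
hypernet = nn.Sequential(nn.Linear(embed_dim, 64), nn.ReLU(),
                         nn.Linear(64, in_dim * out_dim + out_dim))

def client_model(client_embedding):
    """Generate a personalized linear layer (weight + bias) for one client."""
    params = hypernet(client_embedding)
    W = params[: in_dim * out_dim].view(out_dim, in_dim)
    b = params[in_dim * out_dim:]
    return lambda x: x @ W.t() + b                      # the generated per-client model

f = client_model(torch.randn(embed_dim))                # one client's embedding (assumed)
print(f(torch.randn(1, in_dim)).shape)                  # torch.Size([1, 2])
```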
Both Marfoq and Shamsian are directed to federated learning, among other things. Marfoq teaches a computer-implemented method for personalizing heterogeneous clients, the method comprising initializing a federated modular network including a plurality of clients communicating with a server, maintaining, within the server, a heterogenous module pool having sub-blocks, partitioning the plurality of clients by modeling a joint distribution of each client into clusters, and, enabling each client to make a decision in each update to assemble a personalized model by selecting a combination of sub-blocks from the heterogenous module pool; but does not explicitly teach a routing hypernetwork. Shamsian teaches a routing hypernetwork.
In view of the teaching of Marfoq, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Shamsian into Marfoq. This would result in a computer-implemented method for personalizing heterogeneous clients, the method comprising initializing a federated modular network including a plurality of clients communicating with a server, maintaining, within the server, a heterogenous module pool having sub-blocks and a routing hypernetwork, partitioning the plurality of clients by modeling a joint distribution of each client into clusters, enabling each client to make a decision in each update to assemble a personalized model by selecting a combination of sub-blocks from the heterogenous module pool, and generating, by the routing hypernetwork, the decision for each client.
One of ordinary skill in the art would be motivated to do this to account for data disparities across clients and reduce communication costs. (Shamsian, Abstract, line 3 “The goal is to train personalized models collaboratively while accounting for data disparities across clients and reducing communication costs. We propose a novel approach to this problem using hypernetworks, termed pFedHN for personalized Federated HyperNetworks. In this approach, a central hypernetwork model is trained to generate a set of models, one model for each client. This architecture provides effective parameter sharing across clients while maintaining the capacity to generate unique and diverse personal models.”)
Thus far, as mapped above, the combination of Marfoq and Shamsian does not explicitly teach partitioning the plurality of clients by modeling a joint distribution of each client into clusters. Marfoq, however, describes this feature as part of another model (its clustered federated learning scenario). For clarity, that other model described in Marfoq is combined into the combination of Marfoq and Shamsian.
Marfoq teaches partitioning the plurality of clients by modeling a joint distribution of each client into clusters (Marfoq, page 5, paragraph 4, line 1 “Clustered Federated Learning [56, 20] assumes that each client belongs to one among C clusters and proposes that all clients in the same cluster learn the same model. Our framework recovers this scenario considering M = C and π*tc = 1 if task (client) t is in cluster c and π*tc = 0 otherwise.” And, page 3, paragraph 4, line 5 “The goal is then to solve (in parallel) the following optimization problems
[Marfoq, optimization problem equation, reproduced as media_image2.png]
where h_t : X → Δ^|Y| (Δ^D denoting the unitary simplex of dimension D), ℓ : Δ^|Y| × Y → R_+ is a loss function, and L_{D_t}(h_t) = E_{(x,y)∼D_t}[ℓ(h_t(x), y)] is the true risk of a model h_t under data distribution D_t. For (x, y) ∈ X × Y, we will denote the joint distribution density associated to D_t by p_t(x, y), and the marginal densities by p_t(x) and p_t(y).” In other words, C clusters is partitioning the plurality of clients into clusters, and joint distribution density is by modeling a joint distribution of each client.)
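For illustration only, and not as part of the cited Marfoq disclosure, the quoted true-risk definition L_{D_t}(h_t) is, in practice, approximated by an empirical average over a client's local samples; a minimal sketch follows, with a toy model, loss, and data assumed for the example.

```python
import numpy as np

# Illustrative sketch: approximate L_{D_t}(h_t) = E[l(h_t(x), y)] for one client
# by the empirical mean of the loss over that client's local samples.
def empirical_risk(h, loss, data):
    return np.mean([loss(h(x), y) for x, y in data])

h = lambda x: 1.0 / (1.0 + np.exp(-x))                  # toy scalar model (assumed)
log_loss = lambda p, y: -(y * np.log(p) + (1 - y) * np.log(1 - p))
client_data = [(0.5, 1), (-1.2, 0), (2.0, 1)]           # assumed local samples for one client
print(empirical_risk(h, log_loss, client_data))
```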
It would have been obvious to combine the additional model of Marfoq into the combination of Marfoq and Shamsian because both models are described in Marfoq itself and both models are directed to federated learning, among other things.
Thus far, the combination of Marfoq, Shamsian, and Marfoq does not explicitly teach selecting a combination of sub-blocks.
Wu teaches selecting a combination of sub-blocks (Wu, Equations 1 and 2, and page 4, column 1, paragraph 3, line 1 “The configurations in the context of ResNets represent decisions to keep/drop each block, where each decision to drop a block corresponds to removing a subset of paths from the network.” And, page 4, column 1, paragraph 5 “Formally, given an image x and a pretrained ResNet with K residual blocks, we define a policy of block-dropping
behavior as a K-dimensional Bernoulli distribution:
[Wu, block-dropping policy equation, reproduced as media_image3.png]
In other words, keeping or dropping each block is selecting a combination of sub-blocks.)
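For illustration only, and not as the BlockDrop implementation described in Wu, the quoted idea of a K-dimensional Bernoulli keep/drop policy can be sketched as follows; the number of blocks and the per-block keep probabilities are assumed values.

```python
import numpy as np

# Illustrative sketch of the quoted idea: each of K residual blocks gets an
# independent Bernoulli keep probability, and a sampled binary vector decides
# which blocks to execute (a selected combination of sub-blocks).
K = 6
keep_probs = np.array([0.9, 0.8, 0.3, 0.7, 0.5, 0.95])     # assumed per-block policy outputs
policy = (np.random.rand(K) < keep_probs).astype(int)      # sampled keep(1)/drop(0) decisions
active_blocks = np.nonzero(policy)[0]
print("keep/drop vector:", policy, "-> execute blocks", active_blocks)
```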
Both Wu and the combination of Marfoq, Shamsian, and Marfoq are directed to neural network models, among other things. The combination of Marfoq, Shamsian, and Marfoq teaches a computer-implemented method for personalizing heterogeneous clients, the method comprising initializing a federated modular network including a plurality of clients communicating with a server, maintaining, within the server, a heterogenous module pool having sub-blocks and a routing hypernetwork, partitioning the plurality of clients by modeling a joint distribution of each client into clusters, and enabling each client to make a decision in each update to assemble a personalized model; but does not explicitly teach selecting a combination of sub-blocks. Wu teaches selecting a combination of sub-blocks.
In view of the teaching of the combination of Marfoq, Shamsian, and Marfoq, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Wu into the combination of Marfoq, Shamsian, and Marfoq. This would result in a computer-implemented method for personalizing heterogeneous clients, the method comprising initializing a federated modular network including a plurality of clients communicating with a server, maintaining, within the server, a heterogenous module pool having sub-blocks and a routing hypernetwork, partitioning the plurality of clients by modeling a joint distribution of each client into clusters, and enabling each client to make a decision in each update to assemble a personalized model by selecting a combination of sub-blocks.
One of ordinary skill in the art would be motivated to do this in order to reduce the computational expense of deep neural networks. (Wu, abstract, line 1 “Very deep convolutional neural networks offer excellent recognition results, yet their computational expense limits their impact for many real-world applications. We introduce BlockDrop, an approach that learns to dynamically choose which layers of a deep network to execute during inference so as to best reduce total computation without degrading prediction accuracy.”)
Regarding claim 2,
The combination of Marfoq, Shamsian, Marfoq, and Wu teaches the computer-implemented method of claim 1, wherein
the federated modular network adopts modular networks including a group of encoders in a first layer and multiple modular blocks in subsequent layers (Marfoq, page 69, paragraph 2, line 3 “We used Pachinko allocation [54] with parameters α = 0.4 and β = 10 to partition CIFAR-100 on 100 clients. For both of them we train MobileNet-v2 [55] architecture with an additional linear layer. We used TorchVision [45] implementation of MobileNet-v2.” In other words, TorchVision MobileNet-v2 is a modular network with a first layer that is an encoder layer, with multiple subsequent layers.)
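For illustration only, and not the cited MobileNet-v2 architecture, a modular network of the kind recited in the claim (a group of encoders in a first layer, modular blocks in subsequent layers) can be sketched as follows; all layer sizes and the routing choices are assumptions for the example.

```python
import torch
import torch.nn as nn

# Illustrative sketch: first layer is a group of encoders; subsequent layers
# are modular blocks that a client can route through in a chosen combination.
encoders = nn.ModuleList([nn.Linear(16, 32) for _ in range(2)])        # encoder group (first layer)
blocks = nn.ModuleList([nn.Sequential(nn.Linear(32, 32), nn.ReLU())    # modular blocks (later layers)
                        for _ in range(4)])

def forward(x, encoder_idx, active_block_idxs):
    h = torch.relu(encoders[encoder_idx](x))
    for i in active_block_idxs:                  # route through the selected modular blocks
        h = blocks[i](h)
    return h

print(forward(torch.randn(1, 16), encoder_idx=0, active_block_idxs=[0, 2]).shape)
```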
Regarding claim 3,
The combination of Marfoq, Shamsian, Marfoq, and Wu teaches the computer-implemented method of claim 2, wherein
connection decisions between the multiple modular blocks in the modular networks are made by the routing hypernetwork (Shamsian, page 1, column 2, paragraph 2, line 5 “Our approach learns to smartly share parameters across clients by using a single joint hypernetwork to generate separate network parameters for every client.” In other words, share parameters across clients by using a hypernetwork is connection decisions between the multiple blocks are made by the routing hypernetwork. Examiner notes that modular blocks is previously mapped. See mapping of claim 1.).
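For illustration only, and not as part of the cited Shamsian disclosure or the claimed invention, one way a routing hypernetwork could emit connection decisions between modular blocks is sketched below; the embedding size, block count, and thresholding rule are assumptions for the example.

```python
import torch
import torch.nn as nn

# Illustrative sketch: a routing hypernetwork maps a client embedding to logits
# over block-to-block connections; thresholding the probabilities gives that
# client's binary connection decisions between modular blocks.
embed_dim, num_blocks = 8, 4
router = nn.Sequential(nn.Linear(embed_dim, 32), nn.ReLU(),
                       nn.Linear(32, num_blocks * num_blocks))

client_embedding = torch.randn(embed_dim)                       # assumed client representation
probs = torch.sigmoid(router(client_embedding)).view(num_blocks, num_blocks)
connections = (probs > 0.5).int()                               # binary connection decisions
print(connections)
```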
Regarding claim 4,
The combination of Marfoq, Shamsian, Marfoq, and Wu teaches the computer-implemented method of claim 1, wherein
the decision that is parameterized by the routing hypernetwork is a vector of discrete variables following a Bernoulli distribution (Wu, page 4, column 1, paragraph 5 “Formally, given an image x and a pretrained ResNet with K residual blocks, we define a policy of block-dropping behavior as a K-dimensional Bernoulli distribution…” and, page 4, column 1, paragraph 3, line 5 “To derive the optimal dropping strategy given an input instance, we develop a policy network to output a binary policy vector, representing the actions to keep or drop a block in a pretrained ResNet.” In other words, binary policy vector is vector of discrete variables, Bernoulli distribution is Bernoulli distribution, and dropping strategy… based on a policy vector is the decision that is parameterized by a vector of variables following a Bernoulli distribution.).
Regarding claim 5,
The combination of Marfoq, Shamsian, Marfoq, and Wu teaches the computer-implemented method of claim 1, wherein
each client with similar decisions is assigned into a same cluster in each communication round (Shamsian, Figure 4, and, page 8, column 2, paragraph 2, line 1 “In our experiments, we learn to represent each client using a trainable embedding vector vi. These embedding vectors, therefore, learn a continuous semantic representation over the set of clients. The smooth nature of this representation gives the HN the power to share information across clients.” And, page 8, column 2, paragraph 4 “In Figure 4 we project the learned embedding vectors into R2 using the t-SNE algorithm (Maaten & Hinton, 2008). A clear structure is presented, in which clients from the same group (in terms of coarse labels) are clustered together.”
[Shamsian, Figure 4, t-SNE projection of the learned client embedding vectors, reproduced as media_image4.png]
Examiner notes that the specification of the instant application recites “A decision parameterized by the routing hypernetwork 110 is a vector of discrete variables following the Bernoulli distribution.” (Specification, paragraph [0023], line 10.) Therefore, the examiner interprets a decision as a learned vector of discrete variables that represents a client. In other words, using the vectors to form clusters is each client with similar decisions is assigned into a same cluster.)
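For illustration only, and not as part of the cited disclosures or the claimed invention, grouping clients whose decision vectors are similar into the same cluster in a communication round can be sketched as follows; the decision vectors and the number of clusters are assumed values.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative sketch: clients with similar binary decision vectors are assigned
# to the same cluster in a given communication round.
decisions = np.array([[1, 0, 1, 1],     # client 0
                      [1, 0, 1, 0],     # client 1 (similar to client 0)
                      [0, 1, 0, 0],     # client 2
                      [0, 1, 0, 1]])    # client 3 (similar to client 2)
labels = KMeans(n_clusters=2, n_init=10).fit_predict(decisions)
print("cluster assignment per client:", labels)
```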
Regarding claim 6,
The combination of Marfoq, Shamsian, Marfoq, and Wu teaches the computer-implemented method of claim 1, wherein
each client uploads only a subset of model parameters to the server to decrease a communication cost between the plurality of clients and the server (Marfoq, Algorithm 1,
[Marfoq, Algorithm 1, reproduced as media_image5.png]
In other words, line 8 of Algorithm 1 shows the client uploading a subset of parameters to the server.)
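For illustration only, and not as part of the cited Marfoq disclosure, uploading only a subset of model parameters to the server can be sketched as follows; the model layout and the set of active blocks are assumptions for the example.

```python
import torch.nn as nn

# Illustrative sketch: a client transmits only the parameters of the blocks it
# actually selected, rather than its full model, reducing communication cost.
model = nn.ModuleDict({"block0": nn.Linear(4, 4),
                       "block1": nn.Linear(4, 4),
                       "block2": nn.Linear(4, 4)})
active = {"block0", "block2"}                               # blocks this client used (assumed)

upload = {name: tensor for name, tensor in model.state_dict().items()
          if name.split(".")[0] in active}                  # subset of parameters to send
print(sorted(upload.keys()))
```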
Regarding claim 7,
The combination of Marfoq, Shamsian, Marfoq, and Wu teaches the computer-implemented method of claim 1, wherein when copying the personalized model from the server,
a client of the plurality of clients only copies parameters of active blocks from a global model to reduce unnecessary communication costs between the plurality of clients and the server (Shamsian, Figure 1,
[Shamsian, Figure 1, reproduced as media_image6.png]
In other words, the HN, located on the server, communicating a personal model θi for each client is a client only copying parameters of active blocks from a global model.)
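For illustration only, and not as part of the cited disclosures or the claimed invention, copying only the parameters of active blocks from a global model into a local model can be sketched as follows; the architecture and the active-block set are assumed values.

```python
import torch.nn as nn

# Illustrative sketch: when refreshing from the server, the client copies only
# the parameters of its active blocks out of the global model, leaving the
# remaining local parameters untouched.
def make_model():
    return nn.ModuleDict({f"block{i}": nn.Linear(4, 4) for i in range(3)})

global_model, local_model = make_model(), make_model()
active = {"block1"}                                           # blocks this client routes through (assumed)

partial = {k: v for k, v in global_model.state_dict().items()
           if k.split(".")[0] in active}
local_model.load_state_dict(partial, strict=False)            # copy only the active blocks
print(sorted(partial.keys()))
```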
Claims 8-14 are computer program product claims that correspond to computer-implemented method claims 1-7, respectively, where the computer program product comprises a non-transitory computer readable storage medium for storing program instructions. Otherwise, they are not patentably distinct. The combination of Marfoq, Shamsian, Marfoq, and Wu teaches a computer program product that comprises a non-transitory computer readable storage medium for storing program instructions (Marfoq, page 70, paragraph 1, line 1 “We ran the experiments on a CPU/GPU cluster, with different GPUs available (e.g., Nvidia Tesla V100, GeForce GTX 1080 Ti, Titan X, Quadro RTX 6000, and Quadro RTX 8000). Most experiments with CIFAR10/CIFAR-100 and EMNIST were run on GeForce GTX 1080 Ti cards, while most experiments with Shakespeare and FEMNIST were run on the Quadro RTX 8000 cards.” Examiner notes that GeForce GTX 1080 Ti cards, etc., contain memory and storage for storing computer instructions.) Therefore, claims 8-14 are rejected for the same reasons as claims 1-7, respectively.
Claims 15-20 are computer processing system claims corresponding to computer-implemented method claims 1-6, respectively. Otherwise, they are not patentably distinct. The combination of Marfoq, Shamsian, Marfoq, and Wu teaches a computer processing system. (Marfoq, See above mapping. In other words, CPU/GPU cluster is a computer processing system.) Therefore, claims 15-20 are rejected for the same reasons as claims 1-6, respectively.
The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure.
Fontoura et al. (US 8,103,748 B2), “Rule-Based Method and System for Managing Heterogenous Computer Clusters,” discloses a system and method for managing heterogenous clusters that asynchronously accesses operating information received from the clusters to send the necessary information, which is then evaluated against rules supplied by the clusters.
Goodsitt et al. (US 12,361,332 B2), “Systems and Methods for Federated Learning Optimization Via Cluster Feedback,” discloses a method for generating a cluster-based machine learning model based on federated learning with cluster feedback that includes providing a current machine learning model to a plurality of user devices that train the current machine learning model.
Litany et al. (US 2022/0391781 A1), “Architecture-Agnostic Federated Learning System,” discloses a method performed by a server that sends copies of a set of parameters of a hyper network (HN) to at least one client device and receives, from each client device in the at least one client device, a corresponding set of updated parameters of the HN.
Sattler et al., “Clustered Federated Learning: Model-Agnostic Distributed Multitask Optimization Under Privacy Constraints,” discloses clustered FL (CFL), a novel federated multitask learning (FMTL) framework, which exploits geometric properties of the FL loss surface to group the client population into clusters with jointly trainable data distributions.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to BART RYLANDER whose telephone number is (571)272-8359. The examiner can normally be reached Monday - Thursday 8:00 to 5:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang can be reached at 571-270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Bart I Rylander/Examiner, Art Unit 2124