Prosecution Insights
Last updated: April 19, 2026
Application No. 17/770,919

Online Federated Learning of Embeddings

Non-Final OA §103
Filed: Apr 21, 2022
Examiner: CAMPOS, ALFREDO
Art Unit: 2129
Tech Center: 2100 — Computer Architecture & Software
Assignee: Google LLC
OA Round: 3 (Non-Final)

Grant Probability: 83% (Favorable)
Predicted OA Rounds: 3-4
Time to Grant: 3y 9m
Grant Probability with Interview: 99%

Examiner Intelligence

Career allow rate: 83% (5 granted / 6 resolved; +28.3% vs Tech Center average, above average)
Interview lift: +33.3% on resolved cases with an interview
Typical timeline: 3y 9m average prosecution; 26 applications currently pending
Career history: 32 total applications across all art units

Statute-Specific Performance

§101: 33.3% (-6.7% vs TC avg)
§103: 42.8% (+2.8% vs TC avg)
§102: 3.9% (-36.1% vs TC avg)
§112: 20.0% (-20.0% vs TC avg)
TC averages are estimates; based on career data from 6 resolved cases.

Office Action

§103
Notice of Pre-AIA or AIA Status: The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Objections: Claim 2 is objected to because of the following informalities: claim 2, lines 2-3, recites a repeated "the"; the duplicate should be removed so that the limitation reads "the slice of embeddings comprises". Appropriate correction is required.

Claim Rejections - 35 USC § 103: The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 2, 3, 7, 8, 9, 11-18, 21, and 22 are rejected under 35 U.S.C. 103 as being unpatentable over Ji et al., Learning Private Neural Language Modeling with Attentive Aggregation (2019) ("Ji") in view of Green (US20180157989A1) ("Green") and Tang et al., Sentiment Embeddings with Applications to Sentiment Analysis, IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 2, pp. 496-509, 1 Feb. 2016 ("Tang").

Regarding claim 1 and analogous claims 17 and 18, Ji teaches a computer-implemented method for federated learning of embeddings with reduced communication cost and increased privacy, the method comprising: (Ji Page 1 and 2, I. Introduction, Para. 5-6, In the real-world scenario, users' language input and preferences are sensitive and may contain some private content including private personal profiles, financial records, passwords, and social relations.
Thus, to protect the user's privacy, a federated learning technique with data protection is a promising solution. In this paper, we take this application as learning word-level private neural language modeling for each user [federated learning of embeddings]. Federated learning learns a shared global model by the aggregation of local models on client devices [A computer-implemented method]. But the original paper on federated learning [1] only uses a simple average on client models, taking the number of samples in each client device as the weight of the average. In the mobile keyboard applications, language preferences may vary from individual to individual. The contributions of client language models to the central server are quite different. To learn a generalized private language model that can be quickly adapted to different people's language preferences, knowledge transferring between server and client, especially the well trained clients models, should be considered. Page 5, Communication cost for parameter uploading and downloading between the clients and server is another important issue for decentralized learning. Communication, both wired and wireless, depends on Internet bandwidth highly and has an impact on the performance of federated optimization. To save the capacity of network communication, decentralized training should be more communication-efficient. Several approaches apply compression methods to achieve efficient communication. Our method accelerates the training through the optimization of the global server as it can converge more quickly than its counterparts [with reduced communication cost]. Page 5, D. Differential Privacy, Para. 1, To protect the client's data from an inverse engineering attack, we apply the randomized mechanism into federated learning [2]. This ensures differential privacy on the client side without revealing the client's data [2].
This differentially private randomization was firstly proposed to apply on the federated averaging, where a white noise with the mean of 0 and the standard deviation of σ is added to the client parameters in Equation 5 [increased privacy]): for each of one or more federated learning iterations: (Ji et al. Page 3, A. Preliminaries of Federated Learning, Para. 1, Federated learning decouples the model training and data collection [1]. To learn a well generalized model, it uses model aggregation on the server side, which is similar to the works on meta-learning by learning a good initialization for quick adaptation [25], [26] and transfer learning by transferring knowledge between domains [27]. The basic federated learning framework comprises two main parts, i.e., server optimization in Algorithm 1 and local training in Algorithm 2 [1] [for each of one or more federated learning iterations]). collecting, by a client computing device, a local dataset [that identifies one or more positive entities, wherein a positive entity is an entity identified in the local dataset], wherein the local dataset is stored locally by the client computing device, and wherein the local dataset is not directly accessible to a server computing device (Ji Page 3, Private Model Update. Each online selected client receives the server model and performs secure local training on their own devices. For the neural language modeling, stochastic gradient descent is performed to update their GRU-based client models which is introduced in Section III-C. After several epochs of training, the clients send the parameters of their models to the central server over a secure connection. During this local training, user data can be stored on their own devices [wherein the local dataset is stored locally by the client computing device, and wherein the local dataset is not directly accessible to a server computing device].
Ji, Algorithm 2, Secure Local Training on Clients [image of Algorithm 2 omitted] (i.e., at lines 5-9, collecting, by a client computing device, a local dataset)); learning, by the client computing device [and based at least in part on the local dataset and the slice of embeddings, updated values for one or more of the embeddings contained in the slice of embeddings associated with the machine-learned model] (Ji Page 3, Algorithm 2, Secure Local Training on Clients [image of Algorithm 2 omitted]); and communicating, by the client computing device, information descriptive of the updated values for the one or more of the embeddings contained in the slice of embeddings to the server computing device (Ji Page 3, III. PROPOSED METHOD, A. Preliminaries of Federated Learning, Para. 3, Private Model Update. Each online selected client receives the server model and performs secure local training on their own devices. For the neural language modeling, stochastic gradient descent is performed to update their GRU-based client models which is introduced in Section III-C. After several epochs of training, the clients send [communicating] the parameters [information descriptive] of their models to the central server over a secure connection. During this local training, user data can be stored on their own devices. (See Ji Page 3, Algorithm 2, Secure Local Training on Clients, line 10)).
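For orientation, the workflow the rejection maps onto Ji (server broadcast, secure local training by SGD, a differentially private noisy upload, and server-side aggregation) can be sketched in a few lines of Python. This is a minimal illustration, not Ji's code: the linear model is a toy, plain FedAvg averaging stands in for Ji's attentive aggregation, and every function name is invented for the sketch.

```python
import numpy as np

def local_update(server_w, local_data, lr=0.1, epochs=3):
    # Secure local training (cf. Ji, Algorithm 2): plain SGD on the
    # client's private dataset, which never leaves the device.
    w = server_w.copy()
    for _ in range(epochs):
        for x, y in local_data:
            w -= lr * 2 * (w @ x - y) * x  # toy squared-error gradient
    return w

def private_upload(w, sigma, rng):
    # Differentially private randomization: zero-mean Gaussian noise with
    # standard deviation sigma is added to the client parameters before
    # they are sent to the server.
    return w + rng.normal(0.0, sigma, size=w.shape)

def federated_round(server_w, clients, sigma=0.01, seed=0):
    # One federated iteration: broadcast, local training, noisy upload,
    # and simple averaging on the server.
    rng = np.random.default_rng(seed)
    uploads = [private_upload(local_update(server_w, data), sigma, rng)
               for data in clients]
    return np.mean(uploads, axis=0)
```

Only the noised parameters cross the network; the per-client `local_data` stays on the device, which is the privacy property the rejection reads onto Ji.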
Ji does not explicitly teach [collecting, by a client computing device, a local dataset] that identifies one or more positive entities, wherein a positive entity is an entity identified in the local dataset, [wherein the local dataset is stored locally by the client computing device, and wherein the local dataset is not directly accessible to a server computing device]; obtaining, by the client computing device and from the server computing device, a slice of embeddings, wherein the slice of embeddings comprises a subset of a larger vocabulary of embeddings, wherein the slice of embeddings comprises embeddings that have been previously grouped together based on at least one of a semantic similarity and a semantic association; wherein the slice of embeddings include one or more positive entity embeddings respectively associated with the one or more positive entities identified in the local dataset and one or more negative entity embeddings respectively associated with one or more negative entities, wherein a negative entity is an entity not identified in the local dataset, thereby forming a superset of embeddings that obfuscates which of the embeddings contained in the slice of embeddings correspond to positive entities identified within the local dataset; [learning, by the client computing device] and based at least in part on the local dataset and the slice of embeddings, updated values for one or more of the embeddings contained in the slice of embeddings associated with the machine-learned model; [and communicating, by the client computing device,] information descriptive of the updated values for the one or more of the embeddings contained in the slice of embeddings associated with the machine-learned model [to the server computing device].
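The distinguishing limitation, as recited, is essentially a request for an obfuscating superset: the client asks for the embeddings of its true (positive) entities mixed with decoy negatives, so the server cannot tell which entities actually occur in the local data. A minimal sketch of that step follows; the names and parameters are hypothetical, and this illustrates the claim language, not any cited reference.

```python
import random

def build_obfuscated_slice(positive_ids, vocabulary_ids,
                           decoys_per_positive=4, seed=0):
    # Form the superset the claim recites: the client's positive entity
    # IDs plus sampled negative (decoy) IDs, shuffled so that the
    # requested slice does not reveal which entities appear locally.
    rng = random.Random(seed)
    positives = set(positive_ids)
    negatives = [i for i in vocabulary_ids if i not in positives]
    decoys = rng.sample(negatives, decoys_per_positive * len(positives))
    slice_ids = list(positive_ids) + decoys
    rng.shuffle(slice_ids)
    return slice_ids
```

The client would then request exactly these IDs from the server and receive the corresponding embedding rows, a subset of the full vocabulary.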
However, Green teaches [collecting, by a client computing device, a local dataset] that identifies one or more positive entities, wherein a positive entity is an entity identified in the local dataset, [wherein the local dataset is stored locally by the client computing device, and wherein the local dataset is not directly accessible to a server computing device] (Green Para. 0040, The item-context information retrieval module 204 can be configured to retrieve information associated with a training instance. In certain embodiments, a particular computing machine in the embedding system can receive a training call, i.e., an API call including a training instance. However, information relevant to the training call may be distributed across multiple computing machines in the embedding system. The item-context information retrieval module 204 can be configured to retrieve information for the training instance from multiple computing machines in the embedding system. As discussed above, a training instance can include an item-context pair comprising an item element and one or more context elements. Para 0043, The positive downsampling module 206 can be configured to determine whether or not a training instance will be useful for updating embeddings, i.e., to make a usefulness determination for each training instance.
Unhelpful training instances can be skipped, i.e., if a training instance is determined to be unhelpful, that training instance will not be used for updating embeddings [that identifies one or more positive entities, wherein a positive entity is an entity identified in the local dataset]); obtaining, by the client computing device and from the server computing device, (Green Para 0041, For the item element, the item-context information retrieval module 204 can be configured to retrieve the item element's embedding, training data associated with the item element, and online data associated with the item element [obtaining, by the client computing device and from the server computing device]. Training data associated with the item element can include any information required by a particular training algorithm or process for updating the item element's embedding. As such, the particular training data required can vary based on what training algorithm or process is used by the embedding system (e.g., skip-gram negative sampling, *-space, etc.). Online data associated with the item element can include the number of training instances that have occurred for the item element, nearest neighbor information for the item element (e.g., item IDs and distances from the item element embedding for each nearest neighbor of the item element), and other information that may be of interest (e.g., the most recent previous embedding for the item element or multiple previous embeddings for the item element [for a slice of embeddings]), wherein the slice of embeddings comprises a subset of a larger vocabulary of embeddings (Green Para 0041, For the item element, the item-context information retrieval module 204 can be configured to retrieve the item element's embedding, training data associated with the item element, and online data associated with the item element.
Training data associated with the item element can include any information required by a particular training algorithm or process for updating the item element's embedding. As such, the particular training data required can vary based on what training algorithm or process is used by the embedding system (e.g., skip-gram negative sampling, *-space, etc.) [wherein the slice of embeddings comprises a subset of a larger vocabulary of embeddings associated with the machine-learned model]. Online data associated with the item element can include the number of training instances that have occurred for the item element, nearest neighbor information for the item element (e.g., item IDs and distances from the item element embedding for each nearest neighbor of the item element), and other information that may be of interest (e.g., the most recent previous embedding for the item element or multiple previous embeddings for the item element). (Examiner Note: The system only retrieves relevant embeddings, thus reducing communication cost)); wherein the slice of embeddings comprises embeddings that have been previously grouped together [based on at least one of a semantic similarity and a semantic association] (Para 0036, In certain embodiments, a training instance can be made up of an item-context pair. Training instances can be defined using the JOINKEY information and the embedding element information. For each JOINKEY, the training instance definition module 106 can group together predefined numbers of embedding elements that are identified by the embedding element information. For example, if the JOINKEY is a post ID, the embedding elements are user IDs (e.g., user IDs that have engaged with the post identified by the post ID), and the pre-determined number of embedding elements is set to five, groups of five user IDs associated with the JOINKEY can be grouped into a training instance.
In each training instance, one embedding element can be labeled as the item element and the remaining embedding elements can be labeled as context elements. For example, if the training instance definition module 106 receives user IDs in the order of engagement with the post ID, for each grouping of five user IDs, the middle user ID can be labeled as the item element, and the other four user IDs can be labeled as context elements. Para 0039, FIG. 2 illustrates an example embedding update module 202 configured to update embeddings based on training instances, according to an embodiment of the present disclosure. In some embodiments, the embedding update module 108 of FIG. 1 can be implemented as the embedding update module 202. As shown in the example of FIG. 2, the embedding update module 202 can include an item-context information retrieval module 204 [wherein the slice of embeddings comprises embeddings], a positive downsampling module 206, a negative sampling module 208, and a training module 210 (i.e. previously grouped together)), and wherein the slice of embeddings include one or more positive entity embeddings respectively associated with the one or more positive entities identified in the local dataset and one or more negative entity embeddings respectively associated with one or more negative entities, wherein a negative entity is an entity not identified in the local dataset, thereby forming a superset of embeddings that obfuscates which of the embeddings contained in the slice of embeddings correspond to positive entities identified within the local dataset (Green Para. 0040, The item-context information retrieval module 204 can be configured to retrieve information associated with a training instance. In certain embodiments, a particular computing machine in the embedding system can receive a training call, i.e., an API call including a training instance. 
However, information relevant to the training call may be distributed across multiple computing machines in the embedding system. The item-context information retrieval module 204 can be configured to retrieve information for the training instance from multiple computing machines in the embedding system. As discussed above, a training instance can include an item-context pair comprising an item element and one or more context elements. Para 0043 line 1-5, The positive downsampling module 206 can be configured to determine whether or not a training instance will be useful for updating embeddings, i.e., to make a usefulness determination for each training instance. Unhelpful training instances can be skipped, i.e., if a training instance is determined to be unhelpful, that training instance will not be used for updating embeddings [wherein the slice of embeddings include one or more positive entity embeddings respectively associated with the one or more positive entities identified in the local dataset]. (Examiner Note: The system will identify relevant positive embeddings that are locally stored) Para 0044 line 3-9, The negative sampling module 208 can be configured to select one or more negative samples for addition to the training instance. Negative sampling forces dissimilar entities from having dissimilar embeddings. Negative sampling prevents the tendency for all entities to converge to the same embedding. In certain embodiments, a local sample cache can be used for selecting negative samples to include in the training instance. The local sample cache can be a memory bounded rotating array. The local sample cache can include a read head and a write head that operate independently. The local sample cache can be read from nonlinearly to simulate sample distributions different from the training distribution. 
Para 0048, Once a set of negative samples is determined for the training instance, the negative sample module 208 can be configured to retrieve data associated with each negative sample. In certain embodiments, the negative sampling module 208 can retrieve the same information for each negative sample as was retrieved for each context element, i.e., the negative sample's embedding, and training data for each negative sample (determined by the training algorithm or process utilized by the embedding system) [and one or more negative entity embeddings respectively associated with one or more negative entities, wherein a negative entity is an entity not identified in the local dataset]. Para 0049 line 1-20, The training module 210 can be configured to update embeddings based on a training instance. At this point, the training instance includes an item element, one or more context elements, and one or more negative samples. Information has also been gathered for each element in the training instance, e.g., item element embedding, item training data, item online data, embeddings and training data for each context element, and embeddings and training data for each negative sample. Embeddings are updated based on the training instance and the information collected relating to the training instance and based on the training process or algorithm utilized by the embedding system. The end result is that for every element in the training instance, i.e., the item element, the context elements, and the negative samples, the embeddings have changed, and the training data associated with each element has changed.
A training instance counter associated with the item element (and/or each context element) can be incremented to keep an up-to-date count of the number of times in which the item element (or context element) has been included in a training instance [thereby forming a superset of embeddings that obfuscates which of the embeddings contained in the slice of embeddings correspond to positive entities identified within the local dataset]); [learning, by the client computing device] and based at least in part on the local dataset and the slice of embeddings, updated values for one or more of the embeddings contained in the slice of embeddings associated with the machine-learned model (Green Para 0049 line 1-16, The training module 210 can be configured to update embeddings based on a training instance. At this point, the training instance includes an item element, one or more context elements, and one or more negative samples. Information has also been gathered for each element in the training instance, e.g., item element embedding, item training data, item online data, embeddings and training data for each context element, and embeddings and training data for each negative sample. Embeddings are updated based on the training instance and the information collected relating to the training instance and based on the training process or algorithm utilized by the embedding system.
The end result is that for every element in the training instance, i.e., the item element, the context elements, and the negative samples, the embeddings have changed, and the training data associated with each element has changed [and based at least in part on the local dataset and the slice of embeddings, updated values for one or more of the embeddings contained in the slice of embeddings associated with the machine-learned model]); [and communicating, by the client computing device,] information descriptive of the updated values for the one or more of the embeddings contained in the slice of embeddings [to the server computing device] (Green para 0049 line 9-27, Embeddings are updated based on the training instance and the information collected relating to the training instance and based on the training process or algorithm utilized by the embedding system. The end result is that for every element in the training instance, i.e., the item element, the context elements, and the negative samples, the embeddings have changed, and the training data associated with each element has changed [information descriptive of the updated values for the one or more of the embeddings contained in the slice of embeddings]. A training instance counter associated with the item element (and/or each context element) can be incremented to keep an up-to-date count of the number of times in which the item element (or context element) has been included in a training instance. Additionally, any updated data can be written back to the various computing machines on the embedding system). Ji and Green are considered to be analogous to the claimed invention because they are in the same field of invention of using machine learning to learn from embeddings in multiple systems.
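Green names skip-gram negative sampling as one training algorithm its embedding system may use, and the update the rejection relies on (the item element pulled toward its context elements and pushed away from the negative samples) follows that pattern. A bare-bones version might look like the following; the dimensions and names are illustrative, not Green's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgns_update(emb, item, contexts, negatives, lr=0.05):
    # One skip-gram-negative-sampling step over a Green-style training
    # instance. `emb` maps entity id -> vector; the item is attracted to
    # its context embeddings and repelled from the sampled negatives.
    v = emb[item].copy()
    for c in contexts:
        g = lr * (1.0 - sigmoid(v @ emb[c]))  # attract positive pair
        v_old = v.copy()
        v += g * emb[c]
        emb[c] += g * v_old
    for n in negatives:
        g = lr * sigmoid(v @ emb[n])          # repel negative sample
        v_old = v.copy()
        v -= g * emb[n]
        emb[n] -= g * v_old
    emb[item] = v
```

After one step the item-context similarity rises and the item-negative similarity falls, which is the behavior Green attributes to negative sampling (preventing all entities from converging to the same embedding).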
Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Ji and incorporate the teachings of Green to disclose obtaining positive and negative embeddings associated with entities. Doing so trains by identifying helpful positive entities and uses negative entities to prevent entities from converging to the same embeddings (Green Para 0043 line 1-5, The positive downsampling module 206 can be configured to determine whether or not a training instance will be useful for updating embeddings, i.e., to make a usefulness determination for each training instance. Unhelpful training instances can be skipped, i.e., if a training instance is determined to be unhelpful, that training instance will not be used for updating embeddings, Para 0044 line 3-6, Negative sampling forces dissimilar entities from having dissimilar embeddings. Negative sampling prevents the tendency for all entities to converge to the same embedding). Tang teaches [wherein the slice of embeddings comprises embeddings that have been previously grouped together] based on at least one of a semantic similarity and a semantic association (Tang Page 500, 3.5 Modeling Lexical Level Information, We investigate lexical level information for enhancing sentiment embeddings in this part. We use two kinds of lexical level information, namely word-word associations [a semantic similarity] and word-sentiment associations. We develop two regularizers to naturally incorporate them into aforementioned sentiment, context and hybrid neural models. Page 500, 3.5.1 Integrating Word-Word Association Para. 1, We model word-word association in this part, holding the consideration that the words from same cluster should be as close with each other in the embedding space.
Page 501, 3.5.2 Integrating Word-Sentiment Association, Para 1 line 1-3, We integrate word-sentiment association by directly predicting the sentiment polarity of each word regarding the embedding values of each word as features. Page 501, 3.6.1 Datasets, Para. 6, In order to collect sentiment information of words, we use the aforementioned word clusters from Urban Dictionary to expand a small size of manually labeled sentiment seeds. Specifically, we manually label the top frequent 500 words from the vocabulary of sentiment embedding as positive, negative or neutral. After removing the ambiguous ones, we obtain 125 positive, 109 negative and 140 neutral words, which are regarded as the sentiment seeds. Afterwards, we use the similar words from Urban Dictionary to expand the sentiment seeds. We formulate this procedure as a k-nearest neighbors (KNN) classifier by regarding sentiment seeds as gold standard. We apply the KNN classifier to the items of Urban Dictionary word cluster, and predict a three-dimensional discrete vector (knn_pos; knn_neg; knn_neu) for each item. Each value reflects the hits numbers of sentiment seeds with different sentiment polarity in its similar words. For example, the vector value of "coooolll" is [10; 0; 0], which means that there are 10 positive seeds, 0 negative seeds and 0 neutral seeds among its similar words)). Ji, Green, and Tang are considered to be analogous to the claimed invention because they are in the same field of invention of using machine learning to learn embeddings. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Ji and Green to incorporate the teachings of Tang and use embeddings that are grouped by a semantic similarity and a semantic association.
Doing so captures the rich relational structure of the lexicon (Tang Page 1, 1 Introduction, Para 2 Line 1-8, A straightforward way is to represent each word as a one-hot vector, whose length is vocabulary size and only one dimension is 1, with all others being 0. However, one-hot word representation only encodes the indices of words in a vocabulary, but fails to capture rich relational structure of the lexicon. To solve this problem, many studies represent each word as a continuous, low-dimensional and real-valued vector, also known as word embeddings). Regarding claim 2, Ji in view of Green and Tang teach the computer-implemented method according to claim 1. Ji, Green, and Tang are combined in the same rationale as in claim 1 and analogous claims 17 and 18. Green teaches wherein obtaining, by the client computing device from the server computing device, the (Green Para 0040, The item-context information retrieval module 204 can be configured to retrieve information associated with a training instance. In certain embodiments, a particular computing machine in the embedding system can receive a training call, i.e., an API call including a training instance. However, information relevant to the training call may be distributed across multiple computing machines in the embedding system. The item-context information retrieval module 204 can be configured to retrieve information for the training instance from multiple computing machines in the embedding system. As discussed above, a training instance can include an item-context pair comprising an item element and one or more context elements. Para 0041, For the item element, the item-context information retrieval module 204 can be configured to retrieve the item element's embedding, training data associated with the item element, and online data associated with the item element [requesting and receiving, by the client computing device and from the server computing device].
Training data associated with the item element can include any information required by a particular training algorithm or process for updating the item element's embedding. As such, the particular training data required can vary based on what training algorithm or process is used by the embedding system (e.g., skip-gram negative sampling, *-space, etc.). Online data associated with the item element can include the number of training instances that have occurred for the item element, nearest neighbor information for the item element (e.g., item IDs and distances from the item element embedding for each nearest neighbor of the item element), and other information that may be of interest (e.g., the most recent previous embedding for the item element or multiple previous embeddings for the item element)). Regarding claim 3, Ji in view of Green and Tang teach the computer-implemented method according to claim 1. Ji, Green, and Tang are combined in the same rationale as in claim 1 and analogous claims 17 and 18. Green teaches wherein obtaining, by the client computing device from the server computing device, determining, by the client computing device, the slice of embeddings that include the one or more positive entities; and (Green Para 0043, The positive downsampling module 206 can be configured to determine [determining] whether or not a training instance will be useful for updating embeddings, i.e., to make a usefulness determination for each training instance. Unhelpful training instances can be skipped, i.e., if a training instance is determined to be unhelpful, that training instance will not be used for updating embeddings. For example, certain embedding elements that come up too frequently may not be useful in updating embeddings.
An example of this includes the terms "the" or "I" when embedding language [by the client computing device, the slice of embeddings that include the one or more positive entities]); requesting, by the client computing device and from the server computing device, the slice of embeddings (Green Para 0039, FIG. 2 illustrates an example embedding update module 202 configured to update embeddings based on training instances, according to an embodiment of the present disclosure. In some embodiments, the embedding update module 108 of FIG. 1 can be implemented as the embedding update module 202. As shown in the example of FIG. 2, the embedding update module 202 can include an item-context information retrieval module 204, a positive downsampling module 206, a negative sampling module 208, and a training module 210. Para 0040 line 1-8, The item-context information retrieval module 204 can be configured to retrieve information associated with a training instance. In certain embodiments, a particular computing machine in the embedding system can receive a training call, i.e., an API call including a training instance. However, information relevant to the training call may be distributed across multiple computing machines in the embedding system [requesting, by the client computing device and from the server computing device]. Para 0049, The training module 210 can be configured to update embeddings based on a training instance. At this point, the training instance includes an item element [slice], one or more context elements, and one or more negative samples). Regarding claim 7, Ji in view of Green and Tang teach the computer-implemented method according to claim 1. Ji, Green, and Tang are combined in the same rationale as in claim 1 and analogous claims 17 and 18. Ji and Green do not explicitly teach wherein the one or more negative entities are identified by a similarity search relative to the one or more positive entities.
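The similarity search recited in claim 7 (negatives chosen by their closeness to the positive entities in embedding space) is, in its simplest form, a cosine-nearest-neighbor query over the embedding table. A minimal sketch under that reading, with illustrative names and data:

```python
import numpy as np

def cosine_neighbors(query_vec, emb, k=5):
    # Return the k entity ids whose embeddings are most cosine-similar
    # to the query vector; negatives for a positive entity could then be
    # drawn from (or near) this neighborhood.
    ids = list(emb)
    M = np.stack([emb[i] for i in ids])
    q = query_vec / np.linalg.norm(query_vec)
    sims = (M / np.linalg.norm(M, axis=1, keepdims=True)) @ q
    return [ids[i] for i in np.argsort(-sims)[:k]]
```

This mirrors the cosine-based neighboring-word queries Tang describes in the passage cited next.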
Tang teaches wherein the one or more negative entities are determined by a similarity search relative to the one or more positive entities. (Tang Page 502 4 Word Level Sentiment Analysis 4.1 Querying Sentiment Words A better sentiment embedding should have the ability to map positive words into close vectors, to map negative words into close vectors, and to separate positive words and negative words apart. Accordingly, in the vector space of sentiment embedding, the neighboring words of a positive word like “good” should be dominated by positive words like “cool”, “awesome”, “great”, etc. [the one or more positive entities], and a negative word like “bad” should be surrounded by negative words like “terrible” and “nasty” [the one or more negative]. Based on this consideration, we query neighboring sentiment words in existing sentiment lexicon to investigate whether sentiment embeddings are helpful in discovering similarities between sentiment words [determined by a similarity]. Specifically, given a sentiment word as input, we first find out the top Nw closest words in the sentiment lexicon. The closeness of two words is measured by the similarity (e.g. cosine) between their word embeddings. Afterwards, we calculate how much percentage of those neighboring words have the same sentiment polarity with the target sentiment word. (See Fig. 5)).

Regarding claim 8, Ji in view of Green and Tang teach the computer-implemented method according to claim 1. Ji teaches wherein the client computing device is associated with a particular user; and the method further comprises: learning, by the client computing device and based at least in part on the local dataset, a user embedding associated with the particular user of the client computing device; and storing, by the client computing device, the user embedding locally at the client computing device (Ji Page 3, A. Preliminaries of Federated Learning, Private Model Update.
Each online selected client receives the server model and performs secure local training on their own devices [the client computing device is associated with a particular user]. For the neural language modeling, stochastic gradient descent is performed to update their GRU-based client models which is introduced in Section III-C [learning]. After several epochs of training, the clients send the parameters of their models to the central server over a secure connection. During this local training, user data can be stored on their own devices [storing, by the client computing device, the user embedding locally at the client computing device]. Page 5. B. Baselines and Settings Para 2. The small model uses 300 dimensional word embedding and hidden state of RNN unit. We deploy models of three scales: small, medium and large with word embedding dimensions of 300, 650 and 1500, respectively. Tied embedding is applied to reduce the size of the model and decrease the communication cost [i.e. during training each client computing device is learning from the user embeddings].).

Regarding claim 9, Ji in view of Green and Tang teach the computer-implemented method according to claim 8. Ji teaches wherein the user embedding is learned jointly with the updated values for the one or more of the slices of embeddings (Ji Page 3. A. Preliminaries of Federated Learning Private Model Update Para. 2 and 3. Central Model Update. The server firstly chooses a client learning model and initializes the parameters of the client learner. It sets the fraction of the clients. Then, it waits for online clients for local model training. Once the selected number of clients finishes the model update, it receives the updated parameters and performs the server optimization. The parameter sending and receiving consists of one round of communication. Our proposed optimization is conducted in Line 9 of Algorithm 1. Private Model Update.
Each online selected client receives the server model and performs secure local training on their own. For the neural language modeling, stochastic gradient descent is performed to update their GRU-based client models which is introduced in Section III-C. After several epochs of training [learned jointly with the updated values for the one or more of the slices of embeddings], the clients send the parameters of their models to the central server over a secure connection. During this local training, user data can be stored on their own devices. Page 5. B. Baselines and Settings Para 2 line 3-6. The GRU based client model firstly takes texts as input, then embeds them into word vectors and feeds them to the GRU network. The last fully connected layer takes the output of GRU as input to predict the next word. The small model uses 300 dimensional word embedding and hidden state of RNN unit [the user embedding]).

Regarding claim 11, Ji in view of Green and Tang teach the computer-implemented method according to claim 8. Ji, Green, and Tang are combined in the same rationale as in claim 1 and analogous claims 17 and 18. Green teaches wherein learning, by the client computing device and based at least in part on the local dataset, the user embedding associated with the particular user of the client computing device comprises learning, by the client computing device and based at least in part on the local dataset, updated parameter values for a machine-learned user embedding model configured to generate the user embedding based on data descriptive of the user (Green Para. 0029, In certain embodiments, the online distributed embedded services and systems disclosed herein can be used to create embeddings [configured to generate] based on social networking system data.
For example, in various embodiments, embeddings can be created for users of the social networking system based on user post engagement information indicative of engagements by users with content posts on the social networking system [user embedding based on data descriptive of the user]).

Regarding claim 12, Ji in view of Green and Tang teach the computer-implemented method according to claim 11. Ji teaches further comprising: communicating, by the client computing device, information descriptive of the updated parameter values for the machine-learned user embedding model to the server computing device without communicating the user embedding (Ji Page 3. A. Preliminaries of Federated Learning Para. 3 Private Model Update. Each online selected client receives the server model and performs secure local training on their own devices. For the neural language modeling, stochastic gradient descent is performed to update their GRU-based client models which is introduced in Section III-C. After several epochs of training, the clients send the parameters of their models to the central server over a secure connection [communicating, by the client computing device, information descriptive of the updated parameter values for the machine-learned user embedding model to the server computing device]. During this local training, user data can be stored on their own devices. Page 3, B. Attentive Federated Aggregation, The most important part of federated learning is the federated optimization on the server side which aggregates the client models. In this paper, a novel federated optimization strategy is proposed to learn federated learning from decentralized client models. We call this Attentive Federated Aggregation, or FedAtt for short. It firstly introduces the attention mechanism for federated aggregation by aggregating the layer-wise contribution of neural language models of selected clients to the global model in the central server.
An illustration of our proposed layer-wise attentive federated aggregation is shown in Figure 1 where the lower box represents the distributed client models and the upper box represents the attentive aggregation in the central server. The distributed client models in the lower box contain several neural layers. The notations of “⊕” and “⊖” stand for the layer-wise operation on the parameters of neural models. This illustration shows only a single time step. The federated updating uses our proposed attentive aggregation block to update the global model by iteration. (See Page 3, Algorithm 2 Secure Local Training on Client, Line 10) [without communicating the user embedding].).

Regarding claim 13, Ji in view of Green and Tang teach the computer-implemented method according to claim 8. Ji teaches further comprising: performing, by the client computing device, on-device inference on the client computing device using the user embedding (Page 3. Private Model Update. Each online selected client receives the server model and performs secure local training on their own devices. For the neural language modeling, stochastic gradient descent is performed to update their GRU-based client models which is introduced in Section III-C. After several epochs of training, the clients send the parameters of their models to the central server over a secure connection. During this local training, user data can be stored on their own devices [on-device inference on the client computing device]. Page 5. B. Baselines and Settings Para 2 line 3-6. The GRU based client model firstly takes texts as input, then embeds them into word vectors and feeds them to the GRU network. The last fully connected layer takes the output of GRU as input to predict the next word. The small model uses 300 dimensional word embedding and hidden state of RNN unit [using the user embedding]).
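The layer-wise attentive aggregation (FedAtt) that the rejection cites from Ji can be sketched roughly as follows. This is an illustrative NumPy reconstruction, not code from Ji; the step size value, the Frobenius norm, and the softmax-over-distances weighting are assumptions made for the sketch.

```python
import numpy as np

def fedatt_aggregate(server_layers, client_layers_list, epsilon=1.2):
    """Layer-wise attentive aggregation in the spirit of Ji's FedAtt.

    server_layers:      list of np.ndarray, the global model's layers.
    client_layers_list: one list of layer arrays per selected client.
    epsilon:            server step size (value chosen for illustration).
    """
    new_layers = []
    for l, w_server in enumerate(server_layers):
        # L(.,.) from the objective: distance between server and client layer l.
        dists = np.array([np.linalg.norm(w_server - client[l])
                          for client in client_layers_list])
        # Self-adaptive attention weights: softmax over the layer-wise distances.
        exp = np.exp(dists - dists.max())  # numerically stabilized softmax
        alpha = exp / exp.sum()
        # Move the global layer toward the attention-weighted client layers.
        delta = sum(a * (w_server - client[l])
                    for a, client in zip(alpha, client_layers_list))
        new_layers.append(w_server - epsilon * delta)
    return new_layers
```

Each round, the server pulls its parameters toward the selected clients, with per-layer weights that adapt to how far each client model sits from the global model in parameter space.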
Regarding claim 14, Ji in view of Green and Tang teach the computer-implemented method according to claim 1. Ji teaches wherein learning, by the client computing device and based at least in part on the local dataset, updated values for the one or more of the applying, by the client computing device, a learning rate to one or more of the updated values, the learning rate being inversely correlated to a frequency with which the one or more of the updated values is updated (Ji Page 3, Algorithm 2 [applying, by the client computing device, a learning rate to one or more of the updated values,] Page 6. E. Communication Cost Para 1 line 9-11 and Para 2. Our method accelerates the training through the optimization of the global server as it can converge more quickly than its counterparts. To compare the efficiency of communication, we take the communication rounds during training as the evaluation metric in this subsection. Three factors are considered, i.e., the client fraction, epochs and batch size of client training. The results are shown in Figure 3 where the small-scaled language model is used as the client model and 10% of clients are selected for model aggregation. We set the testing perplexity for the termination of federated training to be 90. When the testing perplexity is lower than that threshold, federated training comes to an end and we take the rounds of training as the communication rounds. As shown in Figure 3(a), the communication round during training fluctuates when the number of clients increases. Furthermore, our proposed method is always better than FedAvg with less communication cost. When the client fraction C chosen is 0.2 and 0.4, our proposed method saves a half of communication rounds. Then, we evaluate the effect of the local computation of clients on the communication rounds. We take the local training epochs to be 1, 5, 10, 15, and 20 and the local batch size to be from 10 to 50.
We proposed FedAtt to achieve a comparable communication cost in the comparison of different values of epoch and the batch size of local training, as shown in Figure 3(b) and Figure 3(c) respectively. (i.e. clients adapt by running different numbers of epochs and batch sizes to increase the learning rate)).

Regarding claim 15, Ji in view of Green and Tang teach the computer-implemented method according to claim 1. Ji teaches wherein learning, by the client computing device and based at least in part on the local dataset and the slice of embeddings, updated values for the one or more of the slice of embeddings (Ji Page 3, III. PROPOSED METHOD, A. Preliminaries of Federated Learning, Para 3, Private Model Update. Each online selected client receives the server model and performs secure local training on their own devices. For the neural language modeling, stochastic gradient descent is performed to update their GRU-based client models which is introduced in Section III-C. After several epochs of training, the clients send the parameters of their models to the central server over a secure connection. During this local training, user data can be stored on their own devices. Ji Page 4. III. PROPOSED METHOD C. GRU-based client Model, The learning process on the client side is model-agnostic. For different tasks, we can choose appropriate models in specific situations. In this paper, we use the gated recurrent unit (GRU) [8] for learning the language modeling on the client side. The GRU is a well-known and simpler variant of the Long Short-Term Memory (LSTM) [7], by merging the forget gate and the input gate into a single gate as well as the cell state and the hidden state.
In the GRU-based neural language model, words or tokens are firstly embedded into word vectors denoted as X = {x0, x1, . . . , xt, . . . } and then put into the recurrent loops [updating, by the client computing device, one or more weights respectively associated with the updated values.].).

Regarding claim 16, Ji in view of Green and Tang teach the computer-implemented method according to claim 1. Ji teaches further comprising: communicating, by the client computing device, the one or more weights respectively associated with the updated values to the server computing device (Ji Page 3-4 III. Proposed Method B. Attentive Federated Aggregation Para 2, The intuition behind the federated optimization is to find an optimal global model that can generalize the client models well. In our proposed optimization algorithm, we take it as finding an optimal global model that is close to the client models in parameter space while considering the importance of selected client models during aggregation. The optimization objective is defined as (See Function 1) where θ^t is the parameters of the global server model [determining, by the server computing device] at time stamp t, θ^{t+1}_k is the parameters of the k-th client model at time stamp t+1, L(·, ·) is defined as the distance between two sets of neural parameters, and α_k is the attentive weight to measure the importance of weights for the client models. The objective is to minimize the weighted distance between server model and client models by taking a set of self-adaptive scores as the weights [communicating, by the client computing device, the one or more weights respectively associated with the updated values to the server computing device]).

Regarding claim 21, Ji in view of Green and Tang teach the computer-implemented method according to claim 1. Ji teaches wherein the embeddings are provided as inputs to a machine-learned model (Ji Page 5 B.
Baselines and Settings, para 2 line 3-11, The GRU based client model firstly takes texts as input, then embeds them into word vectors and feeds them to the GRU network. The last fully connected layer takes the output of GRU as input to predict the next word. The small model uses 300 dimensional word embedding and hidden state of RNN unit. We deploy models of three scales: small, medium and large with word embedding dimensions of 300, 650 and 1500, respectively [embeddings are provided as inputs to a machine-learned model]. Tied embedding is applied to reduce the size of the model and decrease the communication cost.).

Regarding claim 22, Ji in view of Green and Tang teach the computer-implemented method according to claim 1. Ji, Green, and Tang are combined in the same rationale as in claim 1 and analogous claims 17 and 18. Green teaches wherein the positive entity is an entity associated with at least one user interaction (Green para 0055, The presently disclosed systems and methods provide numerous advantages over conventional systems and methods. For example, in the context of a social networking system, the presently disclosed inventions may allow for embedding of pages based on page visits [positive entity]. Users of social networking system may have fanned billions of pages on the social networking system. Collection of this data, sequencing, analyzing, and training based on this data could take weeks to months using conventional systems. However, using the currently disclosed systems and methods, page embeddings could be updated incrementally using training instances as described above, and page embeddings would always be online and available [associated with at least one user interaction])

Claims 5, 6, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Ji in view of Green and Tang and further in view of Chang et al., Content-aware hierarchical point-of-interest embedding model for successive POI recommendation.
In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI'18). AAAI Press, 3301–3307 (“Chang”).

Regarding claim 5, Ji in view of Green and Tang teach the computer-implemented method according to claim 1 and analogous claim 17. Ji, Green, and Tang are combined in the same rationale as in claim 1 and analogous claims 17 and 18. Ji does not explicitly teach the one or more slices of the vocabulary of embeddings comprise one or more geographic slices of a vocabulary of location embeddings; and each geographic slice includes a respective subset of embeddings that correspond to locations within a common geographic area. However, Chang teaches the slice of embeddings comprises one of a plurality of the vocabulary of location embeddings that correspond to locations within a common geographic area (Chang Page 3303, Data Description and Analysis, 3.1 Data Description Para 1. line 9-20, Therefore, we constructed a new dataset that contains text contents which refer to POIs. We collected data from Instagram, which is one of the most popular mobile-based social networks. Instagram data includes not only user POI check-in information, but also text content written by users. We collected Instagram data created in New York City and preprocessed the collected data utilizing the same method of Zhao et al. [Zhao et al., 2017] [each geographic slice includes a respective subset of the vocabulary of location embeddings that correspond to locations within a common geographic area]. We removed the POIs with less than five checked-ins and the users who had less than ten posts. After preprocessing, our new dataset1 includes 2,216,631 check-ins at 13,187 POIs of 78,233 users. The statistics of our dataset are summarized in Table 1. Page 3303, 3 Data Description and Analysis 3.2 Empirical Analysis Textual Influence on POIs Line 1-11, We qualitatively analyze the relationship between words and POIs in the dataset.
We measure the correlation between the words by calculating the Jaccard similarity of POIs in their text content. We first remove stop-words and construct a set of POIs for each word. Only POIs whose text content mentions a word more than five times are included in the POI set of the word. For example, when word w1 is used in the text content of POIs l1, l2, and l3, the POI set of word w1 is {l1, l2, l3}. We make the POI sets for the top 2000 words sorted by frequency, and calculate the average Jaccard similarity value for all the pairs of words (Top 2000 Words [slice of embeddings comprises one of a plurality of ]).). Ji, Green, and Chang are considered to be analogous to the claimed invention because they are in the same field of invention of learning from embeddings. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Ji and Green to incorporate the teachings of Chang and use location embeddings associated with a geographic location. Doing so would help to understand user behavior and preferences (Chang Page 3301 1 Introduction Para 1 and 2, In recent years, many mobile-based social media platforms such as Twitter, Facebook, Instagram, and Foursquare have grown in popularity. Users on these platforms generate a large amount of data which includes text content with temporal and spatial information. Such information is particularly useful for understanding user behavior and preferences for a point-of-interest (POI) which is a specific location that someone finds interesting. In mobile-based applications, it is necessary but challenging to recommend where a user will visit next based on temporal and spatial information.).

Regarding claim 6, Ji in view of Green, Tang, and Chang teach the computer-implemented method according to claim 5. Ji and Green do not explicitly teach wherein the one or more slices correspond to one or more map tiles covering respective portions of a map of the Earth.
However, Chang teaches wherein the one or more slices correspond to one or more map tiles covering respective portions of a map of the Earth (Chang Page 3303, Data Description and Analysis, 3.1 Data Description Para 1. line 9-20, Therefore, we constructed a new dataset that contains text contents which refer to POIs. We collected data from Instagram, which is one of the most popular mobile-based social networks. Instagram data includes not only user POI check-in information, but also text content written by users. We collected Instagram data created in New York City and preprocessed the collected data utilizing the same method of Zhao et al. [Zhao et al., 2017] [one or more map tiles covering respective portions of a map of the Earth]. We removed the POIs with less than five checked-ins and the users who had less than ten posts. After preprocessing, our new dataset1 includes 2,216,631 check-ins at 13,187 POIs of 78,233 users. The statistics of our dataset are summarized in Table 1.) The motivation utilized in the combination of claim 5 equally applies to claim 6.

Regarding claim 19, Ji in view of Green and Tang teach the computer-implemented method according to claim 18. Ji, Green, and Tang are combined in the same rationale as in claim 1 and analogous claim 17. Ji does not explicitly teach the respective slices of the vocabulary of embeddings comprise respective geographic slices of the vocabulary embeddings that correspond to respective sets of embeddings associated with entities included in respective geographic areas. However, Chang teaches wherein the a shared geographic area [[s]] (Chang Page 3303, Data Description and Analysis, 3.1 Data Description Para 1. line 9-20, Therefore, we constructed a new dataset that contains text contents which refer to POIs. We collected data from Instagram, which is one of the most popular mobile-based social networks. Instagram data includes not only user POI check-in information, but also text content written by users.
We collected Instagram data created in New York City and preprocessed the collected data utilizing the same method of Zhao et al. [Zhao et al., 2017] [geographic slice[[s]] of the vocabulary of embeddings]. We removed the POIs with less than five checked-ins and the users who had less than ten posts. After preprocessing, our new dataset1 includes 2,216,631 check-ins at 13,187 POIs of 78,233 users. The statistics of our dataset are summarized in Table 1. Page 3303, 3 Data Description and Analysis 3.2 Empirical Analysis Textual Influence on POIs Line 1-11, We qualitatively analyze the relationship between words and POIs in the dataset. We measure the correlation between the words by calculating the Jaccard similarity of POIs in their text content. We first remove stop-words and construct a set of POIs for each word. Only POIs whose text content mentions a word more than five times are included in the POI set of the word. For example, when word w1 is used in the text content of POIs l1, l2, and l3, the POI set of word w1 is {l1, l2, l3}. We make the POI sets for the top 2000 words sorted by frequency, and calculate the average Jaccard similarity value for all the pairs of words (Top 2000 Words) [that corresponds to a shared geographic area [[s]]]). The motivation utilized in the combination of claim 5 equally applies to claim 19.

Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Ji in view of Green and Tang and further in view of Mandt et al. (US20180157644A1) (“Mandt”).

Regarding claim 10, Ji in view of Green and Tang teach the computer-implemented method according to claim 8. Ji, Green, and Tang are combined in the same rationale as in claim 8. Ji and Green do not explicitly teach wherein the user embedding comprises a context matrix within the machine-learned model.
However, Mandt teaches wherein the user embedding comprises a context matrix within the machine-learned model (Mandt Para 0025, The dynamic embedding model 104 [machine-learned model] is composed of a list of the L most frequent words, called the vocabulary, and a plurality of pairs of word embedding matrices 105 1-T and context embedding matrices 106 1-T. Each word embedding matrix contains a plurality of word embedding vectors, one word embedding vector for each word in the vocabulary. Similarly, each context embedding matrix [a context matrix] contains a plurality of context embedding vectors, one context embedding vector for each word in the vocabulary.). Ji and Mandt are considered to be analogous to the claimed invention because they are in the same field of invention of learning from embeddings. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Ji to incorporate the teachings of Mandt and incorporate a context matrix with the machine learning model. Doing so would allow the model to determine how the meaning of words changes over time. (Mandt Para 0025, Each word embedding matrix 105 1-T and context embedding matrix 106 1-T pair is connected to allow the dynamic embedding model 104 determine how the meaning of words change over time.).

Pertinent Prior Art

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Appu et al. (US20160098844A1) – teaches positive and negative images and finding the embedding space.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to ALFREDO CAMPOS whose telephone number is (571)272-4504. The examiner can normally be reached 7:00 - 4:00 pm M - F. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool.
To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael J. Huntley, can be reached at (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/ALFREDO CAMPOS/
Examiner, Art Unit 2129

/MICHAEL J HUNTLEY/
Supervisory Patent Examiner, Art Unit 2129
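The similarity search that the rejection maps to claim 7 through Tang is, at bottom, a cosine-similarity neighbor query over an embedding table. A minimal sketch follows; the vocabulary and the embedding values are invented for illustration and are not taken from Tang.

```python
import numpy as np

# Toy embedding table; words and vectors are invented for illustration.
vocab = {
    "good":  np.array([0.9, 0.1]),
    "great": np.array([0.8, 0.2]),
    "bad":   np.array([-0.7, 0.1]),
    "nasty": np.array([-0.8, 0.2]),
}

def nearest_words(query, vocab, top_n=2):
    """Return the top_n words closest to `query` by cosine similarity,
    in the spirit of Tang's sentiment-word neighbor query."""
    q = vocab[query]
    sims = {}
    for word, vec in vocab.items():
        if word == query:
            continue  # a word is trivially closest to itself
        sims[word] = q @ vec / (np.linalg.norm(q) * np.linalg.norm(vec))
    # Rank candidates by descending cosine similarity.
    return sorted(sims, key=sims.get, reverse=True)[:top_n]
```

With these toy vectors, the neighbors of a positive word are dominated by positive words and likewise for negative words, which is the behavior Tang reports for sentiment embeddings.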

Prosecution Timeline

Apr 21, 2022
Application Filed
May 29, 2025
Non-Final Rejection — §103
Sep 10, 2025
Response Filed
Oct 15, 2025
Final Rejection — §103
Dec 01, 2025
Interview Requested
Dec 09, 2025
Examiner Interview Summary
Dec 09, 2025
Applicant Interview (Telephonic)
Jan 23, 2026
Request for Continued Examination
Jan 28, 2026
Response after Non-Final Action
Feb 20, 2026
Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12561407
ONE-PASS APPROACH TO AUTOMATED TIMESERIES FORECASTING
2y 5m to grant Granted Feb 24, 2026
Patent 12561559
Neural Network Training Method and Apparatus, Electronic Device, Medium and Program Product
2y 5m to grant Granted Feb 24, 2026
Patent 12554973
HIERARCHICAL DATA LABELING FOR MACHINE LEARNING USING SEMI-SUPERVISED MULTI-LEVEL LABELING FRAMEWORK
2y 5m to grant Granted Feb 17, 2026
Patent 12536260
SYSTEM, APPARATUS, AND METHOD FOR AUTOMATICALLY GENERATING NEGATIVE KEYSTROKE EXAMPLES AND TRAINING USER IDENTIFICATION MODELS BASED ON KEYSTROKE DYNAMICS
2y 5m to grant Granted Jan 27, 2026
Study what changed to get past this examiner. Based on 4 most recent grants.


Prosecution Projections

3-4
Expected OA Rounds
83%
Grant Probability
99%
With Interview (+33.3%)
3y 9m
Median Time to Grant
High
PTA Risk
Based on 6 resolved cases by this examiner. Grant probability derived from career allow rate.
