Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Remarks
This Office Action is in response to applicant’s amendment filed on November 6, 2025. Claims 1-20 are pending and under consideration.
Response to Arguments
Applicant’s amendments have overcome the previous § 112(b) rejection. Therefore, the previous grounds of rejection for § 112(b) have been withdrawn. However, new issues pertaining to § 112(b) are raised in the current action.
Applicant’s amendments have overcome the previous § 103 rejections. However, upon further consideration, the claims have been given a new ground of rejection under § 103 based on a newly applied reference. Applicant’s arguments are moot under the new ground of rejection.
Applicant’s arguments directed to the § 101 rejection have been fully considered but are not deemed to be persuasive. Applicant argues:
Based on the determination result, "switching between two training data compressors [is possible] according to the possibility of entering an overfit regime." ([0026] of the specification as originally filed.) Thereby, "communication bandwidth consumed, by sending minimal information from the nodes to a central node [is reduced], while also maintaining a good generalization of model instances at each of a plurality of nodes by a smart switching of data compressor type after automatic detection of possible overfitting of a respective model instance at one or more nodes." ([0017].) Thus, the alleged abstract idea, if there is any, is believed to be incorporated into a practical application. Therefore, Applicant respectfully submits that independent claims 1 and 11 are patent eligible.
Applicant’s response, page 10.
This argument is not persuasive because the claim does not recite or reflect the features discussed in applicant’s response. For example, the claim does not recite “switching between two training data compressors,” since there is no aspect or context of performing training of any model in the claim. Instead, the claim merely recites the transmission of information to a central node, and this information has no specific application with regard to any training process.
Furthermore, although the claim does recite compression, the type of compression that is being described here encompasses operations that can be performed as a mental process. For example, “compressing” a numerical vector by reducing its values to +/- signs can be performed by a human. Similarly, discarding elements in a vector can also be performed by a human. Therefore, the compression itself is not a practical application distinct from a mental process, in the absence of further limitations that define the training process.
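Solely for purposes of illustration, the following Python sketch shows the elementary character of the operations discussed above, namely reducing values to their signs and discarding elements. The values and the discard rule are hypothetical examples chosen by the Examiner and are not a characterization of applicant’s disclosure.

    # Illustrative only: elementary operations of the type discussed above.
    gradients = [0.42, -1.7, 0.03, -0.88, 1.2]        # hypothetical values

    # "Sign compression": keep only the sign of each value.
    signs = [1 if g >= 0 else -1 for g in gradients]   # -> [1, -1, 1, -1, 1]

    # "Discarding elements": keep only some of the entries.
    kept = signs[:3]                                   # -> [1, -1, 1]

    print(signs, kept)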
Therefore, the claims remain rejected under § 101 for substantially the same reasons as given in the previous Office Action. To advance prosecution in view of the § 101 rejection, the Examiner suggests reciting details pertaining to the training of the model.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 1-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
In claim 1, it is unclear whether the following two steps of the claim beginning with “in a case where it is determined” are (1) alternative contingent limitations where only one of the two steps needs to be performed such that the other step is an optional limitation; (2) steps that are both required to be performed in order to satisfy the method of the claim; or (3) a statement of system capability, wherein the system is capable of performing both steps but only one step is performed per determination. The claim is unclear as to which of these three interpretations is the proper interpretation, for the reasons further discussed below.
In more detail, the terminology of “in a case where it is determined” is not explicit contingent limitation terminology (i.e., “if x, then y”), but the context of the conditions (i.e., “is not overfitting” and “is overfitting”) could be interpreted as contingent limitations. Therefore, the claim could potentially be interpreted as reciting alternative contingent limitations. Note that in a method claim, contingent limitations that are not required to be performed do not need to be met by the prior art (see MPEP § 2111.04(II)). However, a different interpretation is also possible in view of the phrase “to generate a second vector, which includes the first vector.” This limitation suggests that the second step is performed after the first step, since the second vector is defined as including the first vector, which is generated in the first step. Thus, in a potential plain reading of the claim, the second vector requires the existence of the first vector, resulting in an interpretation that both steps are performed (i.e., interpretation (2) listed above). Yet another possible interpretation (interpretation (3)) is that the two “in a case where it is determined” steps are merely capabilities of a system in performing different steps under different conditions (in a manner akin to a system claim as discussed in MPEP § 2111.04(II)). Even though the claim language of claim 1 does not explicitly define those two steps as system capabilities, it could be interpreted as such under the context of “in an edge node…performing operations comprising.” Since the claim is unclear as to which of these three interpretations is intended, the claim is considered to be indefinite.
For purposes of examination, claim 1 has been interpreted as requiring system capabilities of performing both the “in a case where it is determined” steps, and the phrase “to generate a second vector, which includes the first vector” is treated as referring to a hypothetical “first vector” that does not need to exist. Note that this interpretation would also account, in terms of application of prior art in the § 103 rejections, for the broader interpretation of the two “in a case where it is determined” steps being alternative contingent method steps.
To resolve this issue, the examiner suggests amending claim 1 to recite:
A method, comprising:
in an edge node, of a group of edge nodes that are each operable to communicate with a central node, performing operations comprising:
generating a vector that includes gradients associated with a model instance, of a central model, that is operable to run at the edge node; and
determining whether the model instance is overfitting to data generated at the edge node;
wherein the edge node is configured to:
in a case where it is determined that the model instance is not overfitting to the data, perform sign compression on the vector to generate a first vector when overfitting is not indicated and transmit the first vector, after compression, to the central node that includes the central model; and
in a case where it is determined that the model instance is overfitting to the data, perform percentage sign compression on the vector to generate a second vector, which includes a portion of the gradients, and transmit the second vector, after compression, to the central node that includes the central model.
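The following Python sketch is provided solely to illustrate the two-branch structure reflected in the suggested language; the function name, the keep_fraction parameter, and the random-removal rule are hypothetical particulars chosen by the Examiner and are not asserted to correspond to applicant’s disclosure or to any cited reference.

    import random

    def compress_and_select(gradients, is_overfitting, keep_fraction=0.5):
        # Sign compression: reduce each gradient to its sign.
        signs = [1.0 if g >= 0 else -1.0 for g in gradients]
        if not is_overfitting:
            return signs                                  # corresponds to the "first vector"
        # Percentage sign compression: keep only a fraction of the signs,
        # chosen at random (hypothetical removal rule).
        k = max(1, int(len(signs) * keep_fraction))
        kept = sorted(random.sample(range(len(signs)), k))
        return [signs[i] for i in kept]                   # corresponds to the "second vector"

    print(compress_and_select([0.4, -1.2, 0.1, -0.3], is_overfitting=True))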
Claim 11 is also rejected as being indefinite, in part for the reasons discussed above. Specifically, it is unclear whether the two steps of the claim beginning with “in a case where it is determined” are both required to be performed, due to the phrase “to generate a second vector, which includes the first vector,” which is confusing as to whether generation of the second vector requires performing the previous step to generate the first vector. For purposes of examination, the phrase “to generate a second vector, which includes the first vector” is treated as referring to a hypothetical “first vector” that does not need to exist. To resolve this issue, the examiner suggests deleting “the first vector” in claim 11.
The remaining dependent claims 2-10 and 12-20 are rejected due to their dependency on claim 1 or 11. The Examiner notes that the above suggested amendment to claims 1 and 11 may also necessitate revision to the language of claims 3 and 13.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-20 are rejected under 35 U.S.C. § 101 because the claimed invention is directed to an abstract idea without significantly more.
Independent Claims
Step 2A Prong One: Does the claim recite an abstract idea, law of nature, or natural phenomenon?
Yes, independent claim 1, for example, recites an abstract idea in the form of a mental process. A mental process is a process that “can be performed in the human mind, or by a human using a pen and paper” (MPEP § 2106.04(a)(2)(III), paragraph 1). Examples of mental processes include “observations, evaluations, judgments, and opinions” (MPEP § 2106.04(a)(2)(III), paragraph 2).
The following limitations of claim 1 are considered to be mental processes:
“generating a vector that includes gradients associated with […];” [This is a mental process that can be performed by observation, evaluation, judgment, and opinion. Although this limitation recites “gradients,” it does not recite an actual step of computing those gradients, much less using a computer-implemented process such as model training. Instead, this limitation merely recites generating a vector that includes gradients, which encompasses a mental process of determining a vector of certain numerical values. Furthermore, the claim does not define the “vector” to have any specific size or complexity that precludes the generation of it from being a mental process, particularly when pen and paper are available.]
“determining whether the model instance is overfitting to data generated at the edge node” [This is a mental process that can be performed by observation, evaluation, judgment, and opinion. The step of “determining” is recited at a high degree of generality, without reference to any specific technical methodology that is precluded from being a mental process.]
“in a case where it is determined that the model instance is not overfitting to the data, performing sign compression on the vector to generate a first vector when overfitting is not indicated” [This is a mental process that can be performed by observation, evaluation, judgment, and opinion. Although the claim recites “compression,” the compression in this context merely requires simple operations such as observing the sign of a numerical value, or discarding values from the vector. As such, these operations encompass what can be performed as a mental process, particularly with the use of a physical aid.]
“in a case where it is determined that the model instance is overfitting to the data, performing percentage sign compression on the vector to generate a second vector, which includes the first vector and a portion of the gradients” [This is a mental process that can be performed by observation, evaluation, judgment, and opinion. Although the claim recites “compression,” the compression in this context merely requires simple operations such as observing the sign of a numerical value, or discarding values from the vector. As such, these operations encompass what can be performed as a mental process, particularly with the use of a physical aid.]
The above analysis also applies to the other independent claim, claim 11, which recites features that are the same as, or analogous to, those discussed above.
Step 2A Prong Two: Does the claim recite additional elements that integrate the judicial exception into a practical application?
Independent claims 1 and 11 recite the following additional elements, but these additional elements are not sufficient to integrate the judicial exception into a practical application, as analyzed below:
“in an edge node, of a group of edge nodes that are each operable to communicate with a central node, performing operations comprising” and the related limitation of “a model instance, of a central model, that is operable to run at the edge node” (claims 1 and 11) [These elements do no more than generally link the use of a judicial exception to a particular technological environment or field of use (MPEP § 2106.05(h)), namely the technological environment of a distributed learning system in which there are edge and central nodes that manage model instances and a central model. Note that the claim recites “associated with a model instance,” where the term “associated with” only specifies a general link to the model instance, and not, for example, the execution of a training process on the model instance. Additionally, to the extent that the “edge node” requires a computing device to perform the operations, this aspect would amount to no more than mere instructions to apply an exception using a generic computing device (MPEP § 2106.04(d)(I)).]
“transmitting the first vector, after compression, to the central node that includes the central model” and “transmitting the second vector, after compression, to the central node that includes the central model” (claims 1 and 11) [These elements constitute “adding insignificant extra-solution activity to the judicial exception” (MPEP § 2106.05(g)). This limitation does not meaningfully limit the claim because it merely amounts to necessary data gathering and outputting, which is a category of insignificant extra-solution activity identified in MPEP § 2106.05(g).]
“A non-transitory storage medium having stored therein instructions that are executable by one or more hardware processors to perform operations” (claim 11) [This recitation constitutes no more than mere instructions to apply the judicial exception using generic computer components (MPEP § 2106.04(d)(I)). These additional elements do not meaningfully limit the claim because they merely invoke the use of generic computer components, such as a computer readable medium and a processor, as tools to perform the abstract idea.]
Therefore, under MPEP § 2106.04(d), the additional elements of the claims do not integrate the judicial exception into a practical application.
Step 2B: Does the claim recite additional elements that amount to significantly more than the judicial exception?
No. The claims do not include additional elements that are sufficient for the claims to amount to significantly more than the judicial exception.
Additional elements that are mere instructions to apply an exception, or that merely generally link the use of a judicial exception to a particular technological environment or field of use, do not constitute significantly more than a judicial exception under MPEP § 2106.05(I)(A).
Additional elements that are considered to be extra-solution activity do not amount to significantly more if, upon their reevaluation in Step 2B, they are also merely appending “well-understood, routine, conventional activities previously known to the industry, specified at a high level of generality, to the judicial exception” (MPEP § 2106.05(I)(A)). Here, the additional elements that were previously identified as extra-solution activity are well-understood, routine, conventional activity for the following reasons.
The limitation of “transmitting the vector, after compression, to the central node that includes the central model” is well-understood, routine, conventional activity because it is merely a limitation of “receiving or transmitting data over a network” recognized in MPEP § 2106.05(d)(II) as being well‐understood, routine, and conventional computer functions.
Dependent Claims
The remaining dependent claims do not recite additional elements, whether considered individually or in combination, that are sufficient to integrate the judicial exception into a practical application or amount to significantly more than the judicial exception.
The limitations of the dependent claims are analyzed below:
Claim 2: “wherein the first vector is generated to include respective signs of the gradients, but not the gradients themselves.” [This merely further defines an element that is involved in the existing mental process in a manner such that the existing mental process still remains a mental process. A human is capable of creating a vector of signs, particularly with the use of pen and paper. This claim does not recite any non-abstract additional elements for purposes of Step 2A Prong Two and Step 2B analysis.]
Claim 3: “wherein performing percentage sign compression comprises: randomly removing one or more signs from the first vector to generate the second vector that is transmitted to the central node.” [This is a mental process that can be performed by observation, evaluation, judgment, and opinion. A human is capable of creating a vector of signs, particularly with the use of pen and paper, and is also capable of randomly discarding such signs.]
Claim 4: “wherein the signs removed from the first vector are removed based on a user-defined parameter (α) that provides a reduction factor applied to the vector.” [This limitation merely further defines the mental process recited in the parent claim and is therefore considered to be part of the mental process of the parent claim. This claim does not recite any non-abstract additional elements for purposes of Step 2A Prong Two and Step 2B analysis.]
Claim 5: “wherein signs remaining in the first vector maintain a same order as in a uncompressed vector.” [This limitation merely further defines the mental process recited in the parent claim and is therefore considered to be part of the mental process of the parent claim. This claim does not recite any non-abstract additional elements for purposes of Step 2A Prong Two and Step 2B analysis.]
Claim 6: “wherein presence or lack of overfitting is determined based on a slope of […] that includes validation data points generated at the edge node” [This limitation merely further defines the mental process recited in the parent claim and is therefore considered to be part of the mental process of the parent claim. While this claim recites the use of a slope of a linear regression, the claim presumes that the slope already exists, and there is no limitation of actually computing the slope or performing data validation.] “a linear regression.” [This limitation recites an abstract idea in the form of a mathematical concept, since a linear regression of validation data points is a mathematical concept in the form of a mathematical computation.]
Claim 7: “wherein when sign compression is performed, a scaling factor is applied to signs in the first vector.” [This is a mental process that can be performed by observation, evaluation, judgment, and opinion. In this context, applying a scaling factor merely requires associating the scaling factor with the signs, which can be performed by a human.]
Claim 8: “wherein each gradient corresponds to a respective aspect of a configuration, update, and/or operation, of the model instance.” [This limitation merely further defines the mental process recited in the parent claim and is therefore considered to be part of the mental process of the parent claim. This claim does not recite any non-abstract additional elements for purposes of Step 2A Prong Two and Step 2B analysis.]
Claim 9: “wherein the operations further comprise receiving, by the edge node from the central node, an updated central model” [These elements are additional elements besides the abstract idea but constitute merely “adding insignificant extra-solution activity to the judicial exception” (MPEP § 2106.05(g)). This limitation does not meaningfully limit the claim because it merely amounts to necessary data gathering and outputting, which is a category of insignificant extra-solution activity identified in MPEP § 2106.05(g). Furthermore, for purposes of Step 2B analysis, these elements are well-understood, routine, conventional activity because it is merely a limitation of “receiving or transmitting data over a network” recognized in MPEP § 2106.05(d)(II) as being well‐understood, routine, and conventional computer functions.] “that was created in part based on the first or second vector sent by the edge node to the central node.” [These elements do no more than generally link the use of a judicial exception to a particular technological environment or field of use (MPEP § 2106.05(h)), namely the technological environment of a distributed learning system in which there are edge and central nodes that manage model instances and a central model.]
Claim 10: “wherein the first or second vector sent by the edge node to the central node is decompressible with a scaling factor” [This merely further defines an element that is involved in existing mental process in a manner such that the existing mental process still remains as a mental process. The mere fact that a vector is decompressible does not require any specific technical implementation distinguishing over a mental process. This claim does not recite any non-abstract additional elements for purposes of Step 2A Prong Two and Step 2B analysis.]
Claims 12-20 recite features that are the same or substantially the same as those of claims 2-10, and are therefore given the same analysis that was given for the corresponding claims discussed above.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
1. Claims 1-4, 8-9, 11-14, and 18-19 are rejected under 35 U.S.C. 103 as being unpatentable over Pezeshki et al. (US 2022/0124518 A1) (“Pezeshki”) in view of Turkson et al., “Unsupervised Multi-layer Spiking Convolutional Neural Network Using Layer-Wise Sparse Coding,” in H. Yang et al. (Eds.): ICONIP 2020, LNCS 12534, pp. 353–365, 2020 (“Turkson”) and Basu et al., “Qsparse-local-SGD: Distributed SGD with Quantization, Sparsification, and Local Computations,” arXiv:1906.02367v2 [stat.ML] 2 Nov 2019 (“Basu”).
As to claim 1, Pezeshki teaches a method, comprising:
in an edge node, of a group of edge nodes that are each operable to communicate with a central node, performing operations comprising: [[0055]: “FIG. 3 is a diagram illustrating an example 300 of federated learning, in accordance with the present disclosure. As shown, a base station 305 may communicate with a set of UEs 310.” [0059]: “‘Federated learning round’ refers to the training done by a UE 310 that corresponds to an update provided by the UE 310 to the base station 305.” [0065]: “the base station 305 (e.g., using the second communication manager 325) may aggregate the updates received from the UEs 310.” That is, the base station corresponds to a central node, while the UEs are edge nodes.]
generating a vector that includes gradients associated with a model instance, of a central model, that is operable to run at the edge node; [[0062]: “The first communication manager 320 may perform one or more SGD procedures to determine the optimized parameters w(n) and may determine the gradients, gk(n)∇Fk(w(n)), of the loss function Fk(w).” The gradients are of the model that the UE received (see [0058]: “The UEs 310 may locally train the machine learning component using training data collected by the UEs, respectively.”), which is the distributed instance of the central model at the base station as shown in FIG. 3.] […],
[…] performing sign compression on the vector to generate a first vector […] and transmitting the first vector, after compression, to the central node that includes the central model; and [[0040]: “The term “update resolution” refers to the accuracy of the quantized updates (e.g., gradient vectors). For example, one quantization scheme that may be used is a sign stochastic gradient descent (signSGD) in which gradient elements are quantized to a single bit such that only the sign corresponding to each (e.g., “+” or “−”) is transmitted.” [[0064]: “As shown by reference number 340, the UEs 310 may transmit their respective local updates… In some aspects, the local update may include a compressed version of a local update.”]
[…] performing […] sign compression on the vector to generate a second vector […] and transmitting the second vector, after compression, to the central node that includes the central model. [[0040]: “The term “update resolution” refers to the accuracy of the quantized updates (e.g., gradient vectors). For example, one quantization scheme that may be used is a sign stochastic gradient descent (signSGD) in which gradient elements are quantized to a single bit such that only the sign corresponding to each (e.g., “+” or “−”) is transmitted.” [[0064]: “As shown by reference number 340, the UEs 310 may transmit their respective local updates… In some aspects, the local update may include a compressed version of a local update.”]
Pezeshki does not explicitly teach:
(1) “determining whether the model instance is overfitting to data generated at the edge node”;
(2) the conditions of “in a case where it is determined that the model instance is not overfitting to the data” and “in a case where it is determined that the model instance is overfitting to the data” and the related limitations of “when overfitting is not indicated” and “when overfitting is indicated” for those conditions;
(3) the compression being a “percentage sign compression” in the case when overfitting is indicated.
Turkson teaches “determining whether the model instance is overfitting to data generated at the edge node” [§ 1, last paragraph: “To avoid overfitting, we included a dynamic dropout technique as a mask to randomly disable activation of unit and to shorten the training time.” § 2.4: “Overfitting normally occurs when training accuracy is above 99%, and at this stage, the testing accuracy neither increase nor decrease. Therefore, adding a dynamic dropout when the accuracy of the training set reaches 97% to prevent co-adaptation of neurons.” In other words, an accuracy of the training set above the threshold constitutes a determination of overfitting. Note that the feature of “data generated at the edge node” is already taught by the base reference in the form of collected training data (Pezeshki, [0058]: “The UEs 310 may locally train the machine learning component using training data collected by the UEs, respectively.”), and the training data in this further reference is analogous to this feature since it is also training data.] and the conditions of “in a case where it is determined that the model instance is not overfitting to the data” and “in a case where it is determined that the model instance is overfitting to the data” and the related limitations of “when overfitting is not indicated” and “when overfitting is indicated” for those conditions [§ 2.4: “Overfitting normally occurs when training accuracy is above 99%, and at this stage, the testing accuracy neither increase nor decrease. Therefore, adding a dynamic dropout when the accuracy of the training set reaches 97% to prevent co-adaptation of neurons.” That is, the determination of overfitting is taught in conjunction with both the condition in which overfitting is detected (accuracy reaches 97%) and the condition in which overfitting is not detected (accuracy does not reach 97%).]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Pezeshki with the teachings of Turkson by implementing the step of “determining whether the model instance is overfitting to data generated at the edge node” so as to result in the conditions of “in a case where it is determined that the model instance is not overfitting to the data” and “in a case where it is determined that the model instance is overfitting to the data” and the related limitations of “when overfitting is not indicated” and “when overfitting is indicated” for those conditions, as recited in the instant claim. Doing so would have enabled assessment of overfitting and action to be taken during a training process when overfitting is detected (see Turkson, § 1, last paragraph).
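Solely for purposes of illustration, the following Python sketch shows a threshold-based overfitting determination of the kind quoted from Turkson; the 0.97 threshold is taken from the quoted passage, and all other particulars (function name, example values) are hypothetical.

    def overfitting_indicated(training_accuracy, threshold=0.97):
        # Threshold-based determination (cf. Turkson, Sec. 2.4, as quoted above).
        return training_accuracy >= threshold

    print(overfitting_indicated(0.95))   # False: no dropout would be applied
    print(overfitting_indicated(0.98))   # True: dynamic dropout would be applied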
Basu, which pertains to gradient compression techniques for distributed machine learning (see abstract) including federated learning (see § 1, paragraph 2), teaches “percentage sign compression” [§ 2, paragraph 2: “In quantization, we reduce precision of the gradient vector by mapping each of its components by a deterministic or randomized map to a finite number of quantization levels. In sparsification, we sparsify the gradients vector before using it to update the parameter vector, by taking its top k components, denoted by Topk, or choosing k components uniformly at random, denoted by Randk.” See also § 2.2, paragraph 1: “For any x ∈ R, Topk(x) is equal to a d-length vector, which has at most k non-zero components whose indices correspond to the indices of the largest k components (in absolute value) of x. Randk(x) is a d-length (random) vector, which is obtained by selecting k components of x uniformly at random.” See also § A.1, text below equation (23): “Compk(x) is a length-d vector, but only (at most) k of its components are non-zero.” Furthermore, Basu teaches that in the context of the “k” selections, values not selected are not transmitted. See page 20, top paragraph: “the Q operator from [AGL+17] further induces sparsity, which results in fewer than k coordinates being transmitted”; § A.1, text below equation (23): “Observe that for any Compk ∈ {Topk, Randk}, Compk(x) is a length-d vector, but only (at most) k of its components are non-zero. This implies that, by treating Compk(x) a length-k vector whose entries correspond to the k non-zero entries of x” (page 20, top paragraph). Therefore, Basu, in combination with Pezeshki teaches a compression in which only a percentage of the gradient values (which are signs) are transmitted.]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of the references combined thus far with the teachings of Basu by applying the Randk sparsification technique to the sign vectors so as to arrive at the limitations of “percentage sign compression” as recited in the instant claim. The motivation would have been to increase the communication efficiency of gradients, as suggested by Basu (see § 1.1, last sentence 1: “mitigating the communication bottlenecks in training high dimensional models over bandwidth limited networks”). Furthermore, one of ordinary skill in the art would have recognized that sparsification of gradients in Basu and the dropout described in Turkson are both are in the general category of sparsity techniques that can be used to avoid or mitigate the effects of overfitting.
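Solely for purposes of illustration, the following Python sketch shows a Randk-style selection (cf. Basu, § 2.2, as quoted above) applied to a sign-quantized vector, which is how the combination is being read onto the “percentage sign compression” limitation; the example vector, the value of k, and the zero-fill output format are hypothetical particulars chosen by the Examiner.

    import random

    def rand_k(vector, k):
        # Randk-style sparsification (cf. Basu, Sec. 2.2): keep k components chosen
        # uniformly at random; all other components are set to zero.
        kept = set(random.sample(range(len(vector)), k))
        return [v if i in kept else 0.0 for i, v in enumerate(vector)]

    signs = [1.0, -1.0, -1.0, 1.0, 1.0, -1.0]   # hypothetical sign-quantized gradients
    print(rand_k(signs, k=3))                    # only 3 of the 6 signs survive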
As to claim 2, the combination of Pezeshki, Turkson, and Basu teaches the method as recited in claim 1, wherein the first vector is generated to include respective signs of the gradients, but not the gradients themselves. [Pezeshki, [0040]: “The term “update resolution” refers to the accuracy of the quantized updates (e.g., gradient vectors). For example, one quantization scheme that may be used is a sign stochastic gradient descent (signSGD) in which gradient elements are quantized to a single bit such that only the sign corresponding to each (e.g., “+” or “−”) is transmitted.”]
As to claim 3, the combination of Pezeshki, Turkson, and Basu teaches the method as recited in claim 1, as set forth above.
Basu further teaches “wherein performing percentage sign compression comprises: randomly removing one or more signs from the first vector to generate the second vector that is transmitted to the central node” [§ 2, paragraph 2: “In quantization, we reduce precision of the gradient vector by mapping each of its components by a deterministic or randomized map to a finite number of quantization levels. In sparsification, we sparsify the gradients vector before using it to update the parameter vector, by taking its top k components, denoted by Topk, or choosing k components uniformly at random, denoted by Randk.” See also § 2.2, paragraph 1: “For any x ∈ R, Topk(x) is equal to a d-length vector, which has at most k non-zero components whose indices correspond to the indices of the largest k components (in absolute value) of x. Randk(x) is a d-length (random) vector, which is obtained by selecting k components of x uniformly at random.” See also § A.1, text below equation (23): “Compk(x) is a length-d vector, but only (at most) k of its components are non-zero.” While Basu does not explicitly teach the sign-quantized format, the limitation of “one or more signs” is already taught by the base reference Pezeshki, and the application of sparsification to the vector in Pezeshki results in the instant limitation. With regard to the limitation of “removing,” in the context of sparsification, non-selection constitutes removal because non-selected values are set to zero in order for the vector to be sparse. Furthermore, Basu teaches that in the context of the “k” selections, values not selected are not transmitted. See page 20, top paragraph: “the Q operator from [AGL+17] further induces sparsity, which results in fewer than k coordinates being transmitted”; § A.1, text below equation (23): “Observe that for any Compk ∈ {Topk, Randk}, Compk(x) is a length-d vector, but only (at most) k of its components are non-zero. This implies that, by treating Compk(x) a length-k vector whose entries correspond to the k non-zero entries of x” (page 20, top paragraph).], thereby teaching the limitation of “performing percentage sign compression” [As noted above, the combination of sparsification with the vectors in Pezeshki results in sparsifying the sign vector, thereby arriving at the instant limitation.]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of the references combined thus far with the teachings of Basu by applying the Randk sparsification technique to the sign vectors so as to arrive at the limitations of the instant dependent claim. Since the teachings of Basu discussed above are part of the sparsification technique discussed in the rejection of parent independent claim, the motivation for doing so is the same as the motivation given for Basu in the rejection of the parent independent claim.
As to claim 4, the combination of Pezeshki, Turkson, and Basu teaches the method as recited in claim 3, as set forth above.
Basu further teaches “wherein the signs removed from the first vector are removed based on a user-defined parameter (α) that provides a reduction factor applied to the vector.” [As noted above, Randk is performed by choosing k components uniformly at random. Here, it would be understood by one of ordinary skill in the art that “k” is a parameter that is defined by a user because the context indicates that k is necessarily defined as some specific value when the method is performed. As an example, see § 5.2.2: “We use… where … k = 40 is the sparsity.” Note that while the term “user” does not appear, the aspect of the user is referred to by “we,” which refers to the authors acting as users performing the described experiments.]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of the references combined thus far, including the above teachings of Basu, so as to arrive at the limitations of the instant dependent claim. Since the teachings of Basu discussed above are part of the sparsification technique discussed in the rejection of parent independent claim, the motivation for doing so is the same as the motivation given for Basu in the rejection of the parent independent claim.
As to claim 8, the combination of Pezeshki, Turkson, and Basu teaches the method as recited in claim 1, wherein each gradient corresponds to a respective aspect of a configuration, update, and/or operation, of the model instance. [Pezeshki, [0062]: “A stochastic gradient descent (SGD) algorithm may be used to optimize the model parameters w(n). The first communication manager 320 may perform one or more SGD procedures to determine the optimized parameters w(n) and may determine the gradients, gk(n)∇Fk(w(n)), of the loss function Fk(w).” That is, the gradients correspond to the operation of training the model instance using stochastic gradient descent.]
As to claim 9, the combination of Pezeshki, Turkson, and Basu teaches the method as recited in claim 1, wherein the operations further comprise receiving, by the edge node from the central node, an updated central model that was created in part based on the first or second vector sent by the edge node to the central node. [Pezeshki, [0056]: “As shown by reference number 315, the base station 305 may transmit a machine learning component to each of the UEs 310.” Pezeshki, [0067]: “Federated learning involves repeated updates to a machine learning component. Local updates are transmitted from UEs to a base station. Hundreds or thousands of federated learning rounds may be performed before convergence is achieved.” See also [0066]: “As shown by reference number 350, the second communication manager 325 may update the global machine learning component based on the aggregated updates…. The second communication manager 325 may update the global machine learning component using multiple rounds of updates from the UEs 310… In some aspects, the base station 305 may transmit an update associated with the updated global machine learning component to the UEs 310.” That is, each subsequent round is based on the local updates (including the compressed vector) sent in a previous round.]
As to claims 11-14 and 18-19, these claims are directed to a computer readable medium for performing operations that are the same or substantially the same as those of claims 1-4 and 8-9. Therefore, the rejections made to those previous claims are applied to claims 11-14 and 18-19, respectively.
Furthermore, Pezeshki teaches “a non-transitory storage medium having stored therein instructions that are executable by one or more hardware processors to perform operations” [[0010]: “a non-transitory computer-readable medium storing a set of instructions for wireless communication includes one or more instructions that, when executed by one or more processors of a UE, cause the UE to”]. The Examiner notes that similar to claim 1, the claim language of “either…or…” in this claim is interpreted as an alternative expression within the definition of the operations and not requiring system capability of performing both cases.
2. Claims 5 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Pezeshki in view of Turkson and Basu, and further in view of Wangni et al., “Gradient Sparsification for Communication-Efficient Distributed Optimization,” arXiv:1710.09854v1 [cs.LG] 26 Oct 2017 (“Wangni”).
As to claim 5, the combination of Pezeshki, Turkson, and Basu teaches the method as recited in claim 3, as set forth above, but does not explicitly teach the further limitations of the instant dependent claim.
Wangni teaches “wherein signs remaining in the first vector maintain a same order as in a uncompressed vector.” [§ 3.3 (coding strategy): “Once we have computed a sparsified gradient vector Q(g), we need to pack the resulting vector into a message for transmission… Therefore, to represent QB(g), we only need one floating point 1/λ, plus the non-zero coordinates i and its sign sign(gi). Here we give an example about the format,” As shown in the illustrations below this paragraph, the sparsified vector maintains the same indices (and thus the order) of the remaining entries as the original non-sparsified vector, i.e., the indices of 1, 4, 5, and 6, and the compressed vectors QA and QB maintain the order, e.g., the order of indices 4 and 6 in QB. Note that QB, which is a compressed representation of a sign compression, is a suitable representation of the vectors at issue. The instant claim would be a reduced situation in which the vector QA is not needed.]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of the references combined thus far with the teachings of Wangni in formatting sparsified and compressed vectors so as to arrive at the limitations of the instant claim. The motivation would have been to provide a format for the output vector that is suitable “to pack the resulting vector into a message for transmission” (Wangni, § 3.3, paragraph 1) in a manner that is efficient in terms of communication (see Wangni, abstract: “we propose a convex optimization formulation to minimize the coding length of stochastic gradients.”).
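Solely for purposes of illustration, the following Python sketch shows an index-and-sign packing of the general kind quoted from Wangni, in which the surviving entries are listed in increasing index order so that their relative order matches the uncompressed vector; the example vector, the scale value, and the tuple output format are hypothetical.

    def pack_sparse_signs(sparse_vector, scale):
        # Pack nonzero entries as (index, sign) pairs in increasing index order,
        # together with a single scaling value (hypothetical format).
        pairs = [(i, v) for i, v in enumerate(sparse_vector) if v != 0.0]
        return scale, pairs

    print(pack_sparse_signs([1.0, 0.0, 0.0, -1.0, 1.0, -1.0], scale=0.25))
    # -> (0.25, [(0, 1.0), (3, -1.0), (4, 1.0), (5, -1.0)])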
As to claim 15, the rejection made to claim 5 is applied to claim 15, which recite the same or substantially the same further limitations.
3. Claims 6 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Pezeshki in view of Turkson and Basu, and further in view of Xue et al. (US 2020/0167639 A1) (“Xue”).
As to claim 6, the combination of Pezeshki, Turkson, and Basu teaches the method as recited in claim 1, as set forth above, but does not explicitly teach the further limitations of the instant dependent claim.
Xue teaches “wherein presence, or lack, of overfitting is determined based on a slope of a linear regression that includes validation data points generated at the edge node.” [Xue, [0030]: “The present embodiments perform the analysis of the behavior of the training accuracy curve based on the slope of the curve in the windows 304 and the maximum slope of the curve after the maximum value.” Xue, [0031]: “Block 402 determines the slope ki of each window i from the windows 304. This slope can be determined by any appropriate process, for example through linear regression.” In regards to the aspect of “validation data points,” Xue, [0029] teaches that in the curve of FIG. 3, “the vertical axis representing a loss function or errors and with the horizontal axis representing training time or a quantity of training data.” This loss function includes validation data points as disclosed in Xue, [0052]: “The training process uses part of the training data to train the network and another part of the training data to perform validation of the trained neural network's outputs. A training status module 810 monitors the status of the training process by measuring the accuracy of the neural network using, e.g., a loss function over time.” That is, the accuracy of the curve is accuracy of the outputs (see also Xue, [0019]), which is determined based on the validation data. Furthermore, as shown in FIG. 6 and the associated descriptions of Xue, [0035]-[0037], the determination of overfitting in blocks 608 and 624 is based on the accuracy curve.]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have further combined the teachings of the references combined thus far with the teachings of Xue so as to arrive at the limitations of the instant dependent claim. The motivation for doing so would have been to implement a manner of measuring accuracy changes during training, which is a factor that can be used to assess overfitting (see Xue, [0037]: “If block 608 found that the trend of the training accuracy curve is upward, block 624 indicates that the training process is overfitting the data.”; [0039]: “Block 704 determines the status of the training process in the manner described above. The status can be checked at any appropriate interval.”; [0040]: “In the case of divergence, overfitting, and underfitting, block 706 adjusts parameters of the training process to improve the outcome.”).
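Solely for purposes of illustration, the following Python sketch determines a trend from the slope of a least-squares line fit over a window of data points, of the general kind described in Xue; the window contents, the use of validation losses rather than an accuracy curve, and the simple positive-slope test are hypothetical simplifications chosen by the Examiner.

    def window_slope(values):
        # Least-squares slope of the values against their indices (linear regression).
        n = len(values)
        mean_x = (n - 1) / 2.0
        mean_y = sum(values) / n
        num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(values))
        den = sum((i - mean_x) ** 2 for i in range(n))
        return num / den

    validation_losses = [0.31, 0.30, 0.32, 0.35, 0.38]   # hypothetical window of points
    print(window_slope(validation_losses) > 0)            # a rising trend may suggest overfitting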
As to claim 16, the rejection made to claim 6 is applied to claim 16, which recites the same or substantially the same further limitations.
4. Claims 7, 10, 17 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Pezeshki in view of Turkson, Basu, and Xue, and further in view of Zheng (“Zheng”).
As to claim 7, the combination of Pezeshki, Turkson, and Basu teaches the method as recited in claim 1, but does not teach the further limitations of the instant dependent claim 7.
Zheng teaches the further limitations of “wherein when sign compression is performed, a scaling factor is applied to signs in the first vector.” [Abstract: “By partitioning the gradient into blocks, a blockwise compressor is introduced such that each gradient block is compressed and transmitted in 1-bit format with a scaling factor, leading to a nearly 32x reduction on communication.” § 3.2, paragraph 2: “Specifically, we partition the compressor input v into B blocks, where each block b has db elements indexed by Gb. Block b is then compressed with scaling factor ||vGb ||1/db (where vGb is the subvector of v with elements in block b).” As shown in Algorithm 3, line 6, the scaling factor is applied to sign(pt, i, G), which is analogous to the resulting compressed vector.].
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of the references combined thus far with the teachings of Zheng by implementing the scaling technique of Zheng so as to arrive at the limitation that “when sign compression is performed, a scaling factor is applied to signs in the resulting compressed vector.” The motivation would have been to improve upon the SignSGD technique by taking into account the magnitude of the gradients, as suggested by Zheng (see § 3.2, paragraph 1: “Compared to using only the sign operator as in signSGD, the factor ||v||1/d can preserve the gradient’s Magnitude,” which describes the technique of using a scaling factor in general to improve upon signSGD.).
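Solely for purposes of illustration, the following Python sketch shows 1-bit sign compression with an L1-based scaling factor of the general form ||v||1/d quoted from Zheng; block partitioning is omitted for brevity and the example vector is hypothetical.

    def sign_compress_with_scale(v):
        # 1-bit sign compression with an L1-based scaling factor ||v||_1 / d.
        d = len(v)
        scale = sum(abs(x) for x in v) / d
        signs = [1.0 if x >= 0 else -1.0 for x in v]
        return scale, signs

    print(sign_compress_with_scale([0.4, -1.2, 0.1, -0.3]))   # -> (0.5, [1.0, -1.0, 1.0, -1.0])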
As to claim 10, the combination of Pezeshki, Turkson, and Basu teaches the method as recited in claim 1, but does not teach the further limitations of the instant dependent claim 10.
Zheng teaches the further limitations of “wherein the first or second vector sent by the edge node to the central node is decompressible with a scaling factor.” [Abstract: “By partitioning the gradient into blocks, a blockwise compressor is introduced such that each gradient block is compressed and transmitted in 1-bit format with a scaling factor, leading to a nearly 32x reduction on communication.” § 3.2, paragraph 2: “Specifically, we partition the compressor input v into B blocks, where each block b has db elements indexed by Gb. Block b is then compressed with scaling factor ||vGb ||1/db (where vGb is the subvector of v with elements in block b).” As shown in Algorithm 3, line 6, the scaling factor is applied to sign(pt, i, G), which is analogous to the resulting compressed vector and is data pushed (sent) by the worker (edge node) to the server (central node), as indicated by the description of “transmitted in 1-bit format” in the abstract. In regards to the aspect of “decompressible,” the application of the scaling factor to the 1-bit sign data constitutes decompression, since it recovers the magnitude on the basis of the scaling factor.].
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of the references combined thus far with the teachings of Zheng by implementing the scaling technique of Zheng so as to arrive at the limitation that “the compressed vector sent by the edge node to the central node is decompressible with a scaling factor.” The motivation would have been to improve upon the SignSGD technique by taking into account the magnitude of the gradients, as suggested by Zheng (see § 3.2, paragraph 1: “Compared to using only the sign operator as in signSGD, the factor ||v||1/d can preserve the gradient’s Magnitude,” which describes the technique of using a scaling factor in general to improve upon signSGD.).
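Solely for purposes of illustration, and continuing the preceding sketch, the following Python sketch shows the corresponding decompression step, in which the scaling factor is applied to the transmitted signs to recover approximate magnitudes; this is a hypothetical simplification and is not asserted to be the specific method of any cited reference.

    def decompress(scale, signs):
        # Recover approximate values by applying the scaling factor to the signs.
        return [scale * s for s in signs]

    print(decompress(0.5, [1.0, -1.0, 1.0, -1.0]))   # -> [0.5, -0.5, 0.5, -0.5]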
As to claims 17 and 20, the further limitations of these claims are the same or substantially the same as those of claims 7 and 10. Therefore, the rejections made for claims 7 and 10 are applied to claims 17 and 20, respectively.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. The following references reflect the state of the art.
Sun et al., “meProp: Sparsified Back Propagation for Accelerated Deep Learning with Reduced Overfitting,” Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017 (“Sun”) teaches that gradient sparsification achieves effects similar to dropout in avoiding overfitting (see § 1, paragraph 4), evidencing that gradient sparsification and dropout are both in the general category of sparsity techniques for avoiding or mitigating the effects of overfitting.
Thakkar et al. (US 20220383204 A1) teaches the use of both gradient sparsification and sign gradient descent in a federated learning system.
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to YAO DAVID HUANG whose telephone number is (571)270-1764. The examiner can normally be reached Monday - Friday 9:00 am - 5:30 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang can be reached at (571) 270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Y.D.H./Examiner, Art Unit 2124
/MIRANDA M HUANG/Supervisory Patent Examiner, Art Unit 2124