Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .
Remarks
This Office Action is responsive to Applicants' Amendment filed on March 4, 2026, in which claims 1-14 and 16-20 are currently amended. Claims 1-20 are currently pending.
Drawings
Applicant's amendments made to the drawings are acknowledged. Examiner’s objection to the drawings are hereby withdrawn, as necessitated by Applicant’s amendments made to the drawings.
Specification
Applicant's amendments made to the specification are acknowledged. Examiner’s objection to the specification are hereby withdrawn, as necessitated by Applicant’s amendments made to the specification.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on November 8, 2019 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Response to Arguments
The rejections to claim 14 under 35 U.S.C. § 112(b) are hereby withdrawn, as necessitated by applicant's amendments and remarks made to the rejections.
Applicant’s arguments with respect to rejection of claims 10 under 35 U.S.C. 112(b) based on amendment have been considered, however, are not persuasive.
Applicant’s arguments on p. 11 of the Remarks submitted 3/4/2026 do not address the indefinite language in claim 10. Specifically, “Update the first parameter update tree that tracks the prediction residuals with calling a predictive residual encoding encoder to calculate the prediction residuals” is grammatically ambiguous. It’s unclear what is meant by “with calling”, if it’s meant to mean “alongside calling”, “by calling”, or something else together. It’s further unclear what “with calling” modifies (the updating, the tracking, the calculating) and what the relationship between “with calling a predictive residual encoding” and “to calculate the prediction residuals” is. For at least these reasons Examiner asserts that it is reasonable and appropriate to maintain the rejection of claim 10 under 35 U.S.C. 112(b).
Applicant’s arguments with respect to rejection of claims 1-20 under 35 U.S.C. 101 based on amendment have been considered, however, are not persuasive.
With respect to Applicant's arguments on pp. 12-14 of the Remarks submitted 3/4/2026 that the claims are eligible in view of the 2024 SME guidance, Examiner respectfully disagrees. First, Examiner notes that the instant claims are of significantly different scope than any of the claims in Example 47-49. Second, Examiner notes that the 2024 SME guidance states that Example 47 Claim 2 is ineligible because it recites concepts tied to mathematical calculations and relationships. Similarly, the instant claims are explicitly tied to mathematical calculations ([¶0046] "The encoder may finetune the neural network filter by using the ground-truth data which is available at encoder side (the uncompressed data). Finetuning may be performed in order to improve the neural network filter when applied to the current input data, such as to one or more video frames. Finetuning may comprise running one or more optimization iterations on some or all the learnable weights of the neural network filter. An optimization iteration may comprise computing gradients of a loss function with respect to some or all the learnable weights of the neural network filter, for example by using the backpropagation algorithm, and then updating the some or all learnable weights by using an optimizer, such as the stochastic gradient descent optimizer. The loss function may comprise one or more loss terms. One example loss term may be the mean squared error (MSE). Other distortion metrics may be used as the loss terms"). In view of the amended claim limitations and the instant specification Examiner asserts that it is clear the claim as a whole is directed towards mental processes and mathematical calculations and relationships. For at least these reasons and those further detailed below, Examiner asserts that it is reasonable and appropriate to maintain the rejection under 35 U.S.C. 101.
Applicant’s arguments with respect to rejection of claims 1-20 under 35 U.S.C. 103 based on amendment have been considered.
With respect to Applicant's arguments on p. 17 of the Remarks submitted 3/4/2026 that "Liu does not disclose that the residuals are prediction residuals", Examiner respectfully disagrees. Liu does predictive compression by having each worker maintain a predictor state hki based on past gradients. Instead of compressing the full gradient update the worker forms a gradient residual, compresses that residual, and then updates the state with the compressed residual, which makes hki behave like an ([p. 4] "exponential moving average of the local gradient in expectation"). In other words, it's predictive compression because the algorithm first predicts the current gradient from past gradients (via a running state like an exponential moving average) and then compresses only the residual between the prediction and the true update.
With respect to Applicant's arguments on p. 17 of the Remarks submitted 3/4/2026 that "Liu does not disclose [...] that the weight updates are parameter updates of at least one neural network parameter", Examiner respectfully disagrees. In Liu the weight parameter updates are explicitly parameter updates of vector/tensor xk+1 in vector space Rd ([p. 1] "the master node further takes the gradient descent step with the averaged gradient and broadcasts the new model parameter xk+1 to all worker nodes" [p. 3] "A vector x [...] a vector x ∈ Rd can be further decomposed into blocks, i.e., x = (x(1) ,x(2) ,··· ,x(m) ) with x(l) ∈ Rdl and m l=1 dl = d, and the blocks can be compressed independently" [p. 7] "Figures 4 and 5 show the training loss and test loss for each epoch during the training of LeNet on the MNIST dataset and Resnet18 on CIFAR10 dataset").
With respect to Applicant's arguments on p. 17 of the Remarks submitted 3/4/2026 that "Liu does not describe predictive compression", Examiner respectfully disagrees. Liu does predictive compression by having each worker maintain a predictor state hki based on past gradients. Instead of compressing the full gradient update the worker forms a gradient residual, compresses that residual, and then updates the state with the compressed residual, which makes hki behave like an ([p. 4] "exponential moving average of the local gradient in expectation"). In other words, it's predictive compression because the algorithm first predicts the current gradient from past gradients (via a running state like an exponential moving average) and then compresses only the residual between the prediction and the true update.
The remaining arguments are moot in view of a new ground of rejection set forth below.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claim 10 is rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Regarding claim 10, “Update the first parameter update tree that tracks the prediction residuals with calling a predictive residual encoding encoder to calculate the prediction residuals” is grammatically ambiguous. It’s unclear what is meant by “with calling”, if it’s meant to mean “alongside calling”, “by calling”, or something else together. It’s further unclear what “with calling” modifies (the updating, the tracking, the calculating) and what the relationship between “with calling a predictive residual encoding” and “to calculate the prediction residuals” is. In the interest of further Examination the claim limitation is interpreted as updating by calling a predictive residual encoding decoder and updating the calculate the weight updates.
Claim Rejections - 35 USC § 101
101 Rejection
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-20 are rejected under 35 USC § 101 because the claimed invention is directed to non-statutory subject matter.
Regarding Claim 1: Claim 1 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 1 is directed to an apparatus, which is directed to a product, one of the statutory categories.
Step 2A Prong One Analysis: Claim 1 under its broadest reasonable interpretation is a series of mental processes. For example, but for the generic computer components language, the above limitations in the context of this claim encompass neural network processing, including the following:
maintain a first parameter update tree that tracks prediction residuals of weight updates of a machine learning model; wherein the weight updates are parameter updates of at least one neural network parameter tensor (observation, evaluation, and judgement),
maintain a second parameter update tree that tracks the weight updates of the machine learning model (observation, evaluation, and judgement)
determine whether to signal to a decoder the first bitstream generated for the prediction residuals or the second bitstream generated for the weight updates; wherein the first parameter update tree and the second parameter update tree are used to produce alternative compressed representations of the same weight updates that are the parameter updates of the at least one neural network parameter tensor. (observation, evaluation, and judgement)
Therefore, claim 1 recites an abstract idea which is a judicial exception.
Step 2A Prong Two Analysis: Claim 1 recites additional elements “at least one processor and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to”. However, these additional features are computer components recited at a high-level of generality, such that they amount to no more than mere instructions to apply the judicial exception using a generic computer component. An additional element that merely recites the words “apply it” (or an equivalent) with the judicial exception, or merely includes instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea, does not integrate the judicial exception into a practical application (See MPEP 2106.05(f)). Claim 1 also recites additional elements “pass the first parameter update tree and the prediction residuals to an encoder configured to perform predictive compression of the weight updates using the prediction residuals”, “receive a first bitstream generated for the prediction residuals from the encoder”, “pass the second parameter update tree and the weight updates to the encoder”, and “receive a second bitstream generated for the weight updates from the encoder” which amounts to gathering and outputting data which is insignificant extra-solution activity (See MPEP 2106.05(g)). Therefore, claim 1 is directed to a judicial exception.
Step 2B Analysis: Claim 1 does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the lack of integration of the abstract idea into a practical application, the additional elements recited in claim 1 amount to no more than mere instructions to apply the judicial exception using a generic computer component and insignificant extra-solution activity. The gathering and outputting of data is considered well-understood, routine, and conventional in the art (See MPEP 2106.05(d)(II)).
For the reasons above, claim 1 is rejected as being directed to non-patentable subject matter under §101. This rejection applies equally to independent claim 17, which recites a method, as well as to dependent claims 2-7 and 18-20. The additional limitations of the dependent claims are addressed briefly below:
Dependent claims 2 and 18 recites additional observation, evaluation, and judgement “compare a size of the first bitstream generated for the prediction residuals to a size of the second bitstream generated for the weight updates”, “signal the first bitstream for the prediction residuals to a decoder, in response to the first size being less than the second size”, and “signal the second bitstream generated for the weight updates to the decoder, in response to the second size being less than or equal to the first size”
Dependent claims 3 and 19 recites additional observation, evaluation, and judgement “define an encoding flag configured to signal the first bitstream or the second bitstream to the decoder, wherein the encoding flag comprises a value of 1 when the first bitstream for the prediction residuals is signaled to the decoder, and the encoding flag comprises a value of 0 when the second bitstream for the weight updates is signaled to the decoder”
Dependent claims 4 and 20 recites additional insignificant extra-solution activity “determine whether an encoded prediction residual is lossy”, “determine whether an encoded weight update is lossy”, and “determine whether to signal to the decoder the first bitstream generated for the prediction residuals or the second bitstream generated for the weight updates, when the encoded weight update and/or the encoded prediction residual is lossy, based on at least one of: a first bitrate of the encoded prediction residual; a second bitrate of the encoded weight update; a first performance value computed based at least on a decoded prediction residual; or a second performance value computed based at least on a decoded weight update”
Dependent claim 5 recites additional observation, evaluation, and judgement “compute a first rate distortion value corresponding to an encoded prediction residual”, “compute a second rate distortion value corresponding to an encoded weight update”, “signal the encoded prediction residual, in response to the first rate distortion value being less than the second rate distortion value”, and “signal the encoded weight update, in response to the second rate distortion value being less than or equal to the first rate distortion value”
Dependent claim 6 recites additional observation, evaluation, and judgement “determine a first reconstruction accuracy of an encoded prediction residual”, “determine a second reconstruction accuracy an encoded weight update”, “signal the encoded prediction residual, in response to the first reconstruction accuracy being greater than the second reconstruction accuracy”, and “signal the encoded weight update, in response to the second reconstruction accuracy being greater than or equal to the first reconstruction accuracy”
Dependent claim 7 also recites additional observation, evaluation, and judgement “determine a first accuracy of a neural network on a validation dataset based on an encoded prediction residual; determine a second accuracy of the neural network on the validation dataset based on an encoded weight update; signal the encoded prediction residual, in response to the first accuracy being greater than the second accuracy; and signal the encoded weight update, in response to the second accuracy being greater than the first accuracy.”
Regarding Claim 8: Claim 8 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 8 is directed to an apparatus, which is directed to a product, one of the statutory categories.
Step 2A Prong One Analysis: Claim 8 under its broadest reasonable interpretation is a series of mental processes. For example, but for the generic computer components language, the above limitations in the context of this claim encompass neural network processing, including the following:
maintain a first parameter update tree that tracks prediction residuals of weight updates of a machine learning model; wherein the weight updates are parameter updates of at least one neural network parameter tensor; (observation, evaluation, and judgement),
maintain a second parameter update tree that tracks the weight updates of the machine learning model; wherein the prediction residuals are used for predictive compression of the weight updates; (observation, evaluation, and judgement)
update the first parameter update tree that tracks the prediction residuals, in response to the bitstream comprising the encoded prediction residuals (observation, evaluation, and judgement)
update the second parameter update tree that tracks the weight updates, in response to the bitstream comprising the encoded weight updates (observation, evaluation, and judgement)
wherein the first parameter update tree and the second parameter update tree are used to produce alternative compressed representations of the same weight updates that are the parameter updates of the at least one neural network parameter tensor (intended use of the update trees maintained through observation, evaluation, and judgement)
Therefore, claim 8 recites an abstract idea which is a judicial exception.
Step 2A Prong Two Analysis: Claim 8 recites additional elements “at least one processor and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to”. However, these additional features are computer components recited at a high-level of generality, such that they amount to no more than mere instructions to apply the judicial exception using a generic computer component. An additional element that merely recites the words “apply it” (or an equivalent) with the judicial exception, or merely includes instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea, does not integrate the judicial exception into a practical application (See MPEP 2106.05(f)). Claim 8 also recites additional elements “receive a bitstream, the bitstream comprising encoded prediction residuals or encoded weight updates” which amounts to gathering and outputting data which is insignificant extra-solution activity (See MPEP 2106.05(g)). Therefore, claim 8 is directed to a judicial exception.
Step 2B Analysis: Claim 8 does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the lack of integration of the abstract idea into a practical application, the additional elements recited in claim 8 amount to no more than mere instructions to apply the judicial exception using a generic computer component and insignificant extra-solution activity. The gathering and outputting of data is considered well-understood, routine, and conventional in the art (See MPEP 2106.05(d)(II)).
For the reasons above, claim 8 is rejected as being directed to non-patentable subject matter under §101. This rejection applies equally to dependent claims 9-16. The additional limitations of the dependent claims are addressed briefly below:
Dependent claims 9 recites additional observation, evaluation, and judgement “define a decoding flag comprising a value based on an encoding flag received from an encoder; wherein the decoding flag comprises a value of 1 when the bitstream comprises the encoded prediction residuals, and the decoding flag comprises a value of 0 when the bitstream comprises the encoded weight updates”
Dependent claim 10 recites additional observation, evaluation, and judgement “update the first parameter update tree that tracks the prediction residuals with calling a predictive prediction residual encoding encoder to calculate the prediction residuals of the weight updates, in response to the bitstream comprising the encoded weight updates, due to the prediction residuals not being available within a deep context-adaptive binary arithmetic coding decoder; and call a predictive residual encoding decoder, calculate the weight updates using the predictive residual encoding decoder, and update the second parameter update tree that tracks the weight updates the weight updates that are calculated using the predictive residual encoding decoder, in response to the bitstream comprising the encoded prediction residuals, due to the weight updates not being available within the deep context-adaptive binary arithmetic coding decoder”
Dependent claim 11 recites additional observation, evaluation, and judgement “determine whether a prediction residual of a parameter has been skipped, in response to the bitstream comprising the encoded prediction residuals”
Dependent claim 12 recites additional observation, evaluation, and judgement “determine whether a previous weight update is available, in response to determining that the prediction residual of the parameter has been skipped; determine a current weight update to be a previous weight update, in response to determining that the previous weight update is available; and determine the current weight update to be zero, in response to determining that the previous weight update is not available”
Dependent claim 13 recites additional observation, evaluation, and judgement “determine whether a previous weight update is available, in response to determining that the prediction residual of the parameter has not been skipped; determine a current weight update to be a previous weight update added to the prediction residual, in response to determining that the previous weight update is available; and determine the current weight update to be the prediction residual, in response to determining that the previous weight update is not available”
Dependent claim 14 recites additional observation, evaluation, and judgement “determine a weight update of a neural network parameter tensor, wherein the weight update of the neural network parameter tensor comprises one or more values; determine whether at least one value of the one or more values of the weight update of the neural network parameter tensor is nonzero; update the second parameter update tree corresponding to the weight update of the neural network parameter tensor, in response to determining that there is at least one nonzero value of the one or more values of the weight update of the neural network parameter tensor; and skip the neural network parameter tensor with removing the neural network parameter tensor from a list of neural network parameter tensors, in response determining that each value of the one or more values of weight update of the neural network parameter tensor is zero”
Dependent claim 15 recites additional observation, evaluation, and judgement “loop over a set of parameters to determine the current weight update for the set of parameters”
Dependent claim 16 recites additional observation, evaluation, and judgement “maintain a parameter update tree with metadata corresponding to a type of data; wherein the type of data comprises at least one prediction residual of current and past communications; wherein the type of data comprises at least one weight update of current and past communications; and reconstruct, using the metadata, the at least one weight update corresponding to a current state with parsing the parameter update tree and using the at least one prediction residual until a last available weight update”
Therefore, when considering the elements separately and in combination, they do not add significantly more to the inventive concept. Accordingly, claims 1-20 are rejected under 35 U.S.C. § 101.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1, 8, 10, 16, and 17 are rejected under U.S.C. §103 as being unpatentable over the combination of Liu (“A Double Residual Compression Algorithm for Efficient Distributed Learning”, 2020) and Gajjala (“Huffman Coding Based Encoding Techniques for Fast Distributed Deep Learning”, 2020).
PNG
media_image1.png
472
806
media_image1.png
Greyscale
FIG. 1 of Liu
Regarding claim 1, Liu teaches An apparatus comprising: at least one processor; and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to:([p. 7 §5.2] "we use 1 parameter server and 10 workers, each of which is equipped with an NVIDIA Tesla K80 GPU.")
maintain a first parameter update tree that tracks prediction residuals of weight updates of a machine learning model;([p. 4] "each worker node can keep a state variable hki to track its previous gradient information" See also FIG. 1 gradient residual. Gradient residual buffer ∆ki on each worker interpreted as first parameter update tree that tracks residuals of weight updates (hki). See also Algorithm 1 line 5.)
wherein the weight updates are parameter updates of at least one neural network parameter tensor;([p. 1] "the master node further takes the gradient descent step with the averaged gradient and broadcasts the new model parameter xk+1 to all worker nodes" [p. 3] "A vector x [...] a vector x ∈ Rd can be further decomposed into blocks, i.e., x = (x(1) ,x(2) ,··· ,x(m)) with x(l) ∈ Rdl and m l=1 dl = d, and the blocks can be compressed independently" [p. 7] "Figures 4 and 5 show the training loss and test loss for each epoch during the training of LeNet on the MNIST dataset and Resnet18 on CIFAR10 dataset" In Liu the weight parameter updates are explicitly parameter updates of vector/tensor xk+1 in Rd)
maintain a second parameter update tree that tracks the weight updates of the machine learning model;([p. 4] "the master node broadcasts the com pressed model residual with error compensation (ˆqk) to all worker nodes and updates the model […] each worker node receives the compressed model residual (ˆqk) and updates its model xki" See also Algorithm 1 l. 16 and 18. Model residual qk interpreted as weight update of the machine learning model, model xk (or worker coy x^ki) interpreted as second parameter update tree)
pass the first parameter update tree and the prediction residuals to an encoder configured to perform predictive compression of the weight updates using the prediction results;([p. 4] "with DORE, we only need to transmit 2(32d b + 3 2d) bits per iteration. Thus DORE can reduce over 95% of the total communication by compressing both the gradient and model transmission" See Algorithm 1 l. 5-6 gradient residual buffer ∆ki (first parameter update tree) comprising gradient residual (hki) is passed to encoder Q. Liu does predictive compression by having each worker maintain a predictor state hki based on past gradients. Instead of compressing the full gradient update the worker forms a gradient residual, compresses that residual, and then updates the state with the compressed residual, which makes hki behave like an ([p. 4] "exponential moving average of the local gradient in expectation"). In other words, it's predictive compression because the algorithm first predicts the current gradient from past gradients (via a running state like an exponential moving average) and then compresses only the residual between the prediction and the true update)
receive a first bitstream generated for the prediction residuals from the encoder;([p. 4] "each worker node sends the compressed gradient residual (ˆ ∆k i) to the master node and updates its state hk i with ˆ ∆k" [p. 4] "with DORE, we only need to transmit 2(32d b + 3 2d) bits per iteration. Thus DORE can reduce over 95% of the total communication by compressing both the gradient and model transmission" Compressed gradient residual interpreted as first bitstream generated for the residuals from the encoder Q)
pass the second parameter update tree and the weight updates to the encoder;([p. 5] "Compression: ˆqk=Q(qk)" model residual (weight updates of machine learning model) comprising xk (second parameter update tree) explicitly passed to encoder Q)
receive a second bitstream generated for the weight updates from the encoder; and([p. 4] "with DORE, we only need to transmit 2(32d b + 3 2d) bits per iteration. Thus DORE can reduce over 95% of the total communication by compressing both the gradient and model transmission")
determine whether to signal to a decoder the first bitstream generated for the prediction residuals or the second bitstream generated for the weight updates.([p. 4] "each worker node sends the compressed gradient residual (ˆ ∆k i) to the master node […] the master node broadcasts the compressed model residual with error compensation (ˆqk) to all worker nodes and updates the model" See algorithm 1 l. 10 and 11 where the system decides whether to signal either the first or second bitstreams).
However, Liu does not explicitly teach wherein the first parameter update tree and the second parameter update tree are used to produce alternative compressed representations of the same weight updates that are the parameter updates of the at least one neural network parameter tensor..
Gajjala, in the same field of endeavor, teaches wherein the first parameter update tree and the second parameter update tree are used to produce alternative compressed representations of the same weight updates that are the parameter updates of the at least one neural network parameter tensor.([pp. 21-22] "each worker maintains a local copy of the same DNN model [...] we propose three encoding techniques—(a) run-length Huffman (RLH) encoding, (b)sample Huffman (SH) encoding, and (c) sample Huffman with sparsity (SHS); to encode the quantized gradients generated from code book quantization." See FIG. 1 which shows that the encoding methods are tree structures. Gajjala teaches applying first and second Huffman-tree based encoders to the same distributed DNN update payload, thereby producing alternative compressed representations of the neural network weight updates used to update model parameters. See also Algorithm 1).
Liu as well as Gajjala are directed towards neural network weight update compression techniques. Therefore, Liu as well as Gajjala are analogous art in the same field of endeavor. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of Liu with the teachings of Gajjala by applying multiple tree based compression schemes that produce alternative representations of the weight updates. Gajjala provides as additional motivation for combination ([p. 25] “RLH, SHS and SH can improve the compressed volume by up to 5.1×, 4.32× and 3.8×”). This motivation for combination also applies to the remaining claims which depend on this combination.
Regarding claim 8, Liu teaches An apparatus comprising: at least one processor; and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to:([p. 7 §5.2] "we use 1 parameter server and 10 workers, each of which is equipped with an NVIDIA Tesla K80 GPU.")
maintain a first parameter update tree that tracks prediction residuals of weight updates of a machine learning model;([p. 4] "each worker node can keep a state variable hki to track its previous gradient information" See also FIG. 1 gradient residual. Gradient residual buffer ∆ki on each worker interpreted as first parameter update tree that tracks residuals of weight updates (hki). See also Algorithm 1 line 5.)
wherein the weight updates are parameter updates of at least one neural network parameter tensor;([p. 1] "the master node further takes the gradient descent step with the averaged gradient and broadcasts the new model parameter xk+1 to all worker nodes" [p. 3] "A vector x [...] a vector x ∈ Rd can be further decomposed into blocks, i.e., x = (x(1) ,x(2) ,··· ,x(m)) with x(l) ∈ Rdl and m l=1 dl = d, and the blocks can be compressed independently" [p. 7] "Figures 4 and 5 show the training loss and test loss for each epoch during the training of LeNet on the MNIST dataset and Resnet18 on CIFAR10 dataset" In Liu the weight parameter updates are explicitly parameter updates of vector/tensor xk+1 in Rd)
maintain a second parameter update tree that tracks the weight updates of the machine learning model;([p. 4] "the master node broadcasts the com pressed model residual with error compensation (ˆqk) to all worker nodes and updates the model […] each worker node receives the com
pressed model residual (ˆqk) and updates its model xki" See also Algorithm 1 l. 16 and 18. Model residual qk interpreted as weight update of the machine learning model, model xk (or worker coy x^ki) interpreted as second parameter update tree)
wherein the prediction residuals are used for predictive compression of the weight updates;([p. 4] "with DORE, we only need to transmit 2(32d b + 3 2d) bits per iteration. Thus DORE can reduce over 95% of the total communication by compressing both the gradient and model transmission" See Algorithm 1 l. 5-6 gradient residual buffer ∆ki (first parameter update tree) comprising gradient residual (hki) is passed to encoder Q. Liu does predictive compression by having each worker maintain a predictor state hki based on past gradients. Instead of compressing the full gradient update the worker forms a gradient residual, compresses that residual, and then updates the state with the compressed residual, which makes hki behave like an ([p. 4] "exponential moving average of the local gradient in expectation"). In other words, it's predictive compression because the algorithm first predicts the current gradient from past gradients (via a running state like an exponential moving average) and then compresses only the residual between the prediction and the true update)
receive a bitstream, the bitstream comprising encoded prediction residuals or encoded weight updates;([p. 4] "each worker node sends the compressed gradient residual (ˆ ∆k i) to the master node and updates its state hk i with ˆ ∆k" [p. 4] "with DORE, we only need to transmit 2(32d b + 3 2d) bits per iteration. Thus DORE can reduce over 95% of the total communication by compressing both the gradient and model transmission" Compressed gradient residual interpreted as first bitstream generated for the residuals from the encoder Q)
update the first parameter update tree that tracks the prediction residuals, in response to the bitstream comprising the encoded prediction residuals; and(See Algorithm 1 Algorithm 1 of Liu corresponds to one iteration of distributed SGD, FIG. 3 shows how many iterations are run, the first parameter update tree is updated at each iteration in response to the previous residuals being encoded)
update the second parameter update tree that tracks the weight updates, in response to the bitstream comprising the encoded weight updates.(See Algorithm 1 Algorithm 1 of Liu corresponds to one iteration of distributed SGD, FIG. 3 shows how many iterations are run, the second parameter update tree is updated at each iteration in response to the previous weight update being encoded).
However, Liu does not explicitly teach wherein the first parameter update tree and the second parameter update tree are used to produce alternative compressed representations of the same weight updates that are the parameter updates of the at least one neural network parameter tensor..
Gajjala, in the same field of endeavor, teaches wherein the first parameter update tree and the second parameter update tree are used to produce alternative compressed representations of the same weight updates that are the parameter updates of the at least one neural network parameter tensor.([pp. 21-22] "each worker maintains a local copy of the same DNN model [...] we propose three encoding techniques—(a) run-length Huffman (RLH) encoding, (b)sample Huffman (SH) encoding, and (c) sample Huffman with sparsity (SHS); to encode the quantized gradients generated from code book quantization." See FIG. 1 which shows that the encoding methods are tree structures. Gajjala teaches applying first and second Huffman-tree based encoders to the same distributed DNN update payload, thereby producing alternative compressed representations of the neural network weight updates used to update model parameters. See also Algorithm 1).
Liu as well as Gajjala are directed towards neural network weight update compression techniques. Therefore, Liu as well as Gajjala are analogous art in the same field of endeavor. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of Liu with the teachings of Gajjala by applying multiple tree based compression schemes that produce alternative representations of the weight updates. Gajjala provides as additional motivation for combination ([p. 25] “RLH, SHS and SH can improve the compressed volume by up to 5.1×, 4.32× and 3.8×”). This motivation for combination also applies to the remaining claims which depend on this combination.
Regarding claim 10, the combination of Liu, and Gajjala teaches The apparatus of claim 8, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: update the first parameter update tree that tracks the prediction residuals with calling a predictive prediction residual encoding encoder to calculate the prediction residuals of the weight updates, in response to the bitstream comprising the encoded weight updates, due to the prediction residuals not being available within a deep context-adaptive binary arithmetic coding decoder; and(Liu [p. 4] "each worker node sends the compressed gradient residual (ˆ ∆k i) to the master node and updates its state hk i with ˆ ∆k" [p. 4] "with DORE, we only need to transmit 2(32d b + 3 2d) bits per iteration. Thus DORE can reduce over 95% of the total communication by compressing both the gradient and model transmission" Algorithm 1 of Liu corresponds to one iteration of distributed SGD, FIG. 3 shows how many iterations are run, the first parameter update tree is updated at each iteration in response to the previous residuals being encoded. Liu does not rely on a deep context-adaptive binary arithmetic coding decoder whatsoever such that the updates are interpreted as being due to the residuals not being available within a deep context-adaptive binary arithmetic coding decoder.)
call a predictive residual encoding decoder, calculate the weight updates using the predictive residual encoding decoder, and(Liu [p. 4] "with DORE, we only need to transmit 2(32d b + 3 2d) bits per iteration. Thus DORE can reduce over 95% of the total communication by compressing both the gradient and model transmission" See Algorithm 1 l. 5-6 gradient residual buffer ∆ki (first parameter update tree) comprising gradient residual (hki) is passed to encoder Q. Liu does predictive compression by having each worker maintain a predictor state hki based on past gradients. Instead of compressing the full gradient update the worker forms a gradient residual, compresses that residual, and then updates the state with the compressed residual, which makes hki behave like an ([p. 4] "exponential moving average of the local gradient in expectation"). In other words, it's predictive compression because the algorithm first predicts the current gradient from past gradients (via a running state like an exponential moving average) and then compresses only the residual between the prediction and the true update)
update the second parameter update tree that tracks the weight updates with the weight updates call a predictive residual encoding decoder, calculate the weight updates using the predictive residual encoding decoder, in response to the bitstream comprising the encoded prediction residuals, due to the weight updates not being available within the deep context-adaptive binary arithmetic coding decoder.(Liu [p. 4] "each worker node sends the compressed gradient residual (ˆ ∆k i) to the master node and updates its state hk i with ˆ ∆k" [p. 4] "with DORE, we only need to transmit 2(32d b + 3 2d) bits per iteration. Thus DORE can reduce over 95% of the total communication by compressing both the gradient and model transmission" Algorithm 1 of Liu corresponds to one iteration of distributed SGD, FIG. 3 shows how many iterations are run, the second parameter update tree is updated at each iteration in response to the previous weight update being encoded. Liu does not rely on a deep context-adaptive binary arithmetic coding decoder whatsoever such that the updates are interpreted as being due to the residuals not being available within a deep context-adaptive binary arithmetic coding decoder.).
Regarding claim 16, the combination of Liu and Gajjala teaches The apparatus of claim 8, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: maintain a parameter update tree with metadata corresponding to a type of data;(Liu [p. 4] "each worker node can keep a state variable hki to track its previous gradient information" See also FIG. 1 and Algorithm 1. Gradient residual label/variable name (literally "gradient residual:") interpreted as metadata corresponding to a type of data.)
wherein the type of data comprises at least one prediction residual of current and past communications;(Liu [p. 4] "We also compensate the model residual compression error into next iteration to achieve a better convergence")
wherein the type of data comprises at least one weight update of current and past communications; and(Liu [p. 4] "each worker node sends the compressed gradient residual (ˆ ∆k i) to the master node and updates its state hk i with ˆ ∆k i;" [p. 4] "the master node broadcasts the compressed model residual with error compensation (ˆqk) to all worker nodes and updates the model;")
reconstruct, using the metadata, the at least one weight update corresponding to a current state with parsing the parameter update tree and using the at least one prediction residual until a last available weight update.(Liu [p. 4] "each worker node sends the compressed gradient residual (ˆ ∆k i) to the master node and updates its state hk i with ˆ ∆k i;" [p. 4] "the master node broadcasts the compressed model residual with error compensation (ˆqk) to all worker nodes and updates the model;" See also l. 11 of Algorithm 1 which is interpreted as the reconstructing the compressed update q^k. Line 11 of K=K-1 or last iteration of SGD interpreted as using the at least one residual until a last available weight update.).
Regarding claim 17, claim 17 is directed towards the method performed by claim 1. Therefore, the rejection applied to claim 1 also applies to claim 17.
Claims 2, 5, and 18 are rejected under U.S.C. §103 as being unpatentable over the combination of Liu and Gajjala and in further view of Jiang (“Model Pruning Enables Efficient Federated Learning on Edge Devices”, 2019).
Regarding claim 2, the combination of Liu and Gajjala teaches The apparatus of claim 1.
However, the combination of Liu and Gajjala doesn't explicitly teach wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: compare a size of the first bitstream generated for the prediction residuals to a size of the second bitstream generated for the weight updates; signal the first bitstream for the prediction residuals to a decoder, in response to the first size being less than the second size; and signal the second bitstream generated for the weight updates to the decoder, in response to the second size being less than or equal to the first size..
Jiang, in the same field of endeavor, teaches the instructions, when executed by the at least one processor, cause the apparatus at least to: compare a size of the first bitstream generated for the [prediction residuals] to a size of the second bitstream generated for the weight updates; signal the first bitstream for the prediction residuals to a decoder, in response to the first size being less than the second size; and signal the second bitstream generated for the weight updates to the decoder, in response to the second size being less than or equal to the first size.([p. 7 §5.2] "We implement two types of storage for sparse matrices: bitmap and value-index tuple. Bitmap uses one extra bit to indicate whether the specific value is zero. For 32 bit floating point parameter components, bitmap incurs 1/32 extra storage and communication overhead. Value-index tuple stores the values and both row and column indices of all non-zero entries. In our implementation, we use 16-bit integers to store row and column indices and 32-bit floating point numbers to store parameter values. Since each parameter component is associated with a row index and a column index, the storage and communication overhead doubles compared to storing the values only. We dynamically choose between the two ways of storage, and thus, the ratio of the sparse parameter size to the dense parameter size is min 2×d, 1 32 +d , where d is the model’s density (percentage of non-zero parameters).").
The combination of Liu and Gajjala as well as Jiang are directed towards federated deep learning. Therefore, the combination of Liu and Gajjala as well as Jiang are analogous art in the same field of endeavor. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of the combination of Liu and Gajjala with the teachings of Jiang by using the dynamic mapping function in Jiang as Q in the combination of Liu and Gajjala. Jiang provides as additional motivation for combination ([p. 8 §5.3] “To benefit from using sparse matrices in real systems, we extend the PyTorch library by implementing a more efficient sparse storage, and the support for sparse convolutional kernels in forward passes”). This motivation for combination also applies to the remaining claims which depend on this combination.
Regarding claim 5, the combination of Liu and Gajjala teaches The apparatus of claim 1.
However, the combination of Liu and Gajjala doesn't explicitly teach, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: compute a first rate distortion value corresponding to an encoded prediction residual;
compute a second rate distortion value corresponding to an encoded weight update;
signal the encoded prediction residual, in response to the first rate distortion value being less than the second rate distortion value; and signal the encoded weight update, in response to the second rate distortion value being less than or equal to the first rate distortion value.
Jiang, in the same field of endeavor, teaches the instructions, when executed by the at least one processor, cause the apparatus at least to: compute a first rate distortion value corresponding to an encoded [prediction residual];([p. 7 §5.2] "the ratio of the sparse parameter size to the dense parameter size is min 2×d, 1/32 +d" 2xd interpreted as first rate distortion value)
compute a second rate distortion value corresponding to an encoded weight update;([p. 7 §5.2] "the ratio of the sparse parameter size to the dense parameter size is min 2×d, 1/32 +d" 1/32+d interpreted as second rate distortion value)
signal the encoded prediction residual, in response to the first rate distortion value being less than the second rate distortion value; and signal the encoded weight update, in response to the second rate distortion value being less than or equal to the first rate distortion value.([p. 7 §5.2] "We implement two types of storage for sparse matrices: bitmap and value-index tuple. Bitmap uses one extra bit to indicate whether the specific value is zero. For 32 bit floating point parameter components, bitmap incurs 1/32 extra storage and communication overhead. Value-index tuple stores the values and both row and column indices of all non-zero entries. In our implementation, we use 16-bit integers to store row and column indices and 32-bit floating point numbers to store parameter values. Since each parameter component is associated with a row index and a column index, the storage and communication overhead doubles compared to storing the values only. We dynamically choose between the two ways of storage, and thus, the ratio of the sparse parameter size to the dense parameter size is min 2×d, 1 32 +d , where d is the model’s density (percentage of non-zero parameters)." Jiang explicitly selects the minimum for signaling).
The combination of Liu and Gajjala as well as Jiang are directed towards federated deep learning. Therefore, the combination of Liu and Gajjala as well as Jiang are analogous art in the same field of endeavor. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of the combination of Liu and Gajjala with the teachings of Jiang by using the dynamic mapping function in Jiang as Q in the combination of Liu and Gajjala. Jiang provides as additional motivation for combination ([p. 8 §5.3] “To benefit from using sparse matrices in real systems, we extend the PyTorch library by implementing a more efficient sparse storage, and the support for sparse convolutional kernels in forward passes”). This motivation for combination also applies to the remaining claims which depend on this combination.
Regarding claim 18, claim 18 is directed towards the method performed by claim 2. Therefore, the rejection applied to claim 2 also applies to claim 18.
Claims 3, 9, 11, 12, 13, 15, and 19 are rejected under U.S.C. §103 as being unpatentable over the combination of Liu and Gajjala and in further view of Samek (US20210065002A1).
Regarding claim 3, the combination of Liu and Gajjala teaches The apparatus of claim 1.
However, the combination of Liu and Gajjala doesn't explicitly teach wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: define an encoding flag configured to signal the first bitstream or the second bitstream to the decoder;
wherein the encoding flag comprises a value of 1 when the first bitstream for the prediction residuals is signaled to the decoder, and the encoding flag comprises a value of 0 when the second bitstream for the weight updates is signaled to the decoder.
Samek, in the same field of endeavor, teaches The apparatus of claim 1, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: define an encoding flag configured to signal the first bitstream or the second bitstream to the decoder;([¶0019] "based on the evaluation of the lossy coding of previous parameterization updates, it might be determined for each parameter of the parameterization, whether an update value is coded in a current parameterization update or not, i.e., is left uncoded. A flag may then be coded for each parameter to indicate whether for the respective parameter an update value is coded by the lossy coding by the current parameterization update, or not, and the flag may be coded using entropy coding using the determined probability of the respective parameter")
wherein the encoding flag comprises a value of 1 when the first bitstream for the prediction residuals is signaled to the decoder, and the encoding flag comprises a value of 0 when the second bitstream for the weight updates is signaled to the decoder.([¶0163] "using a flag or sign bit per coded update value" [¶0059] "The restriction does not significantly impact the learning convergence rate as non-transmitted update values due to being in the second but largest set of update values of opposite sign are likely to be transmitted in one of the cycles to come." [¶0166] " the information on which is also available to the corresponding probability estimation module 162′ at the receiver/decoder side. For instance, the probability estimation module 162 logs for each parameter 26 of parameterization 18, the membership of the corresponding coded value in the coded version 148 to the coded values 156 or the non-coded values 148, i.e., whether an update value is contained in the coded version 148 for the respective parameter 26 in a corresponding preceding cycle or not" Samek explicitly uses a bit flag (binary 0 or 1) for coded and non-coded (residual and weight update) parameters to be signaled to the decoder).
The combination of Liu and Gajjala as well as Samek are directed towards federated learning. Therefore, the combination of Liu and Gajjala as well as Samek are analogous art in the same field of endeavor. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of the combination of Liu and Gajjala with the teachings of Samek by using Samek's coding-loss-aware encoder/decoder as the encoder Q in the combination of Liu and Gajjala with the predictable result of reduced communication bits while maintaining convergence behavior. Samek provides as additional motivation for combination ([¶0169] "The embodiments described herein may be used in such framework for DDL and may extend past approaches in a manner so as to improve the communication-efficiency in distributed training. Compression at both up- and download was involved and efficient encoding and decoding of the compressed data has been featured."). This motivation for combination also applies to the remaining claims which depend on this combination.
Regarding claim 9, the combination of Liu and Gajjala teaches The apparatus of claim 8.
However, the combination of Liu and Gajjala doesn't explicitly teach, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: define a decoding flag comprising a value based on an encoding flag received from an encoder;
wherein the decoding flag comprises a value of 1 when the bitstream comprises the encoded prediction residuals, and the decoding flag comprises a value of 0 when the bitstream comprises the encoded weight updates.
Samek, in the same field of endeavor, teaches the instructions, when executed by the at least one processor, cause the apparatus at least to: define a decoding flag comprising a value based on an encoding flag received from an encoder;([¶0019] "based on the evaluation of the lossy coding of previous parameterization updates, it might be determined for each parameter of the parameterization, whether an update value is coded in a current parameterization update or not, i.e., is left uncoded. A flag may then be coded for each parameter to indicate whether for the respective parameter an update value is coded by the lossy coding by the current parameterization update, or not, and the flag may be coded using entropy coding using the determined probability of the respective parameter")
wherein the decoding flag comprises a value of 1 when the bitstream comprises the encoded prediction residuals, and the decoding flag comprises a value of 0 when the bitstream comprises the encoded weight updates.([¶0163] "using a flag or sign bit per coded update value" [¶0059] "The restriction does not significantly impact the learning convergence rate as non-transmitted update values due to being in the second but largest set of update values of opposite sign are likely to be transmitted in one of the cycles to come." [¶0166] " the information on which is also available to the corresponding probability estimation module 162′ at the receiver/decoder side. For instance, the probability estimation module 162 logs for each parameter 26 of parameterization 18, the membership of the corresponding coded value in the coded version 148 to the coded values 156 or the non-coded values 148, i.e., whether an update value is contained in the coded version 148 for the respective parameter 26 in a corresponding preceding cycle or not" Samek explicitly uses a bit flag (binary 0 or 1) for coded and non-coded (residual and weight update) parameters to be signaled to the decoder).
The combination of Liu and Gajjala as well as Samek are directed towards federated learning. Therefore, the combination of Liu and Gajjala as well as Samek are analogous art in the same field of endeavor. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of the combination of Liu and Gajjala with the teachings of Samek by using Samek's coding-loss-aware encoder/decoder as the encoder Q in the combination of Liu and Gajjala with the predictable result of reduced communication bits while maintaining convergence behavior. Samek provides as additional motivation for combination ([¶0169] "The embodiments described herein may be used in such framework for DDL and may extend past approaches in a manner so as to improve the communication-efficiency in distributed training. Compression at both up- and download was involved and efficient encoding and decoding of the compressed data has been featured."). This motivation for combination also applies to the remaining claims which depend on this combination.
Regarding claim 11, the combination of Liu and Gajjala teaches The apparatus of claim 8.
However, the combination of Liu and Gajjala doesn't explicitly teach wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: determine whether a prediction residual of a parameter has been skipped, in response to the bitstream comprising the encoded prediction residuals..
Samek, in the same field of endeavor, teaches the instructions, when executed by the at least one processor, cause the apparatus at least to: determine whether a prediction residual of a parameter has been skipped, in response to the bitstream comprising the encoded prediction residuals.([¶0019] " it might be determined for each parameter of the parameterization, whether an update value is coded in a current parameterization update or not, i.e., is left uncoded. A flag may then be coded for each parameter to indicate whether for the respective parameter an update value is coded" [¶0166] "a probability p(i) per parameter i of parameterization 18, that an update value ΔWk(i) for parameter i is comprised by the coded set of update values 156 or not (i.e. belongs to set 158)" [¶0099] "parameterization values left uncoded in the lossy coding 64 are deemed to be zero at the server." A parameter not being encoded is interpreted as synonymous with the residual of the parameter being skipped).
The combination of Liu and Gajjala as well as Samek are directed towards federated learning. Therefore, the combination of Liu and Gajjala as well as Samek are analogous art in the same field of endeavor. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of the combination of Liu and Gajjala with the teachings of Samek by using Samek's coding-loss-aware encoder/decoder as the encoder Q in the combination of Liu and Gajjala with the predictable result of reduced communication bits while maintaining convergence behavior. Samek provides as additional motivation for combination ([¶0169] "The embodiments described herein may be used in such framework for DDL and may extend past approaches in a manner so as to improve the communication-efficiency in distributed training. Compression at both up- and download was involved and efficient encoding and decoding of the compressed data has been featured."). This motivation for combination also applies to the remaining claims which depend on this combination.
Regarding claim 12, the combination of Liu, Gajjala, and Samek teaches The apparatus of claim 11, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: determine whether a previous weight update is available, in response to determining that the prediction residual of the parameter has been skipped; determine a current weight update to be a previous weight update, in response to determining that the previous weight update is available; and(Samek [¶0019] " it might be determined for each parameter of the parameterization, whether an update value is coded in a current parameterization update or not, i.e., is left uncoded. A flag may then be coded for each parameter to indicate whether for the respective parameter an update value is coded" [¶0166] "a probability p(i) per parameter i of parameterization 18, that an update value ΔWk(i) for parameter i is comprised by the coded set of update values 156 or not (i.e. belongs to set 158)" [¶0099] "parameterization values left uncoded in the lossy coding 64 are deemed to be zero at the server.")
determine the current weight update to be zero, in response to determining that the previous weight update is not available.(Samek [¶0099] "parameterization values left uncoded in the lossy coding 64 are deemed to be zero at the server.").
Regarding claim 13, the combination of Liu, Gajjala, and Samek teaches The apparatus of claim 11, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: determine whether a previous weight update is available, in response to determining that the prediction residual of the parameter has not been skipped;(Samek [¶0019] " it might be determined for each parameter of the parameterization, whether an update value is coded in a current parameterization update or not, i.e., is left uncoded. A flag may then be coded for each parameter to indicate whether for the respective parameter an update value is coded" [¶0166] "a probability p(i) per parameter i of parameterization 18, that an update value ΔWk(i) for parameter i is comprised by the coded set of update values 156 or not (i.e. belongs to set 158)" [¶0099] "parameterization values left uncoded in the lossy coding 64 are deemed to be zero at the server." A parameter not being encoded is interpreted as synonymous with the residual of the parameter being skipped)
determine a current weight update to be a previous weight update added to the prediction residual, in response to determining that the previous weight update is available; and(Samek [¶0095] "Each client, thus, completes the actual update of the parameterization setting download by internally updating the parameterization setting downloaded in the previous cycle with the currently downloaded parameterization update at 32 c such as, as depicted in FIG. 7, by adding the currently downloaded parameterization update downloaded in the current cycle to the parameterization setting Wi downloaded in the previous cycle" [¶0166] "the probability estimation module 162 logs for each parameter 26 of parameterization 18, the membership of the corresponding coded value in the coded version 148 to the coded values 156 or the non-coded values 148, i.e., whether an update value is contained in the coded version 148 for the respective parameter 26 in a corresponding preceding cycle or not")
determine the current weight update to be the prediction residual, in response to determining that the previous weight update is not available.(Liu [p. 5] "Input: Stepsizeα,β,γ,η, initializeh0=h0 i=0d,ˆx0 i=ˆx0" Algorithm 1 line 1 where at initialization (when a previous weight update is not available), the residual and weight update are equal (zero)).
Regarding claim 15, the combination of Liu, Gajjala, and Samek teaches The apparatus of claim 12, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: loop over a set of parameters to determine the current weight update for the set of parameters.(Liu [p. 4] "as the iteration approaches the optimum, hk i will also approach the local gradient ∇fi(x∗) rapidly which contributes to small gradient residual and consequently small compression variance" See also Algorithm 1).
Regarding claim 19, claim 19 is directed towards the method of claim 3. Therefore, the rejection applied to claim 3 also applies to claim 19.
Claims 4 and 20 are rejected under U.S.C. §103 as being unpatentable over the combination of Liu and Gajjala and Harviainan (US 20220005232 A1).
Regarding claim 4, the combination of Liu and Gajjala teaches The apparatus of claim 1.
However, the combination of Liu and Gajjala doesn't explicitly teach wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: determine whether an encoded prediction residual is lossy; determine whether an encoded weight update is lossy; and determine whether to signal to the decoder the first bitstream generated for the prediction residuals or the second bitstream generated for the weight updates, when the encoded weight update and/or the encoded prediction residual is lossy, based on at least one of: a first bitrate of the encoded prediction residual; a second bitrate of the encoded weight update; a first performance value computed based at least on a decoded prediction residual; or a second performance value computed based at least on a decoded weight update.
Harviainan, in the same field of endeavor, teaches the instructions, when executed by the at least one processor, cause the apparatus at least to: determine whether an encoded prediction residual is lossy; determine whether an encoded weight update is lossy; and determine whether to signal to the decoder the first bitstream generated for the prediction residuals or the second bitstream generated for the weight updates, when the encoded weight update and/or the encoded prediction residual is lossy, based on at least one of: a first bitrate of the encoded prediction residual; a second bitrate of the encoded weight update; a first performance value computed based at least on a decoded prediction residual; or a second performance value computed based at least on a decoded weight update.([¶0116] "If the point cloud data encoder process is lossy, it may be desirable to use the result of locally decoding the point cloud data (at module 1013) as input to the training module 1004 and to the neural network 1006. The weights output by the training process are encoded and written to the bitstream. A decoded version of the weights is used as input to the neural network. In the case of lossless coding of the weights, the output of the training process may be used directly, avoiding the need for an encode and decode cycle. The neural network is configured by the weights and given an input that includes the point cloud geometry data. The neural network produces predicted luma and chroma signals. Residual signals are produced by computing the difference between the original luma and predicted luma signal and also the difference between each original chroma component and corresponding predicted chroma component. The residual signals are then encoded into the bitstream. Techniques such as spatial transform coding may be applied to the residual signal as part of the residual encoding process.").
The combination of Liu and Gajjala as well as Harviainan are directed towards deep learning. Therefore, the combination of Liu and Gajjala as well as Harviainan are reasonably pertinent analogous art. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of the combination of Liu and Gajjala with the teachings of Harviainan by using a lossy encoder as the encoder Q in the combination of Liu and Gajjala and making a determination that the encoder is lossy. Harviainan provides as additional motivation for combination ([¶0114] “The colorized prediction may reduce the size of the color residual information, leading to improved compression. The color interpolator may be driven by geometry alone or by supplying a sparse set of color hints in addition to the geometry data. The colorizer model may be predefined or may be signaled via a neural network representation”). This motivation for combination also applies to the remaining claims which depend on this combination.
Regarding claim 20, claim 20 is directed towards the method performed by claim 4. Therefore, the rejection applied to claim 4 also applies to claim 20.
Claims 6 and 7 are rejected under U.S.C. §103 as being unpatentable over the combination of Liu and Gajjala and in further view of Adikari (“Compressing gradients by exploiting temporal correlation in momentum-SGD”, 2021).
Regarding claim 6, the combination of Liu and Gajjala teaches The apparatus of claim 1.
However, the combination of Liu and Gajjala doesn't explicitly teach, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: determine a first reconstruction accuracy of an encoded prediction residual;
determine a second reconstruction accuracy an encoded weight update;
signal the encoded prediction residual, in response to the first reconstruction accuracy being greater than the second reconstruction accuracy; and signal the encoded weight update, in response to the second reconstruction accuracy being greater than or equal to the first reconstruction accuracy.
Adikari, in the same field of endeavor, teaches The apparatus of claim 1, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: determine a first reconstruction accuracy of an encoded prediction residual;([p. 5] "compute a prediction of the momentum vector, compute the error in the prediction, and then input the vector of prediction errors to the quantizer" [p. 24] "In Fig. 8 we present a comparison of the loss and mean squared error 1 d et 2 with and without the Est-K predictor." Residual is prediction error at Eqn. 1c which is fed to encoder in Eqn. 1d and reconstructed in Eqn. 1e. MSE of reconstructed prediction error interpreted as first reconstruction accuracy.)
determine a second reconstruction accuracy an encoded weight update;([p. 11] "The Top-K scheme does not consist of a prediction component. Therefore, we set P to the zero function, i.e., ˆri t+1 = P(˜ri t) = 0" Plugging in ˆri t+1 into Eqn. 1f results in rit = uit (weight update only))
signal the encoded prediction residual, in response to the first reconstruction accuracy being greater than the second reconstruction accuracy; and signal the encoded weight update, in response to the second reconstruction accuracy being greater than or equal to the first reconstruction accuracy.([p. 24] "In Fig. 8 we present a comparison of the loss and mean squared error with and without the Est-K predictor. […] the prediction yields a reduction of the mean squared error close to two orders of magnitude" See also FIG. 8 where without Est-K Predictor is the weight update only and with Est-K).
The combination of Liu and Gajjala as well as Adikari are directed towards federated learning. Therefore, the combination of Liu and Gajjala as well as Adikari are analogous art in the same field of endeavor. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of the combination of Liu and Gajjala with the teachings of Adikari by modifying the compressor in the combination of Liu and Gajjala to use the error feedback in Adikari in order to reduce variance/entropy so that the compressed representation is cheaper. Adikari provides as additional motivation for combination ([p. 14 §IV] "The use of error-feedback by itself improves compression rates while keeping performance nearly the same as that realized without error-feedback"). This motivation for combination also applies to the remaining claims which depend on this combination.
Regarding claim 7, the combination of Liu and Gajjala teaches The apparatus of claim 1.
However, the combination of Liu and Gajjala doesn't explicitly teach, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: determine a first accuracy of a neural network on a validation dataset based on an encoded prediction residual; determine a second accuracy of the neural network on the validation dataset based on an encoded weight update;
signal the encoded prediction residual, in response to the first accuracy being greater than the second accuracy; and
signal the encoded weight update, in response to the second accuracy being greater than the first accuracy..
Adikari, in the same field of endeavor, teaches The apparatus of claim 1, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: determine a first accuracy of a neural network on a validation dataset based on an encoded prediction residual; determine a second accuracy of the neural network on the validation dataset based on an encoded weight update;([p. 5] "compute a prediction of the momentum vector, compute the error in the prediction, and then input the vector of prediction errors to the quantizer" [p. 14] "In Fig. 3 we compare Scaled-sign and Top-K quantizers with and without prediction. The experiments with the predictor easily outperforms the ones without the predictor" Residual is prediction error at Eqn. 1c which is fed to encoder in Eqn. 1d and reconstructed in Eqn. 1e. Accuracy of reconstructed prediction error interpreted as first accuracy of a neural network.)
signal the encoded prediction residual, in response to the first accuracy being greater than the second accuracy; and([p. 11] "The Top-K scheme does not consist of a prediction component. Therefore, we set P to the zero function, i.e., ˆri t+1 = P(˜ri t) = 0" Plugging in ˆri t+1 into Eqn. 1f results in rit = uit (weight update only))
signal the encoded weight update, in response to the second accuracy being greater than the first accuracy.([p. 14] "In Fig. 3 we compare Scaled-sign and Top-K quantizers with and without prediction. The experiments with the predictor easily outperforms the ones without the predictor" See also FIG. 3).
The combination of Liu and Gajjala as well as Adikari are directed towards federated learning. Therefore, the combination of Liu and Gajjala as well as Adikari are analogous art in the same field of endeavor. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of the combination of Liu and Gajjala with the teachings of Adikari by modifying the compressor in the combination of Liu and Gajjala to use the error feedback in Adikari in order to reduce variance/entropy so that the compressed representation is cheaper. Adikari provides as additional motivation for combination ([p. 14 §IV] "The use of error-feedback by itself improves compression rates while keeping performance nearly the same as that realized without error-feedback"). This motivation for combination also applies to the remaining claims which depend on this combination.
Claim 14 is rejected under U.S.C. §103 as being unpatentable over the combination of Liu, Gajjala, Samek, and Ozfatura (“DISTRIBUTED SPARSE SGD WITH MAJORITY VOTING”, 2020).
Regarding claim 14, the combination of Liu and Gajjala teaches The apparatus of claim 8, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: (Liu [p. 7 §5.2] "we use 1 parameter server and 10 workers, each of which is equipped with an NVIDIA Tesla K80 GPU.")
determine a weight update of a neural network parameter tensor, wherein the weight update of the neural network parameter tensor comprises one or more values;(Liu [p. 1] "the master node further takes the gradient descent step with the averaged gradient and broadcasts the new model parameter xk+1 to all worker nodes" [p. 3] "A vector x [...] a vector x ∈ Rd can be further decomposed into blocks, i.e., x = (x(1) ,x(2) ,··· ,x(m)) with x(l) ∈ Rdl and m l=1 dl = d, and the blocks can be compressed independently" [p. 7] "Figures 4 and 5 show the training loss and test loss for each epoch during the training of LeNet on the MNIST dataset and Resnet18 on CIFAR10 dataset" In Liu the weight parameter updates are explicitly parameter updates of vector/tensor xk+1 in Rd).
However, the combination of Liu and Gajjala doesn't explicitly teach determine whether at least one value of the one or more values of the weight update of the neural network parameter tensor is nonzero
update the second parameter update tree corresponding to the weight update of the parameter, in response to determining that there is at least one nonzero value of the one or more values of the weight update of the neural network parameter tensor; and
skip the neural network parameter tensor with removing the neural network parameter tensor from a list of neural network parameter tensors, in response to determining that each value of the one or more values of weight update of the neural network parameter tensor is zero.
Samek, in the same field of endeavor, teaches determine whether at least one value of the one or more values of the weight update of the neural network parameter tensor is nonzero ([¶0019] " it might be determined for each parameter of the parameterization, whether an update value is coded in a current parameterization update or not, i.e., is left uncoded. A flag may then be coded for each parameter to indicate whether for the respective parameter an update value is coded" [¶0166] "a probability p(i) per parameter i of parameterization 18, that an update value ΔWk(i) for parameter i is comprised by the coded set of update values 156 or not (i.e. belongs to set 158)" [¶0099] "parameterization values left uncoded in the lossy coding 64 are deemed to be zero at the server." A parameter not being encoded is interpreted as synonymous with the residual of the parameter being skipped)
update the second parameter update tree corresponding to the weight update of the parameter, in response to determining that there is at least one nonzero value of the one or more values of the weight update of the neural network parameter tensor; and([¶0099] "The upload of the parameterization update as transmitted by the client i at 36 a is completed by the reception at the server at 36 b. As just-described: parameterization values left uncoded in the lossy coding 64 are deemed to be zero at the server.").
The combination of Liu and Gajjala as well as Samek are directed towards federated learning. Therefore, the combination of Liu and Gajjala as well as Samek are analogous art in the same field of endeavor. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of the combination of Liu and Gajjala with the teachings of Samek by using Samek's coding-loss-aware encoder/decoder as the encoder Q in the combination of Liu and Gajjala with the predictable result of reduced communication bits while maintaining convergence behavior. Samek provides as additional motivation for combination ([¶0169] "The embodiments described herein may be used in such framework for DDL and may extend past approaches in a manner so as to improve the communication-efficiency in distributed training. Compression at both up- and download was involved and efficient encoding and decoding of the compressed data has been featured.").
However, the combination of Liu, Gajjala, and Samek does not explicitly teach skip the neural network parameter tensor with removing the neural network parameter tensor from a list of neural network parameter tensors, in response to determining that each value of the one or more values of weight update of the neural network parameter tensor is zero.
Ozfatura, in the same field of endeavor, teaches skip the neural network parameter tensor with removing the neural network parameter tensor from a list of neural network parameter tensors, in response to determining that each value of the one or more values of weight update of the neural network parameter tensor is zero([p. 2] "only the non-zero values in ˜g are transmitted to the parameter server" [p. 5] "at each iteration, each worker sends the list of indices to be removed/added to the previously voted indices. We use Kad to denote the maximum number of changes allowed during voting; that is, at each iteration at most Kad number of indices can be removed/added to the voting list by each worker").
The combination of Liu, Gajjala, and Samek as well as Ozfatura are directed towards federated learning. Therefore, the combination of Liu, Gajjala, and Samek as well as Ozfatura are analogous art in the same field of endeavor. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of the combination of Liu, Gajjala, and Samek with the teachings of Ozfatura by using a sparse list to track the model parameter updates. Ozfatura provides as additional motivation for combination ([p. 3] “we observe that sparse communication can also result in improved accuracy results; hence, we argue that a sparse communication strategy that is aligned with the sparse nature of the NN architecture may help to achieve better generalization”).
Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SIDNEY VINCENT BOSTWICK whose telephone number is (571)272-4720. The examiner can normally be reached M-F 7:30am-5:00pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang can be reached on (571)270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/SIDNEY VINCENT BOSTWICK/Examiner, Art Unit 2124
/MIRANDA M HUANG/Supervisory Patent Examiner, Art Unit 2124